Re: [xwiki-devs] [GSoC] Solr multilingual support.

savitha sundaramurthy Wed, 04 Jul 2012 19:28:11 -0700

Hello Paul,

I completely understand your point, but I'm wondering how to index a wiki
page that has multiple languages in it. For example:
http://ec2-50-19-181-163.compute-1.amazonaws.com:8080/xwiki/bin/view/SearchTest/ARandomPage


I'm thinking of a way to find the list of languages used in the page; if
more than one language is present, I could use a multilingual field type.

Sample configuration snippet (the fields would be title_ml, space_ml and
fulltext_ml, where "ml" stands for multilingual):

<!-- Multilingual -->
<fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- only the French stop word list is applied here -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"
            enablePositionIncrements="true"/>
    <!-- one stemmer per supported language, applied in sequence -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

The list of analysers should match the languages supported by the XWiki
instance.
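
As a rough sketch of how those field names could map onto the field types,
assuming dynamic fields are used inside the <fields> section of schema.xml
(the indexed/stored settings below are only my assumption, nothing we have
agreed on):

```xml
<!-- Sketch only: route each field to an analyzer by its language suffix. -->
<!-- Single-language content, e.g. title_en, fulltext_fr -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
<!-- Mixed-language content: title_ml, space_ml, fulltext_ml -->
<dynamicField name="*_ml" type="text_ml" indexed="true" stored="true"/>
```

With something like this, the indexer would only have to pick the suffix, and
adding a language would mean adding one field type and one dynamic field.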

If a language detection tool is ruled out, I'm quite lost on how to find out
whether an XWiki document has two or more languages in it (not referring to
translations of the wiki page).

Thanks a lot,
Savitha S.

On Wed, Jul 4, 2012 at 12:21 AM, Paul Libbrecht <[email protected]> wrote:

> Savitha,
>
> Multilingual pages are expected to be made of document translations: each
> translation's content is in one language, which the author indicates and your
> indexer can read. This should be your primary source of language detection,
> and you should not need an automatic language detector, which is highly
> error-prone.
>
> Your analyzers seem to be correct and I feel it is correct to index
> languages in different fields.
> I would recommend that you also use a default-text field (text_intl) which
> is only mildly tokenized (whitespace, lowercase, ...) and that you add
> search into this field with much lower boost.
>
> As you say, you need "pre-processing of queries": I call this query
> expansion but whatever the name I fully agree this is a necessary step, and
> one that is insufficiently documented (on the solr side) and one that
> should be subclassable by applications.
>
> A part of it which is nicely documented is the Edismax qf parameters. It
> can contain, for example:
>   title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5
> text_fr^1.4 text_es^1.3 text_intl^1
> You configure it in solrconfig.xml, which should also be adjustable, I
> think.
>
> I still fear that faceting by language is going to fail, because you
> would need to treat an XWiki page in multiple languages as multiple documents
> in the search results, which the user does not want (and which would break
> the principle of being a translation).
>
> Paul
>
>
>
>
>
>
> On 4 Jul 2012, at 07:05, savitha sundaramurthy wrote:
>
> > Hi devs,
> >
> > Here are my thoughts on the configuration for multilingual support.
> >
> > Solr uses different analysers and stemmers to index wiki content. These are
> > configured in an XML file, schema.xml.
> >
> > Wiki content in English is indexed with the text_en field type, whereas
> > French content is indexed with the text_fr field type. The language of the
> > document is fetched and appended to the field name (fieldName + "_" +
> > language: title_en, fulltext_en, space_en).
> >
> > Configurations below:
> >
> > <!-- English -->
> >    <fieldType name="text_en" class="solr.TextField"
> > positionIncrementGap="100">
> >      <analyzer type="index">
> >        <tokenizer class="solr.StandardTokenizerFactory"/>
> >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt" enablePositionIncrements="true" />
> >        <filter class="solr.SynonymFilterFactory"
> > synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.EnglishMinimalStemFilterFactory"/>
> >      </analyzer>
> >    </fieldType>
> >
> > <!-- French -->
> > <fieldType name="text_fr" class="solr.TextField"
> > positionIncrementGap="100">
> > <analyzer>
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <!-- removes l', etc -->
> >     <filter class="solr.ElisionFilterFactory" ignoreCase="true"
> > articles="lang/contractions_fr.txt"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="lang/stopwords_fr.txt" format="snowball"
> > enablePositionIncrements="true"/>
> >     <filter class="solr.FrenchLightStemFilterFactory"/>
> >     <!-- less aggressive: <filter
> > class="solr.FrenchMinimalStemFilterFactory"/> -->
> >     <!-- more aggressive: <filter
> > class="solr.SnowballPorterFilterFactory" language="French"/> -->
> >  </analyzer>
> > </fieldType>
> >
> >
> > A document may also contain text in several languages, say English and
> > French. There is currently no way to find the list of languages used in
> > the document. Would it be good to use a language detection tool,
> > http://code.google.com/p/language-detection/, to get the list of languages
> > and, if there is more than one, use a multilingual field type?
> >
> > The fields would be title_ml, space_ml and fulltext_ml, where "ml" stands
> > for multilingual.
> >
> > <!-- Multilingual -->
> > <fieldType name="text_ml" class="solr.TextField"
> > positionIncrementGap="100">
> > <analyzer>
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <!-- removes l', etc -->
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="lang/stopwords_fr.txt" format="snowball"
> > enablePositionIncrements="true"/>
> >     <filter class="solr.EnglishMinimalStemFilterFactory"/>
> >     <filter class="solr.FrenchLightStemFilterFactory"/>
> >     <filter class="solr.SpanishLightStemFilterFactory"/>
> >  </analyzer>
> > </fieldType>
> >
> > The list of analysers should match the languages supported by the XWiki
> > instance.
> >
> > I am planning to use language detection only to check whether text from
> > multiple languages exists. I will investigate whether it is possible to
> > configure the analysers on the fly, based on the languages returned by the
> > language-detection tool.
> >
> > Please let me know if this is the right approach.
> >
> > --
> > Thanks,
> > Savitha.s
> > _______________________________________________
> > devs mailing list
> > [email protected]
> > http://lists.xwiki.org/mailman/listinfo/devs
>



-- 
Thanks,
Savi
