Savitha,

Multilingual pages are expected to be made of document translations: each 
translation of the page is in one language, which the author indicates and 
which your indexer can read. This should be your primary source of language 
information, and you should not need an automatic language detector, which is 
highly error-prone.

Your analyzers seem correct, and I agree that indexing each language in a 
separate field is the right approach.
I would also recommend a default text field (text_intl) which is only mildly 
analyzed (whitespace tokenization, lowercasing, ...) and which you search as 
well, with a much lower boost.
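
For illustration, here is a minimal sketch of what such a text_intl field type 
could look like in schema.xml; the exact filters are only my assumption of what 
"mildly analyzed" could mean, not a tested configuration:

  <!-- Language-neutral catch-all type (sketch only) -->
  <fieldType name="text_intl" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- split on whitespace, no language-specific stemming -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- lowercase so matching is case-insensitive -->
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- optionally fold accents so "café" also matches "cafe" -->
      <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
  </fieldType>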

As you say, you need "pre-processing of queries"; I call this query expansion, 
but whatever the name, I fully agree it is a necessary step, one that is 
insufficiently documented (on the Solr side) and one that should be 
subclassable by applications.

One part of it that is nicely documented is the EDisMax qf parameter. It can 
contain, for example:
  title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5 text_fr^1.4 
  text_es^1.3 text_intl^1
You configure it in solrconfig.xml, which should also be made adjustable, I 
think.
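
For reference, a minimal sketch of how that could be declared in a request 
handler in solrconfig.xml (the handler name "/search" and the boost values are 
just an example, not something XWiki defines):

  <requestHandler name="/search" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="qf">
        title_en^3 title_fr^2 title_es^1.8 title_intl^1.7
        text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1
      </str>
    </lst>
  </requestHandler>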

I still fear that faceting by language is going to fail, because it requires 
treating an XWiki page available in multiple languages as multiple documents 
in the search results, which the user does not want (and which would break the 
principle that they are translations of one page).

Paul






On 4 July 2012, at 07:05, savitha sundaramurthy wrote:

> Hi devs,
> 
> Here are my thoughts on the configuration for multilingual support.
> 
> Solr uses different analysers and stemmers to index wiki content. This is
> configured in an XML file, schema.xml.
> 
> Wiki content in English is indexed with the text_en field type, whereas
> French content uses the text_fr field type. The language of the document is
> fetched and appended to the field name (fieldName + "_" + language: title_en,
> fulltext_en, space_en).
> 
> Configurations below:
> 
> <!-- English -->
> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>             ignoreCase="true" expand="false"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> <!-- French -->
> <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <!-- removes l', etc. -->
>     <filter class="solr.ElisionFilterFactory" ignoreCase="true"
>             articles="lang/contractions_fr.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="lang/stopwords_fr.txt" format="snowball"
>             enablePositionIncrements="true"/>
>     <filter class="solr.FrenchLightStemFilterFactory"/>
>     <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
>     <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
>   </analyzer>
> </fieldType>
> 
> 
> In the case of a document having multilingual text, say English and French,
> there is no way to find the list of languages used in the document.
> Would it be good to use a language detection tool,
> http://code.google.com/p/language-detection/, to get the list of languages
> and, if more than one is found, use a multilingual field type?
> 
> title_ml, space_ml, fulltext_ml (ml for multilingual).
> 
> <!-- Multilingual -->
> <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="lang/stopwords_fr.txt" format="snowball"
>             enablePositionIncrements="true"/>
>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>     <filter class="solr.FrenchLightStemFilterFactory"/>
>     <filter class="solr.SpanishLightStemFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> The list of analysers should match the languages supported by the XWiki
> instance.
> 
> I am planning to use language detection only to check whether text from
> multiple languages exists. I will investigate whether it is possible to
> configure the analysers on the fly based on the languages returned by the
> language-detection tool.
> 
> Please suggest whether this is the right approach.
> 
> -- 
> Thanks,
> Savitha.s

_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs
