Savitha,

I may have been mischievous in suggesting that page, with this body:

> This is a test page.
> We'd put some English words.
> Some typos as well: Eglish.
> Monday Tuesday Thursday Monday Monday Monday
> Et un peu de français pour embêter le monde. ("And a bit of French to annoy everyone.")
> And a little Greek: lambda in Greek: λαμβδα
I think this is a pathological case and we could ignore it. Why do you say that "in this case I could use the multilingual analyzer"? I have the impression that the stemmers you suggest below are very likely to have unexpected issues. However, a "neutral text field" (I called it a multilingual field) would make sense: no analysis beyond token separation and lowercasing. A dismax configuration would then prefer a match in the neutral text field (thus preferring unstemmed matches) over a stemmed match.

What do others feel? Would it be useful to employ a strategy that works for many languages within the same page, as opposed to one language per translation?

Thanks in advance,
Paul

On 5 July 2012, at 04:27, savitha sundaramurthy wrote:

> Hello Paul,
>
> I completely understand your point. But I'm wondering about
> indexing a wiki page which has multiple languages in it.
> For example:
> http://ec2-50-19-181-163.compute-1.amazonaws.com:8080/xwiki/bin/view/SearchTest/ARandomPage
>
> I'm thinking of a way to find the list of languages used in the page and, if
> more than two languages exist, I could use a multilingual field type.
>
> Sample configuration snippet:
>
> title_ml, space_ml, fulltext_ml (ml for multilingual).
>
> <!-- Multilingual -->
> <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="lang/stopwords_fr.txt" format="snowball" enablePositionIncrements="true"/>
>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>     <filter class="solr.FrenchLightStemFilterFactory"/>
>     <filter class="solr.SpanishLightStemFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> The list of analysers should match the languages supported by the XWiki
> instance.
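[Editor's note: to make the "neutral text field" idea above concrete, here is a minimal sketch. The field and type names (text_neutral, fulltext_neutral) and the boost values are illustrative, not from the thread; the analyzer does only token separation and lowercasing, as described.]

```xml
<!-- Sketch only: a language-neutral field type with no stemming or stopwords -->
<fieldType name="text_neutral" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="fulltext_neutral" type="text_neutral" indexed="true" stored="false"/>
```

A dismax/edismax query could then weight this field above the stemmed ones, e.g. qf="fulltext_neutral^2 fulltext_en^1 fulltext_fr^1", so that an exact, unstemmed match outranks a stemmed one.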
>
> If the possibility of a language detection tool is ruled out, I'm quite lost
> on how to find whether an XWiki document has two or more languages in it (not
> referring to translations of the wiki page).
>
> Thanks a lot,
> Savitha S.
>
> On Wed, Jul 4, 2012 at 12:21 AM, Paul Libbrecht <[email protected]> wrote:
>
>> Savitha,
>>
>> Multilingual pages are expected to be made of document translations: each
>> page content is in one language, which the author indicates and your
>> indexer can read. This should be your primary source of language detection,
>> and you should not need an automatic language detector, which is highly
>> error-prone.
>>
>> Your analyzers seem to be correct, and I feel it is right to index
>> languages in different fields.
>> I would recommend that you also use a default-text field (text_intl) which
>> is only mildly tokenized (whitespace, lowercase, ...) and that you add
>> search in this field with a much lower boost.
>>
>> As you say, you need "pre-processing of queries": I call this query
>> expansion, but whatever the name, I fully agree this is a necessary step,
>> one that is insufficiently documented (on the Solr side) and one that
>> should be subclassable by applications.
>>
>> A part of it which is nicely documented is the edismax qf parameter. It
>> can contain, for example:
>>   title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5
>>   text_fr^1.4 text_es^1.3 text_intl^1
>> You configure it in solrconfig.xml, which should also be adjustable, I
>> think.
>>
>> I still fear that faceting by language is going to fail, because you would
>> need to consider an XWiki page in multiple languages as multiple documents
>> in the search results, which the user does not want (and which would break
>> the principle of being a translation).
>>
>> Paul
>>
>> On 4 July 2012, at 07:05, savitha sundaramurthy wrote:
>>
>>> Hi devs,
>>>
>>> Here are my thoughts on the configuration for multilingual support.
>>>
>>> Solr uses different analysers and stemmers to index wiki content. This is
>>> configured in an XML file, schema.xml.
>>>
>>> Wiki content in English is indexed with the text_en field type,
>>> whereas French content uses text_fr. The language of the document is
>>> fetched and appended to the field name (fieldName + "_" + language: title_en,
>>> fulltext_en, space_en).
>>>
>>> Configurations below:
>>>
>>> <!-- English -->
>>> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="stopwords.txt" enablePositionIncrements="true"/>
>>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>>>             ignoreCase="true" expand="false"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> <!-- French -->
>>> <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
>>>   <analyzer>
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <!-- removes l', etc. -->
>>>     <filter class="solr.ElisionFilterFactory" ignoreCase="true"
>>>             articles="lang/contractions_fr.txt"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="lang/stopwords_fr.txt" format="snowball"
>>>             enablePositionIncrements="true"/>
>>>     <filter class="solr.FrenchLightStemFilterFactory"/>
>>>     <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
>>>     <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
>>>   </analyzer>
>>> </fieldType>
>>>
>>> In the case of a document with multilingual text, say English and French,
>>> there is no way to find the list of languages used in the document.
>>> Is it good to use a language detection tool,
>>> http://code.google.com/p/language-detection/, to get the list of languages,
>>> and if there are more than two, use a multilingual field type?
>>>
>>> title_ml, space_ml, fulltext_ml (ml for multilingual).
>>>
>>> <!-- Multilingual -->
>>> <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
>>>   <analyzer>
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="lang/stopwords_fr.txt" format="snowball"
>>>             enablePositionIncrements="true"/>
>>>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>>>     <filter class="solr.FrenchLightStemFilterFactory"/>
>>>     <filter class="solr.SpanishLightStemFilterFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> The list of analysers should match the languages supported by the XWiki
>>> instance.
>>>
>>> I am planning to use language detection only to check whether text from
>>> multiple languages exists. I will investigate if it's possible to configure the
>>> analysers on the fly based on the languages returned by the
>>> language-detection tool.
>>>
>>> Please suggest if this is the right approach.
>>>
>>> --
>>> Thanks,
>>> Savitha.s
>>> _______________________________________________
>>> devs mailing list
>>> [email protected]
>>> http://lists.xwiki.org/mailman/listinfo/devs
>
> --
> Thanks,
> Savi
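[Editor's note: for reference, the edismax qf parameters Paul mentions above would typically be set as defaults on a request handler in solrconfig.xml. A hedged sketch follows; the handler name and the boost values are taken from his example but are otherwise illustrative, not an actual XWiki configuration.]

```xml
<!-- Sketch only: an edismax handler with per-language field boosts -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">
      title_en^3 title_fr^2 title_es^1.8 title_intl^1.7
      text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1
    </str>
  </lst>
</requestHandler>
```

Keeping the boosts in solrconfig.xml rather than in client code makes them adjustable without redeploying the application, which matches Paul's remark that the configuration "should also be adjustable".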

