Hi Paul,

On Thu, Jul 5, 2012 at 9:21 PM, Paul Libbrecht <[email protected]> wrote:
> Savitha,
>
> I may have been evil in suggesting that page with body:
>
> This is a test page.
> We'd put some English words.
> Some typos as well: Eglish.
> Monday Tuesday Thursday Monday Monday Monday
> Et un peu de français pour embêter le monde. ("And a bit of French to annoy everyone.")
> And a little greek: lambda in greek: λαμβδα
>
> I think this is a pathological case and we could ignore it.

I agree that this is not a representative use case. Better would be to create the same page in English only, with a couple of translations.

> Why are you saying that "in this case I could use the multilingual analyzer"?
> The stemmer you suggest below is very likely to have unexpected issues, I have the impression.
>
> However, a "neutral text field" (I called it a multilingual field) would make sense: no analysis beyond token separation and lowercasing. A dismax configuration would prefer a match in the neutral text field (thus preferring unstemmed matches) to a stemmed match.
>
> What do others feel?
> Would it be useful to employ a strategy that would work for many languages within the same page, as opposed to a language per translation?

Given the scope of a GSoC, I'd say no. The two use cases I see on projects are the following:

- Wiki in one language -> use the right stemmer (if the wiki is set up in French, use the French stemmer by default)
- Wiki with multilingual activated -> search documents that match the context language (with the right stemmer, obviously) and let the user expand to other languages if no match is found in the context language

The several-languages-in-one-page use case has been pretty much nonexistent in my experience.

Guillaume

> Thanks in advance,
>
> Paul
>
> On 5 Jul 2012, at 04:27, savitha sundaramurthy wrote:
>
>> Hello Paul,
>>
>> I completely understand your point. But I'm wondering about indexing a wiki page which has multiple languages in it. For example:
>> http://ec2-50-19-181-163.compute-1.amazonaws.com:8080/xwiki/bin/view/SearchTest/ARandomPage
>>
>> I'm thinking of a way to find the list of languages used in the page, and if more than two languages exist, I could use a multilingual field type.
>>
>> Sample configuration snippet (title_ml, space_ml, fulltext_ml; ml for multilingual):
>>
>> <!-- Multilingual -->
>> <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer>
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" enablePositionIncrements="true"/>
>>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>>     <filter class="solr.FrenchLightStemFilterFactory"/>
>>     <filter class="solr.SpanishLightStemFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> The list of analyzers should match the languages supported by the XWiki instance.
>>
>> If the possibility of a language detection tool is ruled out, I'm quite lost on how to find out whether an XWiki document has two or more languages in it (not referring to translations of the wiki page).
>>
>> Thanks a lot,
>> Savitha S.
>>
>> On Wed, Jul 4, 2012 at 12:21 AM, Paul Libbrecht <[email protected]> wrote:
>>
>>> Savitha,
>>>
>>> Multilingual pages are expected to be made of document translations: each page's content is in one language, which the author indicates and your indexer can read. This should be your primary source of language detection, and you should not need an automatic language detector, which is highly error-prone.
>>>
>>> Your analyzers seem to be correct, and I feel it is correct to index languages in different fields.
>>> I would recommend that you also use a default-text field (text_intl) which is only mildly tokenized (whitespace, lowercase, ...) and that you add search into this field with a much lower boost.
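For reference, such a neutral field type might look like this in schema.xml (a minimal sketch of what Paul describes, not a tested configuration; the name text_intl is his):

  <!-- Neutral/default text field: token separation and lowercasing only,
       no stopwords and no stemming, so it behaves the same for every language. -->
  <fieldType name="text_intl" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Because it neither stems nor drops stopwords, an exact-word match in it is meaningful for any language, which is what makes the dismax preference trick work.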
>>> As you say, you need "pre-processing of queries": I call this query expansion, but whatever the name, I fully agree this is a necessary step, one that is insufficiently documented (on the Solr side) and one that should be subclassable by applications.
>>>
>>> A part of it which is nicely documented is the Edismax qf parameter. It can contain, for example:
>>>
>>>   title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1
>>>
>>> You configure it in solrconfig.xml, which should also be adjustable, I think.
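Wired into solrconfig.xml, that could look roughly as follows (a sketch only, reusing the boosts from Paul's example; the handler name is arbitrary):

  <!-- Edismax handler boosting per-language fields above the neutral ones. -->
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="qf">
        title_en^3 title_fr^2 title_es^1.8 title_intl^1.7
        text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1
      </str>
    </lst>
  </requestHandler>

An application that wants to adjust the expansion per request can also pass qf as a query parameter instead of relying on these defaults.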
>>> I am still fearing that faceting by language is going to fail, because you would need to consider an XWiki page in multiple languages as multiple documents in the search results, which the user does not want (and which would break the principle of being a translation).
>>>
>>> Paul
>>>
>>> On 4 Jul 2012, at 07:05, savitha sundaramurthy wrote:
>>>
>>>> Hi devs,
>>>>
>>>> Here are my thoughts on the configuration for multilingual support.
>>>>
>>>> Solr uses different analyzers and stemmers to index wiki content. This is configured in an XML file, schema.xml.
>>>>
>>>> Wiki content in English is indexed with the text_en field type, whereas French content uses text_fr. The language of the document is fetched and appended to the field name (fieldName + "_" + language: title_en, fulltext_en, space_en).
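In schema.xml, this naming convention could be captured with dynamic fields instead of one declaration per field (a sketch; the suffix-to-type mapping is my assumption, not part of the proposal):

  <!-- Any field ending in _en or _fr is analyzed with the matching language type. -->
  <dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
  <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>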
>>>>
>>>> Configurations below:
>>>>
>>>> <!-- English -->
>>>> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>>>>   <analyzer type="index">
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>>>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> <!-- French -->
>>>> <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
>>>>   <analyzer>
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <!-- removes l', etc. -->
>>>>     <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" enablePositionIncrements="true"/>
>>>>     <filter class="solr.FrenchLightStemFilterFactory"/>
>>>>     <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
>>>>     <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> In the case of a document having multilingual text, say English and French, there is no way to find the list of languages used in the document. Is it good to use a language detection tool, http://code.google.com/p/language-detection/, to get the list of languages, and, if there are more than two, use a multilingual field type (title_ml, space_ml, fulltext_ml; ml for multilingual)?
>>>>
>>>> <!-- Multilingual -->
>>>> <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
>>>>   <analyzer>
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" enablePositionIncrements="true"/>
>>>>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>>>>     <filter class="solr.FrenchLightStemFilterFactory"/>
>>>>     <filter class="solr.SpanishLightStemFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> The list of analyzers should match the languages supported by the XWiki instance.
>>>>
>>>> I'm planning to use language detection only to check whether text from multiple languages exists. I will investigate whether it is possible to configure the analyzers on the fly based on the languages returned by the language-detection tool.
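As a side note, if automatic detection does turn out to be needed, recent Solr releases bundle that same language-detection library as an update processor, which might spare doing the detection in XWiki code. A rough, untested sketch for solrconfig.xml (the field names are placeholders, and it assumes the langid contrib is on the classpath):

  <!-- Detect the language of incoming documents and store it in a "lang" field. -->
  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">title,fulltext</str>
      <str name="langid.langField">lang</str>
      <str name="langid.fallback">en</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>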
>>>>
>>>> Please suggest if this is the right approach.
>>>>
>>>> --
>>>> Thanks,
>>>> Savitha.s
>>
>> --
>> Thanks,
>> Savi

_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs
