Re: [xwiki-devs] [GSoC] Solr multilingual support.

savitha sundaramurthy Fri, 06 Jul 2012 12:34:53 -0700

 Paul and Guillaume,

Thanks for pointing that out. The idea of a neutral text field also looks
good; I'm implementing it.
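
A minimal sketch of what such a neutral field type could look like in
schema.xml, assuming the name text_intl from Paul's earlier message and an
analyzer limited to token separation and lowercasing (no stop words, no
stemming). The two field declarations and their attributes are assumptions
for illustration, following the title_en / fulltext_en naming pattern:

```xml
<!-- Neutral (language-independent) text: tokenize and lowercase only -->
<fieldType name="text_intl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical fields using the neutral type -->
<field name="title_intl" type="text_intl" indexed="true" stored="true"/>
<field name="fulltext_intl" type="text_intl" indexed="true" stored="true"/>
```

These fields would then be listed in the qf parameter next to the
per-language fields; the relative boosts are still to be decided.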


On Fri, Jul 6, 2012 at 1:45 AM, Guillaume Lerouge <[email protected]> wrote:

> Hi Paul,
>
> On Thu, Jul 5, 2012 at 9:21 PM, Paul Libbrecht <[email protected]> wrote:
>
> > Savitha,
> >
> > I may have been evil in suggesting that page with this body:
> > > This is a test page.
> > > We'd put some English words.
> > > Some typos as well: Eglish.
> > > Monday Tuesday Thursday Monday Monday Monday
> > > Et un peu de français pour embêter le monde.
> > > And a little greek: lambda in greek: λαμβδα
> >
> > I think this is a pathological case and we could ignore it.
> >
>
> I agree that this is not a representative use case. It would be better to
> create the same page in English only, with a couple of translations.
>
>
> > Why are you saying that "in this case I could use the multilingual
> > analyzer"?
> > I have the impression that the stemmer chain you suggest below is very
> > likely to cause unexpected issues.
> >
> > However a "neutral text field" (I called it a multilingual field) would
> > make sense: no analysis beyond token-separation and lowercasing. A dismax
> > configuration would prefer a match in the neutral-text-field (thus
> > preferring unstemmed matches) to a stemmed match.
> >
> > What do others feel?
> > Would it be useful to employ a strategy that would work for many languages
> > within the same page as opposed to a language per translation?
> >
>
> Given the scope of a GSoC, I'd say no. The 2 use cases I see on projects
> are the following:
>
>    - Wiki in one language -> use the right stemmer (if the wiki is set up in
>    French, use the French stemmer by default)
>    - Wiki with multilingual activated -> search documents that match the
>    context language (with the right stemmer obviously) and let the user
>    expand to other languages if no match is found in the context language
>
> The several-languages-in-one-page use case has been pretty much nonexistent
> in my experience.
>
> Guillaume
>
> > thanks in advance
> >
> > Paul
> >
> >
> > On 5 Jul 2012, at 04:27, savitha sundaramurthy wrote:
> >
> > > Hello Paul,
> > >
> > > I completely understand your point, but I'm wondering about
> > > indexing a wiki page which has multiple languages in it.
> > > For example:
> > >
> > > http://ec2-50-19-181-163.compute-1.amazonaws.com:8080/xwiki/bin/view/SearchTest/ARandomPage
> > >
> > > I'm thinking of a way to find the list of languages used in the page,
> > > and if more than two languages exist, I could use a multilingual field
> > > type.
> > >
> > > Sample configuration snippet:
> > >
> > > title_ml, space_ml, fulltext_ml, ml for multilingual.
> > >
> > > <!-- Multilingual -->
> > > <fieldType name="text_ml" class="solr.TextField"
> > >            positionIncrementGap="100">
> > > <analyzer>
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <!-- removes l', etc -->
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >             words="lang/stopwords_fr.txt" format="snowball"
> > >             enablePositionIncrements="true"/>
> > >     <filter class="solr.EnglishMinimalStemFilterFactory"/>
> > >     <filter class="solr.FrenchLightStemFilterFactory"/>
> > >     <filter class="solr.SpanishLightStemFilterFactory"/>
> > > </analyzer>
> > > </fieldType>
> > >
> > > The list of analysers should match the languages supported by the XWiki
> > > instance.
> > >
> > > If a language detection tool is ruled out, I'm quite lost on how to find
> > > out whether an XWiki document has two or more languages in it (not
> > > referring to translations of the wiki page).
> > >
> > > Thanks a lot,
> > > Savitha S.
> > >
> > > On Wed, Jul 4, 2012 at 12:21 AM, Paul Libbrecht <[email protected]> wrote:
> > >
> > >> Savitha,
> > >>
> > >> Multilingual pages are expected to be made of document translations:
> > >> each page's content is in one language, which the author indicates and
> > >> your indexer can read. This should be your primary source of language
> > >> detection, and you should not need an automatic language detector,
> > >> which is highly error-prone.
> > >>
> > >> Your analyzers seem to be correct and I feel it is correct to index
> > >> languages in different fields.
> > >> I would recommend that you also use a default-text field (text_intl)
> > >> which is only mildly tokenized (whitespace, lowercase, ...) and that you
> > >> also search in this field, with a much lower boost.
> > >>
> > >> As you say, you need "pre-processing of queries": I call this query
> > >> expansion, but whatever the name, I fully agree this is a necessary
> > >> step, one that is insufficiently documented (on the Solr side) and one
> > >> that should be subclassable by applications.
> > >>
> > >> One part of it that is nicely documented is the Edismax qf parameter.
> > >> It can contain, for example:
> > >>  title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5
> > >> text_fr^1.4 text_es^1.3 text_intl^1
> > >> You configure it in solrconfig.xml, which should also be adjustable,
> > >> I think.
> > >>
> > >> I still fear that faceting by language is going to fail because you
> > >> need to treat an XWiki page in multiple languages as multiple documents
> > >> in the search results, which the user does not want (and which would
> > >> break the principle of being a translation).
> > >>
> > >> Paul
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On 4 Jul 2012, at 07:05, savitha sundaramurthy wrote:
> > >>
> > >>> Hi devs,
> > >>>
> > >>> Here are my thoughts on the configuration for multilingual support.
> > >>>
> > >>> Solr uses different analysers and stemmers to index wiki content. This
> > >>> is configured in an XML file, schema.xml.
> > >>>
> > >>> Wiki content in English is indexed with the text_en field type,
> > >>> whereas French content uses the text_fr field type. The language of the
> > >>> document is fetched and appended to the field name
> > >>> (fieldName + "_" + language: title_en, fulltext_en, space_en).
> > >>>
> > >>> Configurations below:
> > >>>
> > >>> <!-- English -->
> > >>>   <fieldType name="text_en" class="solr.TextField"
> > >>> positionIncrementGap="100">
> > >>>     <analyzer type="index">
> > >>>       <tokenizer class="solr.StandardTokenizerFactory"/>
> > >>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >>> words="stopwords.txt" enablePositionIncrements="true" />
> > >>>       <filter class="solr.SynonymFilterFactory"
> > >>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> > >>>       <filter class="solr.LowerCaseFilterFactory"/>
> > >>>       <filter class="solr.EnglishMinimalStemFilterFactory"/>
> > >>>     </analyzer>
> > >>>   </fieldType>
> > >>>
> > >>> <!-- French -->
> > >>> <fieldType name="text_fr" class="solr.TextField"
> > >>> positionIncrementGap="100">
> > >>> <analyzer>
> > >>>    <tokenizer class="solr.StandardTokenizerFactory"/>
> > >>>    <!-- removes l', etc -->
> > >>>    <filter class="solr.ElisionFilterFactory" ignoreCase="true"
> > >>> articles="lang/contractions_fr.txt"/>
> > >>>    <filter class="solr.LowerCaseFilterFactory"/>
> > >>>    <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >>> words="lang/stopwords_fr.txt" format="snowball"
> > >>> enablePositionIncrements="true"/>
> > >>>    <filter class="solr.FrenchLightStemFilterFactory"/>
> > >>>    <!-- less aggressive: <filter
> > >>> class="solr.FrenchMinimalStemFilterFactory"/> -->
> > >>>    <!-- more aggressive: <filter
> > >>> class="solr.SnowballPorterFilterFactory" language="French"/> -->
> > >>> </analyzer>
> > >>> </fieldType>
> > >>>
> > >>>
> > >>> In the case of a document having multilingual text, say English and
> > >>> French, there is no way to find the list of languages used in the
> > >>> document. Is it a good idea to use a language detection tool,
> > >>> http://code.google.com/p/language-detection/, to get the list of
> > >>> languages and, if there are more than two, use a multilingual field
> > >>> type?
> > >>>
> > >>> title_ml, space_ml, fulltext_ml, ml for multilingual.
> > >>>
> > >>> <!-- Multilingual -->
> > >>> <fieldType name="text_ml" class="solr.TextField"
> > >>> positionIncrementGap="100">
> > >>> <analyzer>
> > >>>    <tokenizer class="solr.StandardTokenizerFactory"/>
> > >>>    <!-- removes l', etc -->
> > >>>    <filter class="solr.LowerCaseFilterFactory"/>
> > >>>    <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >>> words="lang/stopwords_fr.txt" format="snowball"
> > >>> enablePositionIncrements="true"/>
> > >>>    <filter class="solr.EnglishMinimalStemFilterFactory"/>
> > >>>    <filter class="solr.FrenchLightStemFilterFactory"/>
> > >>>    <filter class="solr.SpanishLightStemFilterFactory"/>
> > >>> </analyzer>
> > >>> </fieldType>
> > >>>
> > >>> The list of analysers should match the languages supported by the
> > >>> XWiki instance.
> > >>>
> > >>> I am planning to use language detection only to check whether text
> > >>> from multiple languages exists. I will investigate whether it's
> > >>> possible to configure the analysers on the fly based on the languages
> > >>> returned by the language-detection tool.
> > >>>
> > >>> Please let me know whether this is the right approach.
> > >>>
> > >>> --
> > >>> Thanks,
> > >>> Savitha.s
> > >>> _______________________________________________
> > >>> devs mailing list
> > >>> [email protected]
> > >>> http://lists.xwiki.org/mailman/listinfo/devs
> > >>
> > >>
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Savi



-- 
Thanks,
Savi
