Hi Paul,

On Thu, Jul 5, 2012 at 9:21 PM, Paul Libbrecht <[email protected]> wrote:
> Savitha,
>
> I may have been evil in suggesting that page with body:
>
> This is a test page.
> We'd put some English words.
> Some typos as well: Eglish.
> Monday Tuesday Thursday Monday Monday Monday
> Et un peu de français pour embêter le monde. ("And a bit of French to annoy everyone.")
> And a little greek: lambda in greek: λαμβδα
>
> I think this is a pathological case and we could ignore it.

I agree that this is not a representative use case. Better would be to create the same page in English only, with a couple of translations.

> Why are you saying that "in this case I could use the multilingual analyzer"?
> The stemmer you suggest below is very likely to have unexpected issues, I have the impression.
>
> However, a "neutral text field" (I called it a multilingual field) would make sense: no analysis beyond token separation and lowercasing. A dismax configuration would prefer a match in the neutral text field (thus preferring unstemmed matches) to a stemmed match.
>
> What do others feel?
> Would it be useful to employ a strategy that would work for many languages within the same page, as opposed to a language per translation?

Given the scope of a GSoC, I'd say no. The two use cases I see on projects are the following:

- Wiki in one language -> use the right stemmer (if the wiki is set up in French, use the French stemmer by default)
- Wiki with multilingual activated -> search documents that match the context language (with the right stemmer, obviously) and let the user expand to other languages if no match is found in the context language

The several-languages-in-one-page use case has been pretty much nonexistent in my experience.

Guillaume

> Thanks in advance,
>
> Paul
>
> On 5 Jul 2012, at 04:27, savitha sundaramurthy wrote:
>
>> Hello Paul,
>>
>> I completely understand your point. But I'm wondering about indexing a wiki page which has multiple languages in it. For example:
>> http://ec2-50-19-181-163.compute-1.amazonaws.com:8080/xwiki/bin/view/SearchTest/ARandomPage
>>
>> I'm thinking of a way to find the list of languages used in the page, and if more than two languages exist, I could use a multilingual field type.
>>
>> Sample configuration snippet (title_ml, space_ml, fulltext_ml; ml for multilingual):
>>
>> <!-- Multilingual -->
>> <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer>
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" enablePositionIncrements="true"/>
>>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>>     <filter class="solr.FrenchLightStemFilterFactory"/>
>>     <filter class="solr.SpanishLightStemFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> The list of analyzers should match the languages supported by the XWiki instance.
>>
>> If the possibility of a language detection tool is ruled out, I'm quite lost on how to find out whether an XWiki document has two or more languages in it (not referring to translations of the wiki page).
>>
>> Thanks a lot,
>> Savitha S.
>>
>> On Wed, Jul 4, 2012 at 12:21 AM, Paul Libbrecht <[email protected]> wrote:
>>
>>> Savitha,
>>>
>>> Multilingual pages are expected to be made of document translations: each page's content is in one language, which the author indicates and your indexer can read. This should be your primary source of language detection, and you should not need an automatic language detector, which is highly error-prone.
>>>
>>> Your analyzers seem to be correct, and I feel it is correct to index languages in different fields.
>>> I would recommend that you also use a default-text field (text_intl) which is only mildly tokenized (whitespace, lowercase, ...) and that you add search into this field with a much lower boost.
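For reference, such a neutral field type might look like this in schema.xml (a minimal sketch of what Paul describes, not a tested configuration; the name text_intl is his):

  <!-- Neutral/default text field: token separation and lowercasing only,
       no stopwords and no stemming, so it behaves the same for every language. -->
  <fieldType name="text_intl" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Because it neither stems nor drops stopwords, an exact-word match in it is meaningful for any language, which is what makes the dismax preference trick work.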
>>> As you say, you need "pre-processing of queries": I call this query expansion, but whatever the name, I fully agree this is a necessary step, one that is insufficiently documented (on the Solr side) and one that should be subclassable by applications.
>>>
>>> A part of it which is nicely documented is the Edismax qf parameter. It can contain, for example:
>>>
>>>   title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1
>>>
>>> You configure it in solrconfig.xml, which should also be adjustable, I think.
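Wired into solrconfig.xml, that could look roughly as follows (a sketch only, reusing the boosts from Paul's example; the handler name is arbitrary):

  <!-- Edismax handler boosting per-language fields above the neutral ones. -->
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="qf">
        title_en^3 title_fr^2 title_es^1.8 title_intl^1.7
        text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1
      </str>
    </lst>
  </requestHandler>

An application that wants to adjust the expansion per request can also pass qf as a query parameter instead of relying on these defaults.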
>>> I am still fearing that faceting by language is going to fail, because you would need to consider an XWiki page in multiple languages as multiple documents in the search results, which the user does not want (and which would break the principle of being a translation).
>>>
>>> Paul
>>>
>>> On 4 Jul 2012, at 07:05, savitha sundaramurthy wrote:
>>>
>>>> Hi devs,
>>>>
>>>> Here are my thoughts on the configuration for multilingual support.
>>>>
>>>> Solr uses different analyzers and stemmers to index wiki content. This is configured in an XML file, schema.xml.
>>>>
>>>> Wiki content in English is indexed with the text_en field type, whereas French content uses text_fr. The language of the document is fetched and appended to the field name (fieldName + "_" + language: title_en, fulltext_en, space_en).
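In schema.xml, this naming convention could be captured with dynamic fields instead of one declaration per field (a sketch; the suffix-to-type mapping is my assumption, not part of the proposal):

  <!-- Any field ending in _en or _fr is analyzed with the matching language type. -->
  <dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
  <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>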
>>>>
>>>> Configurations below:
>>>>
>>>> <!-- English -->
>>>> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>>>>   <analyzer type="index">
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>>>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> <!-- French -->
>>>> <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
>>>>   <analyzer>
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <!-- removes l', etc. -->
>>>>     <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" enablePositionIncrements="true"/>
>>>>     <filter class="solr.FrenchLightStemFilterFactory"/>
>>>>     <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
>>>>     <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> In the case of a document having multilingual text, say English and French, there is no way to find the list of languages used in the document. Is it good to use a language detection tool, http://code.google.com/p/language-detection/, to get the list of languages, and, if there are more than two, use a multilingual field type (title_ml, space_ml, fulltext_ml; ml for multilingual)?
>>>>
>>>> <!-- Multilingual -->
>>>> <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
>>>>   <analyzer>
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" enablePositionIncrements="true"/>
>>>>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>>>>     <filter class="solr.FrenchLightStemFilterFactory"/>
>>>>     <filter class="solr.SpanishLightStemFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> The list of analyzers should match the languages supported by the XWiki instance.
>>>>
>>>> I'm planning to use language detection only to check whether text from multiple languages exists. I will investigate whether it is possible to configure the analyzers on the fly based on the languages returned by the language-detection tool.
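As a side note, if automatic detection does turn out to be needed, recent Solr releases bundle that same language-detection library as an update processor, which might spare doing the detection in XWiki code. A rough, untested sketch for solrconfig.xml (the field names are placeholders, and it assumes the langid contrib is on the classpath):

  <!-- Detect the language of incoming documents and store it in a "lang" field. -->
  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">title,fulltext</str>
      <str name="langid.langField">lang</str>
      <str name="langid.fallback">en</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>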
>>>>
>>>> Please suggest if this is the right approach.
>>>>
>>>> --
>>>> Thanks,
>>>> Savitha.s
>>
>> --
>> Thanks,
>> Savi

_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs
