Savitha,

I may have been mischievous in suggesting that page, with this body:

> This is a test page.
> We'd put some English words.
> Some typos as well: Eglish.
> Monday Tuesday Thursday Monday Monday Monday
> Et un peu de français pour embêter le monde. ("And a bit of French to annoy everyone.")
> And a little Greek: lambda in Greek: λαμβδα
I think this is a pathological case and we could ignore it. Why do you say that "in this case I could use the multilingual analyzer"? I have the impression that the stemmers you suggest below are very likely to have unexpected issues. However, a "neutral text field" (I called it a multilingual field) would make sense: no analysis beyond token separation and lowercasing. A dismax configuration would then prefer a match in the neutral text field (thus preferring unstemmed matches) over a stemmed match.

What do others feel? Would it be useful to employ a strategy that works for many languages within the same page, as opposed to one language per translation?

Thanks in advance,
Paul

On 5 July 2012, at 04:27, savitha sundaramurthy wrote:

> Hello Paul,
>
> I completely understand your point. But I'm wondering about
> indexing a wiki page which has multiple languages in it.
> For example:
> http://ec2-50-19-181-163.compute-1.amazonaws.com:8080/xwiki/bin/view/SearchTest/ARandomPage
>
> I'm thinking of a way to find the list of languages used in the page and, if
> more than two languages exist, I could use a multilingual field type.
>
> Sample configuration snippet:
>
> title_ml, space_ml, fulltext_ml (ml for multilingual).
>
> <!-- Multilingual -->
> <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="lang/stopwords_fr.txt" format="snowball" enablePositionIncrements="true"/>
>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>     <filter class="solr.FrenchLightStemFilterFactory"/>
>     <filter class="solr.SpanishLightStemFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> The list of analysers should match the languages supported by the XWiki
> instance.
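[Editor's note: to make the "neutral text field" idea above concrete, here is a minimal sketch. The field and type names (text_neutral, fulltext_neutral) and the boost values are illustrative, not from the thread; the analyzer does only token separation and lowercasing, as described.]

```xml
<!-- Sketch only: a language-neutral field type with no stemming or stopwords -->
<fieldType name="text_neutral" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="fulltext_neutral" type="text_neutral" indexed="true" stored="false"/>
```

A dismax/edismax query could then weight this field above the stemmed ones, e.g. qf="fulltext_neutral^2 fulltext_en^1 fulltext_fr^1", so that an exact, unstemmed match outranks a stemmed one.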
>
> If the possibility of a language detection tool is ruled out, I'm quite lost
> on how to find whether an XWiki document has two or more languages in it (not
> referring to translations of the wiki page).
>
> Thanks a lot,
> Savitha S.
>
> On Wed, Jul 4, 2012 at 12:21 AM, Paul Libbrecht <[email protected]> wrote:
>
>> Savitha,
>>
>> Multilingual pages are expected to be made of document translations: each
>> page content is in one language, which the author indicates and your
>> indexer can read. This should be your primary source of language detection,
>> and you should not need an automatic language detector, which is highly
>> error-prone.
>>
>> Your analyzers seem to be correct, and I feel it is right to index
>> languages in different fields.
>> I would recommend that you also use a default-text field (text_intl) which
>> is only mildly tokenized (whitespace, lowercase, ...) and that you add
>> search in this field with a much lower boost.
>>
>> As you say, you need "pre-processing of queries": I call this query
>> expansion, but whatever the name, I fully agree this is a necessary step,
>> one that is insufficiently documented (on the Solr side) and one that
>> should be subclassable by applications.
>>
>> A part of it which is nicely documented is the edismax qf parameter. It
>> can contain, for example:
>>   title_en^3 title_fr^2 title_es^1.8 title_intl^1.7 text_en^1.5
>>   text_fr^1.4 text_es^1.3 text_intl^1
>> You configure it in solrconfig.xml, which should also be adjustable, I
>> think.
>>
>> I still fear that faceting by language is going to fail, because you would
>> need to consider an XWiki page in multiple languages as multiple documents
>> in the search results, which the user does not want (and which would break
>> the principle of being a translation).
>>
>> Paul
>>
>> On 4 July 2012, at 07:05, savitha sundaramurthy wrote:
>>
>>> Hi devs,
>>>
>>> Here are my thoughts on the configuration for multilingual support.
>>>
>>> Solr uses different analysers and stemmers to index wiki content. This is
>>> configured in an XML file, schema.xml.
>>>
>>> Wiki content in English is indexed with the text_en field type,
>>> whereas French content uses text_fr. The language of the document is
>>> fetched and appended to the field name (fieldName + "_" + language: title_en,
>>> fulltext_en, space_en).
>>>
>>> Configurations below:
>>>
>>> <!-- English -->
>>> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="stopwords.txt" enablePositionIncrements="true"/>
>>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>>>             ignoreCase="true" expand="false"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> <!-- French -->
>>> <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
>>>   <analyzer>
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <!-- removes l', etc. -->
>>>     <filter class="solr.ElisionFilterFactory" ignoreCase="true"
>>>             articles="lang/contractions_fr.txt"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="lang/stopwords_fr.txt" format="snowball"
>>>             enablePositionIncrements="true"/>
>>>     <filter class="solr.FrenchLightStemFilterFactory"/>
>>>     <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
>>>     <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
>>>   </analyzer>
>>> </fieldType>
>>>
>>> In the case of a document with multilingual text, say English and French,
>>> there is no way to find the list of languages used in the document.
>>> Is it good to use a language detection tool,
>>> http://code.google.com/p/language-detection/, to get the list of languages,
>>> and if there are more than two, use a multilingual field type?
>>>
>>> title_ml, space_ml, fulltext_ml (ml for multilingual).
>>>
>>> <!-- Multilingual -->
>>> <fieldType name="text_ml" class="solr.TextField" positionIncrementGap="100">
>>>   <analyzer>
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="lang/stopwords_fr.txt" format="snowball"
>>>             enablePositionIncrements="true"/>
>>>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>>>     <filter class="solr.FrenchLightStemFilterFactory"/>
>>>     <filter class="solr.SpanishLightStemFilterFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> The list of analysers should match the languages supported by the XWiki
>>> instance.
>>>
>>> I am planning to use language detection only to check whether text from
>>> multiple languages exists. I will investigate if it's possible to configure the
>>> analysers on the fly based on the languages returned by the
>>> language-detection tool.
>>>
>>> Please suggest if this is the right approach.
>>>
>>> --
>>> Thanks,
>>> Savitha.s
>>> _______________________________________________
>>> devs mailing list
>>> [email protected]
>>> http://lists.xwiki.org/mailman/listinfo/devs
>
> --
> Thanks,
> Savi
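[Editor's note: for reference, the edismax qf parameters Paul mentions above would typically be set as defaults on a request handler in solrconfig.xml. A hedged sketch follows; the handler name and the boost values are taken from his example but are otherwise illustrative, not an actual XWiki configuration.]

```xml
<!-- Sketch only: an edismax handler with per-language field boosts -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">
      title_en^3 title_fr^2 title_es^1.8 title_intl^1.7
      text_en^1.5 text_fr^1.4 text_es^1.3 text_intl^1
    </str>
  </lst>
</requestHandler>
```

Keeping the boosts in solrconfig.xml rather than in client code makes them adjustable without redeploying the application, which matches Paul's remark that the configuration "should also be adjustable".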

