Dominique Bejean <[email protected]> wrote:

> Hi,
>
> During a recent Solr project we needed to index documents in many
> languages. The natural solution with Lucene and Solr is to define one
> field per language. Each field is configured in the schema.xml file
> to use language-specific processing (tokenizing, stop words,
> stemmer, ...). This is really not easy to manage if you have a lot of
> languages, and it means that 1) the search interface needs to know in
> which language you are searching and 2) the search interface can't
> search in all languages at the same time.
>
> So, I decided that the only solution was to index all languages in
> only one field.
>
> Obviously, each language needs to be processed specifically. For this,
> I developed an analyzer that is in charge of redirecting content to the
> correct tokenizer, filters and stemmer according to its
> language. This analyzer is also used at query time. If the user
> specifies the language of the query, the query is processed by the
> appropriate tokenizer, filters and stemmer; otherwise the query is
> processed by a default tokenizer, filters and stemmer.
I'm not sure how much this helps. My query processing is the same as
yours, but I index each document with only a single analyzer, chosen by
language determination. With your approach, multiple analyses are all
mixed together in a single field, so I'd expect lower precision,
because words can accidentally stem to the same root in different
languages.

Bill

> With this solution:
>
> 1. I only need one field (or two if I want both stemmed and unstemmed
>    processing)
> 2. The user can search in all documents regardless of their language
>
> I hope this helps.
>
> Dominique
> www.zoonix.fr
> www.crawl-anywhere.com
>
> On 20/01/11 00:29, Bill Janssen wrote:
> > Paul Libbrecht <[email protected]> wrote:
> >
> >> I made several changes of this sort and the precision and recall
> >> measures improved, in particular in the presence of
> >> language-indication failure, which happened to be very common in
> >> our authoring environment.
> >
> > There are two kinds of failures: no language, or wrong language.
> >
> > For no language, I fall back to StandardAnalyzer, so I should have
> > results similar to yours. For wrong language, well, I'm using
> > off-the-shelf trigram-based language guessers, and they're pretty
> > good these days.
> >
> >>>> Wouldn't it be better to prefer precise matches (a field that is
> >>>> analyzed with StandardAnalyzer, for example) but also allow
> >>>> matches that are stemmed?
> >
> > Yes, I think it might improve things, but again, by how much? Stemming is
> > better than no stemming, in terms of recall. But this approach would also
> > improve precision.
> >
> > Bill
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
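
To make the trade-off discussed in this thread concrete, here is a minimal sketch in plain Python (not Lucene or Solr code) of a single-field index fed by a language-dispatching analyzer. Everything in it is an assumption made for illustration: the names `analyze`, `build_index` and `search`, the tiny stop-word lists, and the toy suffix-stripping rules, which merely stand in for real per-language stemmers.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


def strip_suffixes(suffixes: Tuple[str, ...]) -> Callable[[str], str]:
    """Build a toy stemmer that removes the first matching suffix.

    Illustrative only: a real setup would use proper per-language stemmers.
    """
    def stem(token: str) -> str:
        for suffix in suffixes:
            # Keep at least three characters so short words survive intact.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[: -len(suffix)]
        return token
    return stem


# Hypothetical per-language resources (deliberately tiny).
STOP_WORDS: Dict[str, Set[str]] = {
    "en": {"the", "a", "of"},
    "fr": {"le", "la", "les", "de", "des"},
}
STEMMERS: Dict[str, Callable[[str], str]] = {
    "en": strip_suffixes(("ing", "es", "s")),
    "fr": strip_suffixes(("es", "e", "s")),
}


def analyze(text: str, lang: Optional[str] = None) -> List[str]:
    """Dispatch to the chain for `lang`; unknown or missing languages
    fall back to a default chain (lowercase + whitespace split, no stemming)."""
    tokens = text.lower().split()
    if lang not in STEMMERS:
        return tokens
    stem = STEMMERS[lang]
    return [stem(t) for t in tokens if t not in STOP_WORDS[lang]]


def build_index(docs: Dict[str, Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Index every document into ONE shared field, whatever its language."""
    index: Dict[str, Set[str]] = {}
    for doc_id, (lang, text) in docs.items():
        for term in analyze(text, lang):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str,
           lang: Optional[str] = None) -> Set[str]:
    """Analyze the query with the same dispatcher and OR the postings."""
    hits: Set[str] = set()
    for term in analyze(query, lang):
        hits |= index.get(term, set())
    return hits


# English "chats" (conversations) and French "chats" (cats) both reduce to
# "chat" under these toy rules, so they share a posting list in the one field.
DOCS = {"d1": ("en", "the chats"), "d2": ("fr", "les chats")}
INDEX = build_index(DOCS)

print(sorted(search(INDEX, "chats", "fr")))  # ['d1', 'd2']
print(sorted(search(INDEX, "chats")))        # []
```

In this sketch a French query for "chats" also returns the English document, which is the kind of cross-language collision behind the precision concern above. The query with no language goes through the default chain, is not stemmed, and so matches nothing in the stemmed field, which is one reason a second, unstemmed field might be kept alongside it.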
