Sure, and it matches my original understanding of what should be done. I had doubts when I saw the modules directory with Snowball in it.
Lucene appears to support a lot more languages out of the box. Are there plans for Lucy to support other languages? (Even if Lucy has no formal notion of language support, this is an important part of IR.) Thanks for all your help so far.

On Tue, Mar 31, 2015 at 1:42 PM, Marvin Humphrey <[email protected]> wrote:

> On Tue, Mar 31, 2015 at 5:32 AM, Bruno Albuquerque <[email protected]>
> wrote:
>
> > On a related question, Lucy relies on Snowball for language support
> > (normalization, stemming, stopwords), but Snowball has a very limited
> > set of languages it supports. What would be the best way to add support
> > for new languages?
>
> There's no canonical form of "language support" in Lucy. There are only
> Analyzers which happen to be tuned for content in a specific language.
>
> What Analyzers do is tokenize and normalize content. You start with a
> Unicode text string. Let's say it's the following:
>
> Eats, Shoots and Leaves.
>
> If you perform no analysis, the only search which will match that field
> is the exact term query `Eats, Shoots and Leaves.` -- because there's
> only one entry in the term dictionary and that's it.
>
> # Tokens produced by analysis chain and stored in index:
> ['Eats, Shoots and Leaves.']
>
> If you use an Analyzer which only splits on whitespace, you become able
> to search for individual terms, but your searches will be case-sensitive
> and punctuation will get in the way. For example, a search for `Leaves`
> will fail but a search for `Leaves.` will succeed.
>
> ['Eats,', 'Shoots', 'and', 'Leaves.']
>
> If you use an Analyzer which splits on whitespace and is intelligent
> about removing punctuation, that problem is solved.
>
> ['Eats', 'Shoots', 'and', 'Leaves']
>
> If you add case folding to the analysis chain, then searches for both
> `leaves` and `Leaves` will succeed.
> ['eats', 'shoots', 'and', 'leaves']
>
> (Note that no matter which Analyzer you use, the same transform must be
> applied at search time in order to match.)
>
> If you add an English Snowball stemmer, then searches for both `leaves`
> and `leave` will match (though not `leaf`, which stems to `leaf` under
> Snowball EN).
>
> ['eat', 'shoot', 'and', 'leave']
>
> So... to implement "language support" for another language, you need to
> create an Analyzer which implements a Transform() method which applies
> tokenization and normalization appropriate for that language.
>
> Does that make sense?
>
> Marvin Humphrey
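For what it's worth, the analysis chain Marvin walks through can be sketched in plain Python. This is only a rough stand-in, not Lucy's API: a real Lucy Analyzer is a subclass implementing Transform() over a token stream, and `toy_stem` below is a naive suffix-stripper that merely reproduces the example tokens, not a real Snowball stemmer.

```python
import re

def toy_stem(token):
    # Toy stand-in for a stemmer: strip a plural-looking trailing 's'.
    # Snowball applies many more suffix rules than this.
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def transform(text):
    # Hypothetical analogue of an Analyzer's Transform(): tokenize on
    # word characters (dropping punctuation), case-fold, then stem.
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(t) for t in tokens]

text = "Eats, Shoots and Leaves."

print([text])                            # no analysis: one giant term
print(text.split())                      # whitespace only: 'Eats,', 'Leaves.'
print(re.findall(r"\w+", text))          # punctuation stripped
print(re.findall(r"\w+", text.lower()))  # plus case folding
print(transform(text))                   # plus toy stemming
```

The last line prints ['eat', 'shoot', 'and', 'leave'], matching the final token list above; at search time the same transform() would be applied to the query terms so that `leaves` and `leave` hit the same entry.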
