Hi, Nick,

Awesome stuff coming through on the new Lucy::Analysis::StandardTokenizer!

On Mon, Dec 05, 2011 at 09:02:42PM -0000, [email protected] wrote:
> PolyAnalyzer*
> PolyAnalyzer_new(const CharBuf *language, VArray *analyzers) {
> @@ -43,7 +43,7 @@ PolyAnalyzer_init(PolyAnalyzer *self, co
>      else if (language) {
>          self->analyzers = VA_new(3);
>          VA_Push(self->analyzers, (Obj*)CaseFolder_new());
> -        VA_Push(self->analyzers, (Obj*)RegexTokenizer_new(NULL));
> +        VA_Push(self->analyzers, (Obj*)StandardTokenizer_new());
>          VA_Push(self->analyzers, (Obj*)SnowStemmer_new(language));
>      }

This will cause a backwards compatibility break.

I really want to make your StandardTokenizer the default, but I think we might
want to go about it differently. How about we leave PolyAnalyzer alone, but
add a new class called "EasyAnalyzer", with the following default stack:

  1. StandardTokenizer
  2. Normalizer
  3. SnowballStemmer

This integrates both of your recent contributions. It also changes the order
to avoid the Highlighter problems you identified and to be more in line with
the potential refactoring you talked about. It would be nice to benchmark
this, just to see what sort of performance impact changing the order has,
before we finalize it.

If this works out, we can then swap out PolyAnalyzer for EasyAnalyzer
throughout the tutorial and other high-level documentation.

Marvin Humphrey
