PolyAnalyzer.pm

Marvin Humphrey Mon, 05 Dec 2011 13:44:40 -0800

Hi, Nick,

Awesome stuff coming through on the new Lucy::Analysis::StandardTokenizer!

On Mon, Dec 05, 2011 at 09:02:42PM -0000, [email protected] wrote:
>  PolyAnalyzer*
>  PolyAnalyzer_new(const CharBuf *language, VArray *analyzers) {
> @@ -43,7 +43,7 @@ PolyAnalyzer_init(PolyAnalyzer *self, co
>      else if (language) {
>          self->analyzers = VA_new(3);
>          VA_Push(self->analyzers, (Obj*)CaseFolder_new());
> -        VA_Push(self->analyzers, (Obj*)RegexTokenizer_new(NULL));
> +        VA_Push(self->analyzers, (Obj*)StandardTokenizer_new());
>          VA_Push(self->analyzers, (Obj*)SnowStemmer_new(language));
>      }

This will cause a backwards compatibility break.  I really want to make your
StandardTokenizer the default, but I think we might want to go about it
differently.

How about we leave PolyAnalyzer alone, but add a new class called
"EasyAnalyzer", with the following default stack:

    1. StandardTokenizer
    2. Normalizer
    3. SnowballStemmer

This integrates both of your recent contributions and changes the order to
avoid the Highlighter problems you identified and to be more in line with the
potential refactoring you talked about.
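
To make the ordering concrete, here is a minimal, self-contained sketch of a
tokenize-first pipeline. It does not use the Lucy API: the Token struct and
the tokenize/normalize/stem/analyze functions are hypothetical stand-ins for
StandardTokenizer, Normalizer and SnowballStemmer, and the "stemmer" just
strips a trailing 's'. Because tokenization runs first, each token's offsets
refer to the original text, which is the kind of property a highlighter
relies on. Whether that matches the exact Highlighter problem discussed is an
assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS    16
#define MAX_TOKEN_LEN 32

/* A token keeps its text plus the offsets it had in the ORIGINAL input. */
typedef struct {
    char   text[MAX_TOKEN_LEN];
    size_t start;   /* inclusive offset into the original string */
    size_t end;     /* exclusive offset into the original string */
} Token;

/* Stage 1 (stand-in for StandardTokenizer): split on non-alphanumerics. */
static size_t
tokenize(const char *input, Token *tokens, size_t max_tokens) {
    size_t count = 0;
    size_t i     = 0;
    size_t len   = strlen(input);
    while (i < len && count < max_tokens) {
        while (i < len && !isalnum((unsigned char)input[i])) { i++; }
        if (i >= len) { break; }
        size_t start = i;
        while (i < len && isalnum((unsigned char)input[i])) { i++; }
        size_t tok_len = i - start;
        if (tok_len >= MAX_TOKEN_LEN) { tok_len = MAX_TOKEN_LEN - 1; }
        memcpy(tokens[count].text, input + start, tok_len);
        tokens[count].text[tok_len] = '\0';
        tokens[count].start = start;
        tokens[count].end   = i;
        count++;
    }
    return count;
}

/* Stage 2 (stand-in for Normalizer): ASCII lowercasing only. */
static void
normalize(Token *tokens, size_t count) {
    for (size_t t = 0; t < count; t++) {
        for (char *p = tokens[t].text; *p; p++) {
            *p = (char)tolower((unsigned char)*p);
        }
    }
}

/* Stage 3 (stand-in for SnowballStemmer): strip one trailing 's'. */
static void
stem(Token *tokens, size_t count) {
    for (size_t t = 0; t < count; t++) {
        size_t len = strlen(tokens[t].text);
        if (len > 3 && tokens[t].text[len - 1] == 's') {
            tokens[t].text[len - 1] = '\0';
        }
    }
}

/* Run the three stages in the proposed order: tokenize, normalize, stem. */
size_t
analyze(const char *input, Token *tokens, size_t max_tokens) {
    assert(input != NULL && tokens != NULL);
    size_t count = tokenize(input, tokens, max_tokens);
    normalize(tokens, count);
    stem(tokens, count);
    return count;
}
```

Later stages only rewrite each token's text, so start and end still index
into the string the caller passed in, however much normalization or stemming
changes the token itself.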

It would be nice to benchmark this just to see what sort of performance impact
changing the order has before we finalize it.

If this works out, we can then swap out PolyAnalyzer for EasyAnalyzer
throughout the tutorial and other high-level documentation.

Marvin Humphrey

[lucy-dev] Re: [lucy-commits] svn commit: r1210630 - in /incubator/lucy/branches/LUCY-196-uax-tokenizer: core/Lucy/Analysis/PolyAnalyzer.c core/Lucy/Analysis/PolyAnalyzer.cfh perl/lib/Lucy/Analysis/PolyAnalyzer.pm
