On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > > OK, I ran some benchmarks here.
> > >
> > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > > 17.2% speedup using Sun's JDK 6, on Linux.  This is indexing all
> > > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > > PorterStemFilter.  I think it's worth pursuing!
> >
> > Did you try it w/o token reuse (reuse tokens only when mutating, not
> > when creating new tokens from the tokenizer)?
>
> I haven't tried this variant yet.  I guess for long filter chains the
> GC cost of the tokenizer making the initial token should go down as an
> overall fraction of the time.  Though I think we should still re-use
> the initial token since it should (?) only help.
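(For concreteness, the reuse pattern under discussion might look
roughly like this.  This is an illustrative sketch only, not Lucene's
actual classes or signatures: the consumer allocates one Token and the
tokenizer fills it in on every call, so no per-term garbage is
created.)

```java
// Hypothetical sketch of a reuse-oriented next(Token) API.  The names
// Token and next(Token) echo the proposal but the implementation here
// is made up for illustration.
class Token {
    char[] termBuffer = new char[16];
    int termLength;

    void setTermBuffer(char[] buf, int off, int len) {
        if (termBuffer.length < len) termBuffer = new char[len];
        System.arraycopy(buf, off, termBuffer, 0, len);
        termLength = len;
    }

    String term() { return new String(termBuffer, 0, termLength); }
}

class WhitespaceTokenizer {
    private final char[] text;
    private int pos;

    WhitespaceTokenizer(String s) { text = s.toCharArray(); }

    /** Fills {@code reusable} with the next term, or returns null at end. */
    Token next(Token reusable) {
        while (pos < text.length && Character.isWhitespace(text[pos])) pos++;
        if (pos == text.length) return null;
        int start = pos;
        while (pos < text.length && !Character.isWhitespace(text[pos])) pos++;
        reusable.setTermBuffer(text, start, pos - start);
        return reusable;  // same instance every call: no per-token garbage
    }
}

public class ReuseDemo {
    public static void main(String[] args) {
        WhitespaceTokenizer ts = new WhitespaceTokenizer("a quick test");
        Token t = new Token();  // allocated once by the consumer
        StringBuilder out = new StringBuilder();
        for (Token tok = ts.next(t); tok != null; tok = ts.next(t))
            out.append(tok.term()).append(' ');
        System.out.println(out.toString().trim());  // prints "a quick test"
    }
}
```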

If it weren't any slower, that would be great... but I worry about
filters that need buffering (either on the input side or the output
side) and how that interacts with filters that try to reuse.

Filters that do output buffering could be slowed down if they must
copy the token state into the passed-in token.  I like Doron's idea
that a new Token could be returned anyway.

The extra complexity seemingly involved in trying to make both
scenarios perform well is what prompts me to ask what the performance
gain will be.
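(Doron's suggestion, as I read it, could be sketched like this.  Again
the classes are illustrative, not Lucene's: a filter that produces
more tokens than it consumes keeps an output buffer and, rather than
paying to copy state into the caller's reusable token, just ignores it
and returns its own buffered instances.)

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Minimal stand-in token type for this sketch.
class Tok {
    final String term;
    Tok(String term) { this.term = term; }
}

/** A made-up filter that expands each input term into the term plus a
 *  stand-in "synonym" (its uppercase form).  Because it emits more
 *  tokens than it consumes, it buffers output; it returns its own
 *  buffered Tok instances and ignores the reusable token passed in. */
class SynonymFilter {
    private final Iterator<String> input;
    private final Deque<Tok> pending = new ArrayDeque<>();

    SynonymFilter(Iterator<String> input) { this.input = input; }

    Tok next(Tok reusable) {
        if (pending.isEmpty()) {
            if (!input.hasNext()) return null;
            String term = input.next();
            pending.add(new Tok(term));
            pending.add(new Tok(term.toUpperCase()));
        }
        return pending.poll();  // a new instance, not the passed-in one
    }
}

public class BufferDemo {
    public static void main(String[] args) {
        SynonymFilter f =
            new SynonymFilter(java.util.Arrays.asList("hot", "dog").iterator());
        Tok reusable = new Tok("");
        StringBuilder out = new StringBuilder();
        for (Tok t = f.next(reusable); t != null; t = f.next(reusable))
            out.append(t.term).append(' ');
        System.out.println(out.toString().trim());  // prints "hot HOT dog DOG"
    }
}
```

The consumer still passes a reusable token, so tokenizers and simple
mutating filters can use it; a buffering filter simply opts out.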

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
