On 7/25/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > > > > OK, I ran some benchmarks here.
> > > > >
> > > > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > > > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
> > > > > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > > > > PorterStemFilter. I think it's worth pursuing!
> > > >
> > > > Did you try it w/o token reuse (reuse tokens only when mutating, not
> > > > when creating new tokens from the tokenizer)?
> > >
> > > I haven't tried this variant yet. I guess for long filter chains the
> > > GC cost of the tokenizer making the initial token should go down as
> > > overall part of the time. Though I think we should still re-use the
> > > initial token since it should (?) only help.
> >
> > If it weren't any slower, that would be great... but I worry about
> > filters that need buffering (either on the input side or the output
> > side) and how that interacts with filters that try and reuse.
>
> OK I will tease out this effect & measure performance impact.
>
> This would mean that the tokenizer must not only produce new Token
> instance for each term but also cannot re-use the underlying char[]
> buffer in that token, right?

If the tokenizer can actually change the contents of the char[], then
yes, it seems like when next() is called rather than next(Token), a
new char[] would need to be allocated.

> EG with mods for CharTokenizer I re-use
> its "char[] buffer" with every Token, but I'll change that to be a new
> buffer for each Token for this test.

It's not just for a test, right? If next() is called, it can't reuse
the char[]. And there is no getting around the fact that some
tokenizers will need to call next() because of buffering.
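
To make the concern concrete, here is a rough sketch of the two entry
points and what goes wrong when a buffering consumer holds on to reused
tokens. SketchToken, SketchTokenizer and BufferingDemo are hypothetical
names made up for illustration; this is not the actual Token or
CharTokenizer code, just a minimal whitespace tokenizer under the
assumption that next(Token) may hand back a shared char[].

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token; NOT Lucene's actual Token class.
class SketchToken {
    char[] buf; // term characters (may be shared with the tokenizer)
    int len;    // number of valid chars in buf

    String term() { return new String(buf, 0, len); }
}

// Hypothetical whitespace tokenizer exposing both entry points.
class SketchTokenizer {
    private final String input;
    private int pos = 0;
    // Scratch buffer reused on every call; terms longer than this are split.
    private final char[] shared = new char[255];

    SketchTokenizer(String input) { this.input = input; }

    // Reuse path: the returned token points at the shared buffer, so it is
    // only valid until the next call.
    SketchToken next(SketchToken reusable) {
        while (pos < input.length() && input.charAt(pos) == ' ') pos++;
        if (pos >= input.length()) return null;
        int n = 0;
        while (pos < input.length() && input.charAt(pos) != ' ' && n < shared.length) {
            shared[n++] = input.charAt(pos++);
        }
        reusable.buf = shared;
        reusable.len = n;
        return reusable;
    }

    // Non-reuse path: the caller may hold the result, so hand back a fresh
    // token with its own private copy of the characters.
    SketchToken next() {
        SketchToken t = next(new SketchToken());
        if (t == null) return null;
        char[] copy = new char[t.len];
        System.arraycopy(t.buf, 0, copy, 0, t.len);
        t.buf = copy;
        return t;
    }
}

// Hypothetical consumer that buffers every token before reading any term.
class BufferingDemo {
    static List<String> bufferedTerms(String text, boolean reuse) {
        SketchTokenizer tok = new SketchTokenizer(text);
        SketchToken scratch = new SketchToken();
        List<SketchToken> held = new ArrayList<SketchToken>();
        SketchToken t;
        while ((t = reuse ? tok.next(scratch) : tok.next()) != null) {
            held.add(t);
        }
        List<String> terms = new ArrayList<String>();
        for (SketchToken h : held) terms.add(h.term());
        return terms;
    }
}
```

With the reuse path, every buffered entry is the same object backed by
the same char[], so earlier terms are overwritten by later ones; with
next() each buffered token keeps its own characters.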
-Yonik