On 7/21/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
To further improve "out of the box" performance I would really also
like to fix the core analyzers, when possible, to re-use a single
Token instance for each Token they produce.  This would then mean no
objects are created as you step through Tokens in the TokenStream
... so this would give the best performance.
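
For concreteness, a re-using tokenizer would look something like this
(sketch only -- the no-arg Token constructor and the setTermBuffer /
setStartOffset / setEndOffset calls are guesses at the termBuffer-style
API, not committed signatures):

    import java.io.IOException;
    import java.io.Reader;

    // Sketch: a whitespace tokenizer that hands back the same Token
    // instance on every call instead of allocating a new one per token.
    public class ReusingWhitespaceTokenizer extends TokenStream {
      private final Reader input;
      private final Token reusableToken = new Token();
      private final char[] termBuffer = new char[256];  // long tokens truncated
      private int offset = 0;

      public ReusingWhitespaceTokenizer(Reader input) {
        this.input = input;
      }

      public Token next() throws IOException {
        int length = 0;
        int start = offset;
        int c;
        while ((c = input.read()) != -1) {
          offset++;
          if (Character.isWhitespace((char) c)) {
            if (length > 0) break;  // end of the current token
            start = offset;         // still skipping leading whitespace
          } else if (length < termBuffer.length) {
            termBuffer[length++] = (char) c;
          }
        }
        if (length == 0) return null;  // end of stream
        // Overwrite the single Token instead of constructing a fresh one.
        reusableToken.setTermBuffer(termBuffer, 0, length);
        reusableToken.setStartOffset(start);
        reusableToken.setEndOffset(start + length);
        return reusableToken;
      }
    }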

How much better I wonder?  Small object allocation & reclaiming is
supposed to be very good in current JVMs.
You got good performance out of reusing Field objects, I think,
partially because of the convoluted logic-tree in the constructor,
including interning the field name.

Token is a much lighter weight class, with no unpredictable branches,
less state (except we increased it with the termBuffer stuff ;-) and
no string interning (which presumably includes synchronization).

The problem is, this would be a sneaky API change and would for
example break anyone who gathers the Tokens into an array and
processes them later (eg maybe highlighter does?).

I think there is probably quite a bit of code out there that needs to
look at tokens in context (and thus does some sort of buffering).
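
For example, a filter along these lines (made up purely for
illustration) would silently break if the producer started recycling
one Token, unless it copied every token it buffers:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: buffers the whole stream so tokens can be
    // examined in context before they are emitted downstream.
    public class BufferingFilter extends TokenFilter {
      private final List buffered = new ArrayList();  // holds Tokens

      public BufferingFilter(TokenStream input) {
        super(input);
      }

      public Token next() throws IOException {
        if (buffered.isEmpty()) {
          Token t;
          while ((t = input.next()) != null) {
            // With a recycling producer every entry below would point at
            // the same, constantly overwritten object; the filter would
            // have to store a copy of t (clone it, or copy its fields
            // into a new Token) instead of the reference itself.
            buffered.add(t);
          }
          if (buffered.isEmpty()) return null;
          // ... look at the buffered tokens in context here ...
        }
        return (Token) buffered.remove(0);
      }
    }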

Maybe one way to migrate to this would be to add a new method "Token
nextDirect()" to TokenStream which is allowed to return a Token that
may be recycled.  This means callers of nextDirect() must make their
own copy of the Token if they intend to keep it for later usage.  It
would default to "return next()" and I would then default "next()" to
call nextDirect() but make a full copy of what's
returned. DocumentsWriter would use this to step through the tokens.

Analyzers would then implement either next() or nextDirect().  We
would fix all core analyzers to implement nextDirect(), and then all
other analyzers would remain backwards compatible.

From a caller's standpoint the only difference between nextDirect()
and next() is that next() guarantees to give you a "full private copy"
of the Token but nextDirect() does not.
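
In code the back-compat defaults would look roughly like this (a sketch
only; the copy uses the old String-based Token accessors, and a concrete
stream has to override at least one of the two methods or the mutual
defaults would just call each other forever):

    import java.io.IOException;

    // Sketch of the proposal, not real Lucene code.
    public abstract class TokenStream {

      // May return a recycled Token; callers must copy it themselves if
      // they want to keep it beyond the next call.
      public Token nextDirect() throws IOException {
        // Legacy streams only implement next(), which already returns a
        // private copy, so handing that out directly is safe.
        return next();
      }

      // Always returns a full private copy of the Token.
      public Token next() throws IOException {
        // New-style streams only implement nextDirect(); defensively copy
        // its (possibly recycled) result before giving it to the caller.
        Token t = nextDirect();
        if (t == null) return null;
        Token copy = new Token(t.termText(), t.startOffset(),
                               t.endOffset(), t.type());
        copy.setPositionIncrement(t.getPositionIncrement());
        return copy;
      }

      public void close() throws IOException {}
    }

DocumentsWriter would then step through tokens via nextDirect() and see
no per-token allocation from the core analyzers, while anything still
calling next() keeps the copy semantics it has today.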

We could also punt on this entirely since it is always possible for
people to make their own analyzers that re-use Tokens, but I think
having decent "out of the box" performance with Lucene is important.
Ie, Lucene's defaults should be set up so that you get decent
performance without changing anything; you should not have to work
hard to get good performance and unfortunately today you still do....

I'd be surprised if we can gain meaningful performance with token
reuse in a typical full-text analysis chain (stemming, stop-words,
etc).   Have you done any measurements, or is it a hunch at this
point?

Hmmm ... is it too late to make all of Token's package protected attrs
private, so we require use of getters?

They are package protected, which means that they are safe to change
because we own all the code in that package.

Another thing that would slow down slightly with termBuffer use is
hash lookups (because Strings cache their hashCode).  So an analyzer
with a synonym filter and a stop filter would end up calculating that
twice (or constructing a String).  I don't know if it would be worth
maintaining that in the Token or not...
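
The cost in question is roughly this per-lookup scan (hypothetical
helper, not an existing method -- the same recurrence String.hashCode()
uses, just recomputed every time because a char[] can't cache it the
way a String does):

    // Sketch: hash a termBuffer region the way String.hashCode() would.
    static int termHash(char[] termBuffer, int length) {
      int h = 0;
      for (int i = 0; i < length; i++) {
        h = 31 * h + termBuffer[i];
      }
      return h;
    }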

-Yonik
