On 7/21/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
To further improve "out of the box" performance I would really also like to fix the core analyzers, when possible, to re-use a single Token instance across all the Tokens they produce. This would then mean no objects are created as you step through Tokens in the TokenStream ... so this would give the best performance.
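As a rough sketch of what that could look like in a simple tokenizer (illustrative only, not an actual patch; new Token() and setTermBuffer() assume the reuse-oriented termBuffer API being discussed in this thread, while setStartOffset()/setEndOffset() are existing Token setters):

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Sketch: a whitespace tokenizer that recycles one Token instance instead of
// allocating a new Token per term.
public class ReusingWhitespaceTokenizer extends TokenStream {
  private final Reader input;
  private final char[] buffer = new char[255];      // fixed size for brevity
  private final Token reusableToken = new Token();  // the single recycled instance
  private int offset = 0;

  public ReusingWhitespaceTokenizer(Reader input) {
    this.input = input;
  }

  public Token next() throws IOException {
    int length = 0;
    int start = offset;
    int c;
    while ((c = input.read()) != -1) {
      offset++;
      if (Character.isWhitespace((char) c)) {
        if (length > 0) break;   // end of the current term
        start = offset;          // still skipping leading whitespace
      } else if (length < buffer.length) {
        buffer[length++] = (char) c;
      }
    }
    if (length == 0) return null;  // end of stream

    // Overwrite the recycled Token's state; no new object per term.
    reusableToken.setTermBuffer(buffer, 0, length);
    reusableToken.setStartOffset(start);
    reusableToken.setEndOffset(start + length);
    return reusableToken;
  }
}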
How much better, I wonder? Small-object allocation and reclamation is supposed to be very good in current JVMs. You got good performance out of reusing Field objects, I think, partially because of the convoluted logic tree in the constructor, including interning the field name. Token is a much lighter-weight class, with no unpredictable branches, less state (except we increased it with the termBuffer stuff ;-) and no string interning (which presumably involves synchronization).
The problem is, this would be a sneaky API change and would, for example, break anyone who gathers the Tokens into an array and processes them later (e.g. maybe the highlighter does?).
I think there is probably quite a bit of code out there that needs to look at tokens in context (and thus does some sort of buffering).
Maybe one way to migrate to this would be to add a new method "Token nextDirect()" to TokenStream which is allowed to return a Token that may be recycled. This means callers of nextDirect() must make their own copy of the Token if they intend to keep it for later usage. It would default to "return next()", and I would then default "next()" to call nextDirect() but make a full copy of what's returned. DocumentsWriter would use this to step through the tokens.

Analyzers would then implement either next() or nextDirect(). We would fix all core analyzers to implement nextDirect(), and then all other analyzers would remain backwards compatible. From a caller's standpoint the only difference between nextDirect() and next() is that next() guarantees to give you a "full private copy" of the Token but nextDirect() does not.

We could also punt on this entirely, since it is always possible for people to make their own analyzers that re-use Tokens, but I think having decent "out of the box" performance with Lucene is important. Ie, Lucene's defaults should be set up so that you get decent performance without changing anything; you should not have to work hard to get good performance, and unfortunately today you still do....
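Roughly, the two defaults could hang together like this (a sketch only; the name nextDirect() and the copy details are just the proposal above, not committed API, and a concrete stream would have to override at least one of the two methods or the defaults would call each other forever):

import java.io.IOException;

import org.apache.lucene.analysis.Token;

// Illustrative sketch of the proposed TokenStream defaults; not real Lucene API.
public abstract class TokenStreamSketch {

  // Guarantees the caller a full private copy of the Token.
  public Token next() throws IOException {
    Token t = nextDirect();
    if (t == null) return null;
    // Make a full private copy so callers may buffer it safely.
    Token copy = new Token(t.termText(), t.startOffset(), t.endOffset(), t.type());
    copy.setPositionIncrement(t.getPositionIncrement());
    return copy;
  }

  // May return a recycled Token that will be overwritten on the following call;
  // callers (e.g. DocumentsWriter) must copy anything they want to keep.
  // Defaults to next() so existing analyzers keep working unchanged.
  public Token nextDirect() throws IOException {
    return next();
  }
}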
I'd be surprised if we can gain meaningful performance with token reuse in a typical full-text analysis chain (stemming, stop-words, etc). Have you done any measurements, or is it a hunch at this point?
Hmmm ... is it too late to make all of Token's package-protected attrs private, so we require use of getters?
They are package-protected, which means they are safe to change, because we own all the code in that package. Another thing that would slow down slightly with termBuffer use is hash lookups (because Strings cache their hashCode). So an analyzer with a synonym filter and a stop filter would end up calculating that hash twice (or constructing a String). I don't know if it would be worth maintaining that in the Token or not...

-Yonik
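As a rough illustration of that rescanning cost: without a cached hashCode, every filter that wants a hash lookup on the term has to walk the char[] again. The helper below is hypothetical, using the same recurrence String.hashCode() caches:

// Hypothetical helper, for illustration only.  A synonym filter and a stop
// filter would each run a loop like this per token (or build a String) unless
// the hash were cached on the Token.
static int termHash(char[] termBuffer, int termLength) {
  int h = 0;
  for (int i = 0; i < termLength; i++) {
    h = 31 * h + termBuffer[i];
  }
  return h;
}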