As a side note, maybe a good point to mention it here: the clever people from the MG4J project solved these polymorphic char[]/String needs (fast modification, compactness, fast hashing, safety) with something called MutableString; the Javadoc there is worth reading.
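In case it is not clear what I mean, the trick is roughly the following (just a toy sketch of the idea from my side, not MG4J code or its actual API): one object that exposes its backing char[] for fast in-place changes while it is "loose", and caches its hash code once "compacted" so the very same object becomes a cheap hash key.

// Toy sketch of the MutableString idea -- NOT the actual MG4J class,
// just the "loose vs. compact" trick it is built around.
public final class CharBuf {
  private char[] chars = new char[16];
  private int len = 0;
  private int hash = 0;              // cached hash, only trusted while compact
  private boolean isCompact = false;

  /** Loose state: cheap in-place appends, no per-token allocation. */
  public CharBuf append(char c) {
    if (len == chars.length) {
      char[] bigger = new char[2 * chars.length];
      System.arraycopy(chars, 0, bigger, 0, len);
      chars = bigger;
    }
    chars[len++] = c;
    isCompact = false;
    return this;
  }

  public CharBuf clear() { len = 0; isCompact = false; return this; }

  /** Direct access to the backing array, for the char[]-style processing analyzers want. */
  public char[] array() { return chars; }
  public int length() { return len; }
  public char charAt(int i) { return chars[i]; }

  /** Compact state: hash is computed once, so the same object can be reused as a map key. */
  public CharBuf compact() {
    hash = computeHash();
    isCompact = true;
    return this;
  }

  public int hashCode() { return isCompact ? hash : computeHash(); }

  private int computeHash() {
    int h = 0;
    for (int i = 0; i < len; i++) h = 31 * h + chars[i];
    return h;
  }

  public boolean equals(Object o) {
    if (!(o instanceof CharBuf)) return false;
    CharBuf other = (CharBuf) o;
    if (other.len != len) return false;
    for (int i = 0; i < len; i++) if (other.chars[i] != chars[i]) return false;
    return true;
  }

  public String toString() { return new String(chars, 0, len); }
}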
----- Original Message ----
From: eks dev <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Saturday, 21 July, 2007 6:23:25 PM
Subject: Re: Token termBuffer issues

On 7/21/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

>> To further improve "out of the box" performance I would really also
>> like to fix the core analyzers, when possible, to re-use a single
>> Token instance for each Token they produce.  This would then mean no
>> objects are created as you step through Tokens in the TokenStream
>> ... so this would give the best performance.

> How much better I wonder?  Small object allocation & reclaiming is
> supposed to be very good in current JVMs.

Sorry, I cannot give you exact numbers now, but I know for sure that we decided to move the "real analysis" into a separate phase that gets executed before entering the Lucene TokenStream and indexing, due to the String in Token, and then do just simple whitespace tokenisation during indexing. This was not done just for fun; there was a real benefit in it.

The performance issue here is in making transformations on tokens during analysis (where this applies). You gave a very nice example, stemming, which itself generates new Strings; another nice example is the NGram generation in SpellChecker, which generates a rather huge number of small objects.

Ironically, even the simplest model, tokenize (without modifying) and index, benefits from char[], as things then go really fast in general, so every new String() on the way gets noticed in the profiler. While testing the new indexing code from Mike, we also changed our vanilla Tokenizer to use termBuffer and there was again something like a 10-15% boost. It has been a while since then, so I do not know the exact numbers, but I have learned this many times the hard way: nothing beats char[] when it comes to text processing.

To stop bothering you people: IMHO, there is hard work to be done in the Analyzer chain before a Token gets ready for prime time in Lucene core, and this is the place where String overproduction hurts.
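For concreteness, the kind of Tokenizer I mean is roughly the following. This is only a sketch of the idea, not our production code, and it is written from memory against the reusable next(Token)/termBuffer() API being discussed in this thread, so exact method names and signatures may differ from what finally lands in trunk. The point is simply that the caller-supplied Token and its char[] buffer are filled in place, with no new String and no new Token per term.

// Rough sketch only; assumes the reusable Token API from this thread
// (next(Token), termBuffer(), resizeTermBuffer(), setTermLength()).
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

public class ReusingWhitespaceTokenizer extends Tokenizer {

  private final char[] ioBuffer = new char[4096];  // raw chars from the Reader
  private int bufferLen = 0, bufferPos = 0;
  private int offset = 0;                          // absolute position in the input

  public ReusingWhitespaceTokenizer(Reader input) {
    super(input);
  }

  /** Fills the caller-supplied Token in place; no new String, no new Token per term. */
  public Token next(Token result) throws IOException {
    char[] buffer = result.termBuffer();
    if (buffer == null || buffer.length == 0) {
      result.resizeTermBuffer(64);                 // make sure there is a buffer to reuse
      buffer = result.termBuffer();
    }
    int length = 0;
    int start = offset;
    while (true) {
      if (bufferPos >= bufferLen) {                // refill the IO buffer
        bufferLen = input.read(ioBuffer);
        bufferPos = 0;
        if (bufferLen <= 0) {                      // EOF
          if (length == 0) return null;
          break;
        }
      }
      char c = ioBuffer[bufferPos++];
      offset++;
      if (!Character.isWhitespace(c)) {
        if (length == 0) start = offset - 1;       // first char of this token
        if (length == buffer.length) {
          result.resizeTermBuffer(length + 1);     // grow the reused buffer, still no String
          buffer = result.termBuffer();
        }
        buffer[length++] = c;
      } else if (length > 0) {
        break;                                     // whitespace ends the current token
      }
    }
    result.setTermLength(length);
    result.setStartOffset(start);
    result.setEndOffset(start + length);
    return result;
  }
}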