As a side note, maybe a good point to mention it here: the clever people from the MG4J project solved these polymorphic char[]/String needs (fast modification, compactness, fast hashing, safety) with something called MutableString; the Javadoc there is worth reading.
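In case it is not clear what I mean, the trick is roughly the following (just a toy sketch of the idea from my side, not MG4J code or its actual API): one object that exposes its backing char[] for fast in-place changes while it is "loose", and caches its hash code once "compacted" so the very same object becomes a cheap hash key.

// Toy sketch of the MutableString idea -- NOT the actual MG4J class,
// just the "loose vs. compact" trick it is built around.
public final class CharBuf {
  private char[] chars = new char[16];
  private int len = 0;
  private int hash = 0;              // cached hash, only trusted while compact
  private boolean isCompact = false;

  /** Loose state: cheap in-place appends, no per-token allocation. */
  public CharBuf append(char c) {
    if (len == chars.length) {
      char[] bigger = new char[2 * chars.length];
      System.arraycopy(chars, 0, bigger, 0, len);
      chars = bigger;
    }
    chars[len++] = c;
    isCompact = false;
    return this;
  }

  public CharBuf clear() { len = 0; isCompact = false; return this; }

  /** Direct access to the backing array, for the char[]-style processing analyzers want. */
  public char[] array() { return chars; }
  public int length() { return len; }
  public char charAt(int i) { return chars[i]; }

  /** Compact state: hash is computed once, so the same object can be reused as a map key. */
  public CharBuf compact() {
    hash = computeHash();
    isCompact = true;
    return this;
  }

  public int hashCode() { return isCompact ? hash : computeHash(); }

  private int computeHash() {
    int h = 0;
    for (int i = 0; i < len; i++) h = 31 * h + chars[i];
    return h;
  }

  public boolean equals(Object o) {
    if (!(o instanceof CharBuf)) return false;
    CharBuf other = (CharBuf) o;
    if (other.len != len) return false;
    for (int i = 0; i < len; i++) if (other.chars[i] != chars[i]) return false;
    return true;
  }

  public String toString() { return new String(chars, 0, len); }
}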
----- Original Message ----
From: eks dev <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Saturday, 21 July, 2007 6:23:25 PM
Subject: Re: Token termBuffer issues

On 7/21/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

>> To further improve "out of the box" performance I would really also
>> like to fix the core analyzers, when possible, to re-use a single
>> Token instance for each Token they produce.  This would then mean no
>> objects are created as you step through Tokens in the TokenStream
>> ... so this would give the best performance.

> How much better I wonder?  Small object allocation & reclaiming is
> supposed to be very good in current JVMs.

Sorry, I cannot give you exact numbers now, but I know for sure that we decided to move the "real analysis" into a separate phase that gets executed before entering the Lucene TokenStream and indexing, due to the String in Token, and then do just simple whitespace tokenisation during indexing. This was not done just for fun; there was a real benefit in it.

The performance issue here is in making transformations on tokens during analysis (where this applies). You gave a very nice example, stemming, which itself generates new Strings; another nice example is the NGram generation in SpellChecker, which generates a rather huge number of small objects.

Ironically, even the simplest model, tokenize (without modifying) and index, benefits from char[], as things then go really fast in general, so every new String() on the way gets noticed in the profiler. While testing the new indexing code from Mike, we also changed our vanilla Tokenizer to use termBuffer and there was again something like a 10-15% boost. It has been a while since then, so I do not know the exact numbers, but I have learned this many times the hard way: nothing beats char[] when it comes to text processing.

To stop bothering you people: IMHO, there is hard work to be done in the Analyzer chain before a Token gets ready for prime time in Lucene core, and this is the place where String overproduction hurts.
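For concreteness, the kind of Tokenizer I mean is roughly the following. This is only a sketch of the idea, not our production code, and it is written from memory against the reusable next(Token)/termBuffer() API being discussed in this thread, so exact method names and signatures may differ from what finally lands in trunk. The point is simply that the caller-supplied Token and its char[] buffer are filled in place, with no new String and no new Token per term.

// Rough sketch only; assumes the reusable Token API from this thread
// (next(Token), termBuffer(), resizeTermBuffer(), setTermLength()).
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

public class ReusingWhitespaceTokenizer extends Tokenizer {

  private final char[] ioBuffer = new char[4096];  // raw chars from the Reader
  private int bufferLen = 0, bufferPos = 0;
  private int offset = 0;                          // absolute position in the input

  public ReusingWhitespaceTokenizer(Reader input) {
    super(input);
  }

  /** Fills the caller-supplied Token in place; no new String, no new Token per term. */
  public Token next(Token result) throws IOException {
    char[] buffer = result.termBuffer();
    if (buffer == null || buffer.length == 0) {
      result.resizeTermBuffer(64);                 // make sure there is a buffer to reuse
      buffer = result.termBuffer();
    }
    int length = 0;
    int start = offset;
    while (true) {
      if (bufferPos >= bufferLen) {                // refill the IO buffer
        bufferLen = input.read(ioBuffer);
        bufferPos = 0;
        if (bufferLen <= 0) {                      // EOF
          if (length == 0) return null;
          break;
        }
      }
      char c = ioBuffer[bufferPos++];
      offset++;
      if (!Character.isWhitespace(c)) {
        if (length == 0) start = offset - 1;       // first char of this token
        if (length == buffer.length) {
          result.resizeTermBuffer(length + 1);     // grow the reused buffer, still no String
          buffer = result.termBuffer();
        }
        buffer[length++] = c;
      } else if (length > 0) {
        break;                                     // whitespace ends the current token
      }
    }
    result.setTermLength(length);
    result.setStartOffset(start);
    result.setEndOffset(start + length);
    return result;
  }
}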