"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > > OK, I ran some benchmarks here. > > > > > > > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and > > > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all > > > > Wikipedia content using LowerCaseTokenizer + StopFilter + > > > > PorterStemFilter. I think it's worth pursuing! > > > > > > Did you try it w/o token reuse (reuse tokens only when mutating, not > > > when creating new tokens from the tokenizer)? > > > > I haven't tried this variant yet. I guess for long filter chains the > > GC cost of the tokenizer making the initial token should go down as > > overall part of the time. Though I think we should still re-use the > > initial token since it should (?) only help. > > If it weren't any slower, that would be great... but I worry about > filters that need buffering (either on the input side or the output > side) and how that interacts with filters that try and reuse.
OK I will tease out this effect & measure the performance impact. This
would mean that the tokenizer must not only produce a new Token
instance for each term but also cannot re-use the underlying char[]
buffer in that token, right? EG with the mods for CharTokenizer I
re-use its char[] buffer with every Token, but I'll change that to be a
new buffer for each Token for this test.

> Filters that do output buffering could be slowed down if they must copy
> the token state to the passed token. I like Doron's idea that a new
> Token could be returned anyway.
>
> The extra complexity seemingly involved in trying to make both
> scenarios perform well is what prompts me to ask what the performance
> gain will be.

Yes I like Doron's idea too -- it's just a "suggestion" to use the
input Token if it's convenient.

I think the resulting API is fairly simple with this change: if you
(the consumer) want a "full private copy" of the Token (like
QueryParser, Highlighter, CachedTokenFilter, a filter that does input
buffering, etc.) you call input.next(). If instead you can handle
re-use, because you will consume this Token once, right now, and never
look at it again (like DocumentsWriter), then you call the next(Token)
API.

Mike
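P.S. To make this concrete, here's a rough, untested sketch of the
reuse variant -- a toy whitespace tokenizer in the style of the
CharTokenizer mods. The Token class below is a made-up minimal
stand-in (termBuffer/resizeTermBuffer/setTermLength etc. are
placeholder names, not the final API):

    import java.io.IOException;
    import java.io.Reader;

    // Minimal stand-in for the proposed Token API (invented for this
    // sketch; the real Token also carries type, position increment, ...).
    class Token {
      private char[] buf = new char[10];
      private int len, start, end;
      char[] termBuffer() { return buf; }
      char[] resizeTermBuffer(int newSize) {
        if (buf.length < newSize) {
          // Grow geometrically, preserving contents, so the same
          // buffer keeps getting re-used across tokens.
          char[] newBuf = new char[Math.max(newSize, 2 * buf.length)];
          System.arraycopy(buf, 0, newBuf, 0, buf.length);
          buf = newBuf;
        }
        return buf;
      }
      void setTermLength(int length) { len = length; }
      void setStartOffset(int s) { start = s; }
      void setEndOffset(int e) { end = e; }
      String term() { return new String(buf, 0, len); }
    }

    class ReusingWhitespaceTokenizer {
      private final Reader input;
      private int offset = 0;

      ReusingWhitespaceTokenizer(Reader input) { this.input = input; }

      // Reuse variant: fills in the caller's Token (and its char[]
      // buffer) instead of allocating a new Token per term.
      public Token next(Token result) throws IOException {
        char[] buffer = result.termBuffer(); // caller's buffer, re-used
        int length = 0;
        int start = offset;
        int c;
        while ((c = input.read()) != -1) {
          offset++;
          if (!Character.isWhitespace((char) c)) {
            if (length == 0)
              start = offset - 1;            // token starts here
            if (length == buffer.length)
              buffer = result.resizeTermBuffer(length + 1);
            buffer[length++] = (char) c;
          } else if (length > 0) {
            break;                           // hit end of token
          }
        }
        if (length == 0)
          return null;                       // end of stream
        result.setTermLength(length);
        result.setStartOffset(start);
        result.setEndOffset(start + length);
        return result;                       // same instance handed back
      }

      // "Full private copy" variant: trivially layered on top.
      public Token next() throws IOException {
        return next(new Token());            // fresh Token the caller owns
      }
    }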
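And here's what the two consumption styles look like from the
consumer's side, against that same sketch:

    import java.io.StringReader;

    public class ConsumeDemo {
      public static void main(String[] args) throws Exception {
        // Reuse style (DocumentsWriter-like): one Token for the whole
        // stream; each returned Token must be fully consumed before
        // the next call, since that call may overwrite it.
        ReusingWhitespaceTokenizer stream =
          new ReusingWhitespaceTokenizer(new StringReader("some test tokens"));
        Token reusable = new Token();
        for (Token t = stream.next(reusable); t != null; t = stream.next(reusable))
          System.out.println(t.term());

        // Private-copy style (QueryParser/Highlighter/CachedTokenFilter
        // -like): no-arg next() hands back a Token the caller may keep.
        stream = new ReusingWhitespaceTokenizer(new StringReader("some test tokens"));
        java.util.List<Token> cached = new java.util.ArrayList<Token>();
        for (Token t = stream.next(); t != null; t = stream.next())
          cached.add(t);                     // safe: these won't be mutated
        System.out.println(cached.size() + " tokens cached");
      }
    }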