[ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554533 ]
Doron Cohen commented on LUCENE-1101:
-------------------------------------

{quote}
I think it's used for both tokenized and un-tokenized.... see line 1319.
It seems redundant to call clear() in both the consumer (DocumentsWriter) and producer (Tokenizer).
{quote}

You're right again Yonik, I missed line 1319.

But I think it would be cleaner/safer to move the responsibility to clear() from consumers to producers. (The producer being the deepest token stream in the call sequence, the one that would instantiate a new Token if it implemented next().)

Otherwise you get bugs like the one I had in testStopPositons() in the patch for LUCENE-1095.

The test chains two stop filters:
* a = WhitespaceAnalyzer()
* f1 = StopFilter(a)
* f2 = StopFilter(f1)

Now the test itself calls next(). StopFilter implements only next(Token). So this is the sequence we get:
* the test calls f2.next()
* TokenStream.next() calls f2.next(new Token())
* f2.next(r) calls f1.next(r) repeatedly (until r is not stopped)
* f1.next(r) calls a.ts.next(r) repeatedly (until r is not stopped)

The cause of the bug was that when f1 returns a token r, it may have set r's pos_incr to something other than 1. But when f2 calls f1 again (because f2 stopped r), that pos_incr should have been reset to 1.

Now this can also be fixed by changing StopFilter to call clear() before every call to its input's next(r) except the first, because for the first call it can assume that its consumer already cleared r. Since most words are not stopped, this distinction between the first call and successive calls is important, performance-wise.

Now, this is a little complicated (and not only because of my writing style :-) ), and so I think moving the clear() responsibility to the deepest producer would make things simpler and safer. (?)
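To make the call sequence above concrete, here is a minimal, self-contained sketch of the two chained stop filters. It deliberately does not use Lucene's classes: SketchToken, SketchStream, SketchTokenizer, SketchStopFilter and ProducerClearSketch are hypothetical stand-ins, and the stop-filter logic (add the skipped increments to the returned token) is an assumption about how such a filter accumulates positions, not a copy of the real StopFilter. The clearOnProduce flag switches between "the deepest producer clears the reused token" and "nobody clears between calls".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a reusable token; not Lucene's Token class. */
class SketchToken {
    String text;
    int positionIncrement = 1;

    void clear() {
        text = null;
        positionIncrement = 1;
    }
}

/** Hypothetical stand-in for the reuse form next(Token result). */
interface SketchStream {
    SketchToken next(SketchToken result);
}

/** Deepest producer: splits on whitespace and optionally clears the reused token. */
class SketchTokenizer implements SketchStream {
    private final String[] words;
    private final boolean clearOnProduce;
    private int index = 0;

    SketchTokenizer(String input, boolean clearOnProduce) {
        this.words = input.trim().split("\\s+");
        this.clearOnProduce = clearOnProduce;
    }

    public SketchToken next(SketchToken result) {
        if (index >= words.length) {
            return null;
        }
        if (clearOnProduce) {
            result.clear(); // producer-side reset of positionIncrement to 1
        }
        result.text = words[index++];
        return result;
    }
}

/** Assumed stop-filter behavior: skip stop words, add their increments to the next kept token. */
class SketchStopFilter implements SketchStream {
    private final SketchStream input;
    private final Set<String> stopWords;

    SketchStopFilter(SketchStream input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = stopWords;
    }

    public SketchToken next(SketchToken result) {
        int skipped = 0;
        SketchToken t;
        while ((t = input.next(result)) != null) {
            if (!stopWords.contains(t.text)) {
                t.positionIncrement += skipped;
                return t;
            }
            skipped += t.positionIncrement;
        }
        return null;
    }
}

class ProducerClearSketch {
    /** Runs "x the of y" through f2(f1(a)) and collects the position increments. */
    static List<Integer> increments(boolean clearOnProduce) {
        SketchStream a = new SketchTokenizer("x the of y", clearOnProduce);
        SketchStream f1 = new SketchStopFilter(a, Collections.singleton("the"));
        SketchStream f2 = new SketchStopFilter(f1, Collections.singleton("of"));
        List<Integer> out = new ArrayList<>();
        SketchToken t;
        // Like the non-reuse next(), the consumer hands in a fresh token per call.
        while ((t = f2.next(new SketchToken())) != null) {
            out.add(t.positionIncrement);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("producer clears:  " + increments(true));
        System.out.println("nobody clears:    " + increments(false));
    }
}
```

In this sketch, "y" sits three positions after "x", so its increment should be 3. With the producer clearing, the increments come out as [1, 3]; without it, f1 leaves an increment of 2 on the token it returned for "of", that stale value is still there when the same token is reused for "y", and f2 ends up reporting [1, 4].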
> Tokenizers should reset positionIncrement to 1 in their next(Token result)
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch, lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result)
> should reset the positionIncrement of the returned token to 1.