[jira] Commented: (LUCENE-1101) Tokenizers should reset positionIncrement to 1 in their next(Token result)

Doron Cohen (JIRA) Wed, 26 Dec 2007 22:39:08 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554533
 ]


Doron Cohen commented on LUCENE-1101:
-------------------------------------

{quote}
I think it's used for both tokenized and un-tokenized... see line 1319.
It seems redundant to call clear() in both the consumer (DocumentsWriter) and 
producer (Tokenizer).
{quote}

You're right again Yonik, I missed line 1319.

But I think it would be cleaner/safer to move the responsibility to clear() 
from consumers to producers.
(Producer being the deepest tokenstream in the call sequence, the one that 
would instantiate a new Token if it implemented next()).

Otherwise you get bugs like the one I had in testStopPositons() in the patch 
for LUCENE-1095: 
The test chains two stop filters:
* a = WhitespaceAnalyzer()
* f1 = StopFilter(a)
* f2 = StopFilter(f1)

Now the test itself calls next().
StopFilter implements only next(Token).
So this is the sequence we get:
* the test calls f2.next()
* TokenStream.next() calls f2.next(new Token())
* f2.next(r) calls f1.next(r) repeatedly (until r is not stopped).
* f1.next(r) calls a.ts.next(r) repeatedly (until r is not stopped).

The cause of the bug was that when f1 returns a token r, it may have set r's 
pos_incr to something other than 1. But when f2 calls f1 again (because f2 
stopped r), that pos_incr should have been cleared to 1. Now this can also be 
fixed by changing StopFilter to clear() before every call to its input's 
next(r), except for the first call, where it can assume that its consumer 
already cleared r. Since most words are not stopped, this distinction between 
the first call and successive calls is important, performance-wise.
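
To make the sequence concrete, here is a minimal, self-contained sketch of it. 
Tok, Stream, Words and Stop are hypothetical stand-ins, not the real Lucene 
Token, TokenStream, tokenizer or StopFilter classes, and the skipped-position 
arithmetic in Stop is only an assumption about what the LUCENE-1095 patch does. 
The clearBeforeRetry flag switches on the StopFilter-side fix: clear r before 
every call to the input except the first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class StaleIncrementDemo {
    /** Simplified stand-in for a reusable token (not Lucene's Token). */
    static class Tok {
        String text;
        int posIncr = 1;
        void clear() { text = null; posIncr = 1; }
    }

    /** Stand-in for the reuse API: fills and returns r, or returns null at end. */
    interface Stream { Tok next(Tok r); }

    /** Producer that overwrites the text but leaves posIncr untouched. */
    static class Words implements Stream {
        private final Iterator<String> it;
        Words(String... words) { it = Arrays.asList(words).iterator(); }
        public Tok next(Tok r) {
            if (!it.hasNext()) return null;
            r.text = it.next();
            return r;
        }
    }

    /** Stop filter that adds the increments of skipped tokens to the next kept one. */
    static class Stop implements Stream {
        private final Stream in;
        private final Set<String> stops;
        private final boolean clearBeforeRetry;
        Stop(Stream in, boolean clearBeforeRetry, String... stops) {
            this.in = in;
            this.clearBeforeRetry = clearBeforeRetry;
            this.stops = new HashSet<>(Arrays.asList(stops));
        }
        public Tok next(Tok r) {
            int skipped = 0;
            // First call: trust the caller to have handed us a cleared token.
            Tok t = in.next(r);
            while (t != null) {
                if (!stops.contains(t.text)) {
                    t.posIncr += skipped;
                    return t;
                }
                skipped += t.posIncr;
                if (clearBeforeRetry) r.clear(); // later calls: reset r ourselves
                t = in.next(r);
            }
            return null;
        }
    }

    /** Runs "one two three four" through f2(f1(a)) and collects the increments. */
    static List<Integer> increments(boolean clearBeforeRetry) {
        Stream a = new Words("one", "two", "three", "four");
        Stream f1 = new Stop(a, clearBeforeRetry, "one");
        Stream f2 = new Stop(f1, clearBeforeRetry, "two");
        List<Integer> out = new ArrayList<>();
        // Mimics the non-reuse next(): a fresh token for every top-level call.
        for (Tok t = f2.next(new Tok()); t != null; t = f2.next(new Tok())) {
            out.add(t.posIncr);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(increments(false)); // [4, 1]: stale increment leaks into "three"
        System.out.println(increments(true));  // [3, 1]: "three" is 3 positions from the start
    }
}
```

In this sketch f1 stops "one" and f2 stops "two". Without the extra clear(), 
"three" inherits the increment of 2 that f1 had set on "two", so it comes out 
as 4 where 3 is correct.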

Now, this is a little complicated (and not only because of my writing style 
:-) ), and so I think moving the clear() responsibility to the (deepest) 
producer would make things simpler and safer. (?)
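
A sketch of the producer-side alternative, assuming simplified stand-ins again 
(Tok, Stream, Words and Stop are hypothetical, not Lucene classes): the deepest 
producer calls clear() on the reused token before filling it, and the filter 
carries no clear() logic at all. The caller here even reuses one uncleared 
token across calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class ProducerClearDemo {
    /** Simplified stand-in for a reusable token (not Lucene's Token). */
    static class Tok {
        String text;
        int posIncr = 1;
        void clear() { text = null; posIncr = 1; }
    }

    /** Stand-in for the reuse API: fills and returns r, or returns null at end. */
    interface Stream { Tok next(Tok r); }

    /** Deepest producer: resets the reused token before filling it. */
    static class Words implements Stream {
        private final Iterator<String> it;
        Words(String... words) { it = Arrays.asList(words).iterator(); }
        public Tok next(Tok r) {
            if (!it.hasNext()) return null;
            r.clear();          // the producer owns the reset
            r.text = it.next();
            return r;
        }
    }

    /** Stop filter with no clear() calls; it only accumulates skipped positions. */
    static class Stop implements Stream {
        private final Stream in;
        private final Set<String> stops;
        Stop(Stream in, String... stops) {
            this.in = in;
            this.stops = new HashSet<>(Arrays.asList(stops));
        }
        public Tok next(Tok r) {
            int skipped = 0;
            for (Tok t = in.next(r); t != null; t = in.next(r)) {
                if (!stops.contains(t.text)) {
                    t.posIncr += skipped;
                    return t;
                }
                skipped += t.posIncr;
            }
            return null;
        }
    }

    /** Runs "one two three four" through f2(f1(a)) with a single reused token. */
    static List<Integer> increments() {
        Stream a = new Words("one", "two", "three", "four");
        Stream f1 = new Stop(a, "one");
        Stream f2 = new Stop(f1, "two");
        List<Integer> out = new ArrayList<>();
        Tok r = new Tok(); // never cleared by this consumer
        for (Tok t = f2.next(r); t != null; t = f2.next(r)) {
            out.add(t.posIncr);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(increments()); // [3, 1]
    }
}
```

The cost is one clear() per produced token, but neither the filters nor the 
top-level consumer need to know whether r is clean.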


> Tokenizers should reset positionIncrement to 1 in their next(Token result) 
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch, lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the positionIncrement of the returned token to 1.

