[jira] Updated: (LUCENE-2207) CJKTokenizer generates tokens with incorrect offsets

Robert Muir (JIRA) Wed, 13 Jan 2010 09:49:17 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Robert Muir updated LUCENE-2207:
--------------------------------

    Attachment: LUCENE-2207.patch

OK, I found the bug. The problem is that incrementToken() unconditionally 
increments the offset before it starts its main loop:

line 165:
{code}
offset++;
{code}

So when incrementToken() has no more text to return and returns false, it 
needs to subtract that increment back out of the offset, since no character 
was actually consumed.
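
To make the off-by-one concrete, here is a minimal sketch in plain Java. It is 
not the CJKTokenizer code and not the attached patch: the class OffsetSketch, 
the undoOnEnd flag, and the finalOffset() helper are hypothetical names made up 
for illustration, and the toy emits one token per character. It only assumes 
the pattern described above, an offset that is bumped before the reader is 
consulted.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical toy, NOT Lucene's CJKTokenizer: one "token" per character.
final class OffsetSketch {
    private final Reader input;
    private final boolean undoOnEnd; // true = apply the compensating decrement
    private int offset = 0;          // running character offset

    OffsetSketch(String text, boolean undoOnEnd) {
        this.input = new StringReader(text);
        this.undoOnEnd = undoOnEnd;
    }

    // Mirrors the pattern from the comment: increment first, read second.
    boolean incrementToken() throws IOException {
        offset++;                    // unconditional increment
        int c = input.read();
        if (c == -1) {
            if (undoOnEnd) {
                offset--;            // nothing was consumed, so undo the increment
            }
            return false;
        }
        return true;
    }

    // Drains the input and reports where the offset ends up.
    static int finalOffset(String text, boolean undoOnEnd) {
        OffsetSketch sketch = new OffsetSketch(text, undoOnEnd);
        try {
            while (sketch.incrementToken()) {
                // consume all tokens
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sketch.offset;
    }
}
```

In this toy, finalOffset("abc", false) ends one past the length of the text, 
while finalOffset("abc", true) ends exactly at the length. A final offset that 
is one too large would plausibly shift the offsets of every following value in 
a multi-valued field, which would fit the highlighting symptom in the report.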

Again, I think in the future we should try to refactor this offset logic to be 
simpler, but for the short term this fixes the bug and all tests pass.

Koji, can you review?

> CJKTokenizer generates tokens with incorrect offsets
> ----------------------------------------------------
>
>                 Key: LUCENE-2207
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2207
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Koji Sekiguchi
>         Attachments: LUCENE-2207.patch, TestCJKOffset.java
>
>
> If I index a Japanese *multi-valued* document with CJKTokenizer and highlight 
> a term with FastVectorHighlighter, the output snippets contain incorrectly 
> highlighted strings. I'll attach a program that reproduces the problem soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

