[ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Sekiguchi updated LUCENE-2207:
-----------------------------------

    Attachment: TestCJKOffset.java

I have attached a program that reproduces the problem. The program does not use FastVectorHighlighter; instead, it prints the offsets from TermVectorOffsetInfo. It produces the following results:

{code}
=== WhitespaceAnalyzer ===
あい(0,2) うえお(3,6)
=== CJKAnalyzer ===
あい(0,2) うえ(4,6) えお(5,7)
=== BasicNGramAnalyzer ===
あい(0,2) うえ(3,5) えお(4,6)
{code}

For readers who see garbled characters, here is the same output with 'Cn' symbols in place of the Japanese characters:

{code}
=== WhitespaceAnalyzer ===
C1C2(0,2) C3C4C5(3,6)
=== CJKAnalyzer ===
C1C2(0,2) C3C4(4,6) C4C5(5,7)
=== BasicNGramAnalyzer ===
C1C2(0,2) C3C4(3,5) C4C5(4,6)
{code}

As you can see, the start offset of 'C3' is 3 with WhitespaceAnalyzer and BasicNGramAnalyzer, but 4 with CJKAnalyzer, which is incorrect. (BasicNGramAnalyzer is an analyzer that uses BasicNGramTokenizer. BasicNGramTokenizer is used in the FastVectorHighlighter test code and works as a 2-gram tokenizer for ASCII as well as CJK text.)

I'll look into it tomorrow or later, but volunteers are welcome!

> CJKTokenizer generates tokens with incorrect offsets
> ----------------------------------------------------
>
>                 Key: LUCENE-2207
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2207
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Koji Sekiguchi
>         Attachments: TestCJKOffset.java
>
>
> If I index a Japanese *multi-valued* document with CJKTokenizer and highlight
> a term with FastVectorHighlighter, the output snippets contain an incorrectly
> highlighted string. I'll attach a program that reproduces the problem soon.
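
To make the expected offsets concrete, here is a minimal plain-Java sketch of how a 2-gram tokenizer would number tokens across a multi-valued field. It is not Lucene code and not the attached TestCJKOffset.java: the class `BigramOffsetSketch`, the helper type `Tok`, and the method `bigrams` are hypothetical names. It assumes the two field values are two and three characters long (ASCII "ab" and "cde" stand in for C1C2 and C3C4C5) and that consecutive values are separated by an offset gap of 1, which is what the WhitespaceAnalyzer output (0,2) then (3,6) suggests.

```java
import java.util.ArrayList;
import java.util.List;

class BigramOffsetSketch {

    /** A token text with its start and end character offsets (hypothetical helper type). */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "(" + start + "," + end + ")";
        }
    }

    /**
     * Splits each value into overlapping 2-grams. Offsets are relative to the start
     * of the first value; each later value starts at the previous value's end
     * offset plus offsetGap (assumed to be 1 in this discussion).
     * Values shorter than two characters produce no tokens in this sketch.
     */
    static List<Tok> bigrams(String[] values, int offsetGap) {
        List<Tok> out = new ArrayList<>();
        int base = 0; // start offset of the current value
        for (String value : values) {
            for (int i = 0; i + 2 <= value.length(); i++) {
                out.add(new Tok(value.substring(i, i + 2), base + i, base + i + 2));
            }
            base += value.length() + offsetGap;
        }
        return out;
    }

    public static void main(String[] args) {
        // "ab" and "cde" stand in for C1C2 and C3C4C5.
        for (Tok t : bigrams(new String[] {"ab", "cde"}, 1)) {
            System.out.println(t);
        }
    }
}
```

Under these assumptions the second value's first 2-gram starts at offset 3, the same start offset that WhitespaceAnalyzer and BasicNGramAnalyzer report for 'C3', whereas CJKAnalyzer reports 4.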