[ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Sekiguchi updated LUCENE-2207:
-----------------------------------
Attachment: TestCJKOffset.java
I've attached a program that reproduces the problem. The program doesn't use
FastVectorHighlighter; instead, it prints the offsets from
TermVectorOffsetInfo. You'll see the following results:
{code}
=== WhitespaceAnalyzer ===
あい(0,2)
うえお(3,6)
=== CJKAnalyzer ===
あい(0,2)
うえ(4,6)
えお(5,7)
=== BasicNGramAnalyzer ===
あい(0,2)
うえ(3,5)
えお(4,6)
{code}
For those who see garbled characters above, here is the same output rewritten
with 'Cn' symbols:
{code}
=== WhitespaceAnalyzer ===
C1C2(0,2)
C3C4C5(3,6)
=== CJKAnalyzer ===
C1C2(0,2)
C3C4(4,6)
C4C5(5,7)
=== BasicNGramAnalyzer ===
C1C2(0,2)
C3C4(3,5)
C4C5(4,6)
{code}
As you can see, the start offset of 'C3' is 3 with WhitespaceAnalyzer and
BasicNGramAnalyzer, but 4 with CJKAnalyzer, which is incorrect!
(BasicNGramAnalyzer is an analyzer that uses BasicNGramTokenizer, the
tokenizer used in the FastVectorHighlighter test code. It works as a 2-gram
tokenizer for ASCII as well as CJK.)
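To make the expected numbers concrete, here is a minimal sketch in plain Java
(no Lucene classes involved) of how a 2-gram tokenizer should assign offsets
across the values of a multi-valued field. The class name, the helper method,
and the gap of 1 between values are my assumptions; the gap is inferred from
the WhitespaceAnalyzer output above, where the second value starts at offset 3.
ASCII stands in for the CJK characters ("ab" for C1C2, "cde" for C3C4C5).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: this is not the CJKTokenizer or
// BasicNGramTokenizer implementation, just the offset arithmetic.
class BigramOffsetSketch {

    // Emits overlapping 2-grams as "text(start,end)" for each field value.
    // offsetGap is the assumed number of offset positions between two values.
    public static List<String> bigramsWithOffsets(String[] values, int offsetGap) {
        List<String> tokens = new ArrayList<>();
        int base = 0; // offset at which the current value starts
        for (String value : values) {
            for (int i = 0; i + 2 <= value.length(); i++) {
                int start = base + i;
                int end = start + 2;
                tokens.add(value.substring(i, i + 2) + "(" + start + "," + end + ")");
            }
            // The next value starts after this value plus the assumed gap.
            base += value.length() + offsetGap;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : bigramsWithOffsets(new String[] {"ab", "cde"}, 1)) {
            System.out.println(token);
        }
    }
}
```

With these inputs the sketch prints ab(0,2), cd(3,5), de(4,6), which matches
the BasicNGramAnalyzer result above and is what I would expect CJKAnalyzer to
produce as well.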
I'll look into it tomorrow or later, but volunteers are welcome!
> CJKTokenizer generates tokens with incorrect offsets
> ----------------------------------------------------
>
> Key: LUCENE-2207
> URL: https://issues.apache.org/jira/browse/LUCENE-2207
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/analyzers
> Reporter: Koji Sekiguchi
> Attachments: TestCJKOffset.java
>
>
> If I index a Japanese *multi-valued* document with CJKTokenizer and highlight
> a term with FastVectorHighlighter, the output snippets contain incorrectly
> highlighted strings. I'll attach a program that reproduces the problem soon.