[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000568#comment-17000568 ]

Kazuaki Hiraga commented on LUCENE-9100:
----------------------------------------

[~elbek....@gmail.com] OK, now I understand why I was not able to reproduce 
your results. The tokenization results depend on how your custom dictionary 
was generated. Can you print the part-of-speech tags and other attributes 
along with the tokenized tokens? If Katakana terms don't have readings, they 
may not be in your dictionary; in that case you can add them to the source of 
your custom system dictionary, or just add them to the user dictionary and 
see the outcome. 
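
For example, a minimal sketch (assuming the same {{analyzer}} and {{text1}} as 
in your snippet below; {{PartOfSpeechAttribute}} and {{ReadingAttribute}} come 
from the kuromoji analysis module) that prints the POS tag and reading next to 
each surface form:

{code:java}
import org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute;
import org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute;

try (TokenStream tokens = analyzer.tokenStream("field", new StringReader(text1))) {
    CharTermAttribute chars = tokens.addAttribute(CharTermAttribute.class);
    PartOfSpeechAttribute pos = tokens.addAttribute(PartOfSpeechAttribute.class);
    ReadingAttribute reading = tokens.addAttribute(ReadingAttribute.class);
    tokens.reset();
    while (tokens.incrementToken()) {
        // A null reading usually means the dictionary entry has no reading field.
        System.out.println(chars + "\t" + pos.getPartOfSpeech() + "\t" + reading.getReading());
    }
    tokens.end();
}
{code}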

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
>
>                 Key: LUCENE-9100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9100
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 7.2
>            Reporter: Elbek Kamoliddinov
>            Priority: Major
>
> We use {{JapaneseTokenizer}} in prod and are seeing some inconsistent behavior. 
> With this text:
>  {{"マギアリス【単版話】 4話 (Unlimited Comics)"}} I get different results if I insert a 
> space before the {{【}} char. Here is a small code snippet demonstrating the case 
> (note that we use our own dictionary and connection costs):
> {code:java}
>         Analyzer analyzer = new Analyzer() {
>             @Override
>             protected TokenStreamComponents createComponents(String fieldName) {
>                 // Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
>                 Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), dictionaries.systemDictionary, dictionaries.unknownDictionary, dictionaries.connectionCosts, null, true, JapaneseTokenizer.Mode.SEARCH);
>                 return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
>             }
>         };
>         String text1 = "マギアリス【単版話】 4話 (Unlimited Comics)";
>         String text2 = "マギアリス 【単版話】 4話 (Unlimited Comics)"; // inserted space
>         try (TokenStream tokens = analyzer.tokenStream("field", new StringReader(text1))) {
>             CharTermAttribute chars = tokens.addAttribute(CharTermAttribute.class);
>             tokens.reset();
>             while (tokens.incrementToken()) {
>                 System.out.println(chars.toString());
>             }
>             tokens.end();
>         } catch (IOException e) {
>             // should never happen with a StringReader
>             throw new RuntimeException(e);
>         }
> {code}
> Output is:
> {code:java}
> //text1
> マギ
> アリス
> 単
> 版
> 話
> 4
> 話
> unlimited
> comics
> //text2
> マギア
> リス
> 単
> 版
> 話
> 4
> 話
> unlimited
> comics{code}
> It looks like the tokenizer doesn't treat the punctuation ({{【}} is of the 
> {{Character.START_PUNCTUATION}} type) as an indicator that there should be a 
> token break, and somehow the {{【}} punctuation char causes a difference in 
> the output. If I use the {{JapaneseTokenizer}} with the built-in dictionary 
> (the commented-out constructor above), this problem doesn't manifest, because 
> it doesn't tokenize {{マギアリス}} into multiple tokens and outputs it as is. 
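>
> A quick standalone check of that character class (a minimal sketch, nothing Lucene-specific):
> {code:java}
> // 【 is U+3010 LEFT BLACK LENTICULAR BRACKET
> System.out.println(Character.getType('【') == Character.START_PUNCTUATION); // prints true
> {code}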


