[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010934#comment-17010934 ]
Elbek Kamoliddinov commented on LUCENE-9100:
--------------------------------------------

I made some progress on this. I found a case where text influences how the text preceding it is tokenized. For example, the text {{日本語【単話版】}} produces the following tokens:
{code:java}
日本
日本語
語
単
話
版
{code}
But the text {{日本語 単話版}} produces:
{code:java}
日本語
単
話
版
{code}
The char {{【}} influences how the first three chars are tokenized, even though it is itself a punctuation char. Wouldn't it make sense to treat punctuation chars as token breakers and cut off their influence?

The code I used to produce the tokens:
{code:java}
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
        return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
    }
};

try (TokenStream tokens = analyzer.tokenStream("field", new StringReader("日本語 単話版"))) {
    CharTermAttribute chars = tokens.addAttribute(CharTermAttribute.class);
    tokens.reset();
    while (tokens.incrementToken()) {
        System.out.println(chars.toString());
    }
    tokens.end();
} catch (IOException e) {
    // should never happen with a StringReader
    throw new RuntimeException(e);
}
{code}

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
>
>                 Key: LUCENE-9100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9100
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 7.2
>           Reporter: Elbek Kamoliddinov
>           Priority: Major
>
> We use {{JapaneseTokenizer}} in production and are seeing some inconsistent behavior. With this text:
> {{"マギアリス【単版話】 4話 (Unlimited Comics)"}} I get different results if I insert a space before the {{【}} char.
> Here is a small code snippet demonstrating the case (note we use our own dictionary and connection costs):
> {code:java}
> Analyzer analyzer = new Analyzer() {
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName) {
>         // Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
>         Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), dictionaries.systemDictionary, dictionaries.unknownDictionary, dictionaries.connectionCosts, null, true, JapaneseTokenizer.Mode.SEARCH);
>         return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
>     }
> };
>
> String text1 = "マギアリス【単版話】 4話 (Unlimited Comics)";
> String text2 = "マギアリス 【単版話】 4話 (Unlimited Comics)"; // inserted space
>
> try (TokenStream tokens = analyzer.tokenStream("field", new StringReader(text1))) {
>     CharTermAttribute chars = tokens.addAttribute(CharTermAttribute.class);
>     tokens.reset();
>     while (tokens.incrementToken()) {
>         System.out.println(chars.toString());
>     }
>     tokens.end();
> } catch (IOException e) {
>     // should never happen with a StringReader
>     throw new RuntimeException(e);
> }
> {code}
> Output is:
> {code:java}
> // text1
> マギ
> アリス
> 単
> 版
> 話
> 4
> 話
> unlimited
> comics
>
> // text2
> マギア
> リス
> 単
> 版
> 話
> 4
> 話
> unlimited
> comics
> {code}
> It looks like the tokenizer doesn't treat the punctuation ({{【}} is of {{Character.START_PUNCTUATION}} type) as an indicator that there should be a token break, and somehow the {{【}} punctuation char causes a difference in the output. If I use the {{JapaneseTokenizer}} with its default dictionary, this problem doesn't manifest because it doesn't split {{マギアリス}} into multiple tokens and outputs it as is.
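The claim above that {{【}} belongs to {{Character.START_PUNCTUATION}} can be verified directly with the JDK, independent of Lucene. A minimal standalone sketch (the class name is made up for illustration):

```java
// Verify the Unicode general category of '【' (U+3010,
// LEFT BLACK LENTICULAR BRACKET) using only the JDK.
public class PunctuationTypeCheck {
    public static void main(String[] args) {
        char c = '【';
        // Character.getType returns the Unicode general category;
        // category Ps (open punctuation) maps to Character.START_PUNCTUATION.
        System.out.println(Character.getType(c) == Character.START_PUNCTUATION);
        // prints "true"
    }
}
```

So if punctuation were used as a hard token break, a check like this would be enough to identify {{【}} as a break character.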