[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000262#comment-17000262 ]
Elbek Kamoliddinov commented on LUCENE-9100:
--------------------------------------------

[~h.kazuaki] We use our own dictionaries; sorry, I wanted to say *note* but managed to drop the *e*. Corrected it in the description.

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
>
>                 Key: LUCENE-9100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9100
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 7.2
>            Reporter: Elbek Kamoliddinov
>            Priority: Major
>
> We use {{JapaneseTokenizer}} in production and are seeing some inconsistent behavior. With this text: {{"マギアリス【単版話】 4話 (Unlimited Comics)"}} I get different results if I insert a space before the {{【}} char. Here is a small code snippet demonstrating the case (note we use our own dictionary and connection costs):
> {code:java}
> Analyzer analyzer = new Analyzer() {
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName) {
>         // Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(),
>         //         null, true, JapaneseTokenizer.Mode.SEARCH);
>         Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(),
>                 dictionaries.systemDictionary, dictionaries.unknownDictionary,
>                 dictionaries.connectionCosts, null, true, JapaneseTokenizer.Mode.SEARCH);
>         return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
>     }
> };
>
> String text1 = "マギアリス【単版話】 4話 (Unlimited Comics)";
> String text2 = "マギアリス 【単版話】 4話 (Unlimited Comics)"; // inserted space
>
> try (TokenStream tokens = analyzer.tokenStream("field", new StringReader(text1))) {
>     CharTermAttribute chars = tokens.addAttribute(CharTermAttribute.class);
>     tokens.reset();
>     while (tokens.incrementToken()) {
>         System.out.println(chars.toString());
>     }
>     tokens.end();
> } catch (IOException e) {
>     // should never happen with a StringReader
>     throw new RuntimeException(e);
> }
> {code}
> Output is:
> {code:java}
> //text1
> マギ
> アリス
> 単
> 版
> 話
> 4
> 話
> unlimited
> comics
> //text2
> マギア
> リス
> 単
> 版
> 話
> 4
> 話
> unlimited
> comics{code}
> It looks like the tokenizer doesn't treat the punctuation ({{【}} is of {{Character.START_PUNCTUATION}} type) as an indicator that there should be a token break, and somehow the {{【}} punctuation char causes a difference in the output. If I use the default {{JapaneseTokenizer}} then this problem doesn't manifest, because it doesn't split {{マギアリス}} into multiple tokens and outputs it as is.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
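The report's classification of {{【}} as start punctuation can be double-checked with plain JDK APIs, independent of Lucene. This is a side check, not part of the original report; the class name below is made up for illustration:

```java
// Standalone check: U+3010 【 is Unicode category Ps (Open_Punctuation),
// which java.lang.Character reports as START_PUNCTUATION, and its closing
// counterpart U+3011 】 is Pe (Close_Punctuation), i.e. END_PUNCTUATION.
public class PunctuationTypeCheck {
    public static void main(String[] args) {
        char open = '【';  // U+3010 LEFT BLACK LENTICULAR BRACKET
        char close = '】'; // U+3011 RIGHT BLACK LENTICULAR BRACKET
        System.out.println(Character.getType(open) == Character.START_PUNCTUATION);
        System.out.println(Character.getType(close) == Character.END_PUNCTUATION);
    }
}
```

Both lines print {{true}}, so whatever causes the differing segmentations, it is not a misclassification of the character's Unicode category on the JDK side.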