[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305997#comment-17305997 ]
Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

FYI: I'm planning to change the JapaneseAnalyzer default behaviour to use CJKWidthCharFilter instead of CJKWidthFilter (on the main branch only). See LUCENE-9853.

> Add a char filter corresponding to CJKWidthFilter
> -------------------------------------------------
>
>                 Key: LUCENE-9413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9413
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Minor
>             Fix For: main (9.0), 8.8
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In association with issues in Elasticsearch
> ([https://github.com/elastic/elasticsearch/issues/58384] and
> [https://github.com/elastic/elasticsearch/issues/58385]), this could be useful
> for the default Japanese analyzer.
> Although it is not a bug that FULL- and HALF-width characters are not normalized
> before tokenization, the behaviour sometimes confuses beginners and users who
> have limited knowledge of Japanese analysis (and Unicode).
> If we have a FULL- and HALF-width character normalization char filter in
> {{analyzers-common}}, we can include it in JapaneseAnalyzer (currently,
> JapaneseAnalyzer contains CJKWidthFilter, but it is applied after tokenization,
> so some FULL-width numbers or Latin alphabet characters are split by the
> tokenizer). A sketch of what applying the normalization ahead of the tokenizer
> could look like is shown below.
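For illustration only, here is a minimal sketch of an Analyzer that applies width normalization before tokenization, assuming CJKWidthCharFilter lives in {{org.apache.lucene.analysis.cjk}} with a single-Reader constructor and using JapaneseTokenizer from the kuromoji module; the class name WidthNormalizingJapaneseAnalyzer is hypothetical, not an existing Lucene class.

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKWidthCharFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

// Hypothetical analyzer: normalize FULL/HALF width characters *before*
// JapaneseTokenizer runs, so full-width digits and Latin letters are not
// split apart by the tokenizer.
public class WidthNormalizingJapaneseAnalyzer extends Analyzer {

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Char filters wrap the Reader and therefore run ahead of tokenization.
    return new CJKWidthCharFilter(reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // No user dictionary, discard punctuation, SEARCH segmentation mode.
    Tokenizer tokenizer =
        new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
    return new TokenStreamComponents(tokenizer);
  }
}
{code}

With this ordering, input such as "ＡＢＣ１２３" is folded to "ABC123" before the tokenizer sees it, which is the behaviour the issue aims to make the default; a token-filter-only pipeline normalizes width after the tokenizer has already decided token boundaries.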