[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomoko Uchida reopened LUCENE-9413:
-----------------------------------

> Add a char filter corresponding to CJKWidthFilter
> -------------------------------------------------
>
>                 Key: LUCENE-9413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9413
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tomoko Uchida
>            Priority: Minor
>
> In association with issues reported in Elasticsearch
> ([https://github.com/elastic/elasticsearch/issues/58384] and
> [https://github.com/elastic/elasticsearch/issues/58385]), this might be
> useful for the default Japanese analyzer.
> Although it is not a bug that FULL- and HALF-width characters are not
> normalized before tokenization, the behaviour sometimes confuses beginners
> and users who have limited knowledge of Japanese analysis (and Unicode).
> If we had a FULL- and HALF-width character normalization char filter in
> {{analyzers-common}}, we could include it in JapaneseAnalyzer (currently,
> JapaneseAnalyzer contains CJKWidthFilter, but it is applied after
> tokenization, so some FULL-width numbers or Latin alphabet runs are split
> by the tokenizer).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
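As a rough illustration of the width folding being discussed (a sketch, not Lucene's actual implementation): Java's built-in NFKC normalization performs the same kind of folding that CJKWidthFilter applies, mapping full-width ASCII variants to their half-width forms. Note that CJKWidthFilter implements only the width-related subset of NFKC, so this is an approximation of its behaviour, not a drop-in equivalent. The point of the issue is that this folding must happen *before* tokenization (as a char filter) for mixed-width input to tokenize as expected:

```java
import java.text.Normalizer;

public class WidthNormalizationDemo {
    public static void main(String[] args) {
        // Full-width Latin letters and digits, as often produced by
        // Japanese input methods.
        String fullWidth = "Ｌｕｃｅｎｅ９４１３";

        // NFKC folds full-width ASCII variants to their half-width
        // (plain ASCII) forms.
        String normalized = Normalizer.normalize(fullWidth, Normalizer.Form.NFKC);

        System.out.println(normalized); // prints "Lucene9413"
    }
}
```

Applied after tokenization (as a token filter), the tokenizer has already seen the full-width text and may have split it into separate tokens before any folding occurs; applied as a char filter, the tokenizer sees "Lucene9413" as a single run.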