[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148068#comment-17148068 ]
Jim Ferenczi commented on LUCENE-9413:
--------------------------------------

+1, I like the idea. Currently we ask users to install the ICU normalizer, but it would be nice to have a simple char filter in core to apply the normalization. In essence, this is similar to https://issues.apache.org/jira/browse/LUCENE-8972 but with a more contained scope.

> The mecab-ipadic dictionary has entries which include FULL width characters,
> so this naive approach - FULL / HALF width character normalization before
> tokenizing can break tokenization. :/

I think that's an acceptable trade-off; these entries with full width characters don't seem to be high quality anyway ;).

> Add a char filter corresponding to CJKWidthFilter
> -------------------------------------------------
>
>                 Key: LUCENE-9413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9413
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tomoko Uchida
>            Priority: Minor
>
> In association with issues in Elasticsearch
> ([https://github.com/elastic/elasticsearch/issues/58384] and
> [https://github.com/elastic/elasticsearch/issues/58385]), such a char filter
> might be useful for the default Japanese analyzer.
> Although I don't think it's a bug to not normalize FULL and HALF width
> characters before tokenization, the behaviour sometimes confuses beginners or
> users who have limited knowledge about Japanese analysis (and Unicode).
> If we have a FULL and HALF width character normalization filter in
> {{analyzers-common}}, we can include it in JapaneseAnalyzer (currently,
> JapaneseAnalyzer contains CJKWidthFilter, but it is applied after tokenization,
> so some FULL width numbers or Latin letters are split apart by the
> tokenizer).
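
To make the proposal concrete, here is a minimal sketch of the width folding that would run on the raw text before the tokenizer sees it. It is plain Java with no Lucene dependency. The class and method names (WidthFoldSketch, foldFullWidthAscii) are hypothetical and are not part of any Lucene API. It only folds the fullwidth ASCII variants (U+FF01 to U+FF5E) onto Basic Latin, which is a fixed offset of 0xFEE0, and leaves half-width katakana untouched. A real char filter would also need to extend Lucene's CharFilter and handle offset correction, which this sketch does not attempt.

```java
public final class WidthFoldSketch {
    private WidthFoldSketch() {}

    // Offset between the fullwidth ASCII variants (U+FF01..U+FF5E)
    // and their Basic Latin counterparts (U+0021..U+007E).
    private static final int FULLWIDTH_OFFSET = 0xFEE0;

    /**
     * Hypothetical helper: folds fullwidth ASCII variants to Basic Latin.
     * All other characters (kanji, kana, plain ASCII) pass through unchanged.
     * The mapping is one-to-one, so the text length stays the same.
     */
    public static String foldFullWidthAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 0xFF01 && c <= 0xFF5E) {
                out.append((char) (c - FULLWIDTH_OFFSET));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fullwidth "Lucene", an ordinary space, then fullwidth "9",
        // written with Unicode escapes to keep the source ASCII-only.
        String fullWidth = "\uFF2C\uFF55\uFF43\uFF45\uFF4E\uFF45 \uFF19";
        System.out.println(foldFullWidthAscii(fullWidth)); // prints "Lucene 9"
    }
}
```

Because the folding happens before tokenization, a fullwidth number or Latin word would reach the tokenizer as ordinary ASCII. The flip side is the trade-off noted above: dictionary entries that themselves contain fullwidth characters would no longer match.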