[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233486#comment-17233486 ]
ASF subversion and git services commented on LUCENE-9413:
---------------------------------------------------------

Commit 26b55463ffd4afe159d4aaee9713c766938e4276 in lucene-solr's branch refs/heads/branch_8x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=26b5546 ]

LUCENE-9413: Add CJKWidthCharFilter and its factory. (#2081)

> Add a char filter corresponding to CJKWidthFilter
> -------------------------------------------------
>
>                 Key: LUCENE-9413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9413
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tomoko Uchida
>            Priority: Minor
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In association with issues reported in Elasticsearch
> ([https://github.com/elastic/elasticsearch/issues/58384] and
> [https://github.com/elastic/elasticsearch/issues/58385]), it might be useful
> for the default Japanese analyzer.
> Although I don't think it is a bug not to normalize FULL- and HALF-width
> characters before tokenization, the behaviour sometimes confuses beginners and
> users who have limited knowledge of Japanese analysis (and Unicode).
> If we had a FULL- and HALF-width character normalization char filter in
> {{analyzers-common}}, we could include it in JapaneseAnalyzer (currently,
> JapaneseAnalyzer contains CJKWidthFilter, but it is applied after tokenization,
> so some FULL-width numbers and Latin letters are split by the tokenizer).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
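As a rough illustration of the width normalization the issue describes (not the Lucene code itself), the FULL-/HALF-width folding that CJKWidthFilter and the new CJKWidthCharFilter perform is close to what Unicode NFKC does for width-variant characters: FULL-width ASCII is folded to half-width, and half-width katakana is folded to full-width. A minimal Python sketch, assuming NFKC as a stand-in for the filter's mapping:

```python
import unicodedata

def width_normalize(text: str) -> str:
    """Fold FULL-width ASCII to half-width and half-width katakana to
    full-width via NFKC -- an approximation of CJKWidthCharFilter's
    mapping, for illustration only."""
    return unicodedata.normalize("NFKC", text)

# FULL-width "Lucene9413": a tokenizer seeing the raw text may split it
# differently than the half-width form, which is why normalizing as a
# *char filter* (before tokenization) matters.
print(width_normalize("Ｌｕｃｅｎｅ９４１３"))  # -> Lucene9413

# Half-width katakana "ka" + voiced sound mark composes to full-width ガ.
print(width_normalize("ｶﾞ"))  # -> ガ
```

This also shows the ordering point from the issue: running the same mapping as a TokenFilter (after tokenization) cannot rejoin a FULL-width run that the tokenizer has already split, whereas a CharFilter normalizes the input before the tokenizer ever sees it.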