[
https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232119#comment-17232119
]
Tomoko Uchida commented on LUCENE-9413:
---------------------------------------
[https://github.com/apache/lucene-solr/pull/2081] adds CJKWidthCharFilter, which
is the exact counterpart of CJKWidthFilter. The char filter would be especially
useful for dictionary-based CJK analyzers, e.g. kuromoji.
[~rcmuir] what do you think - would you take a look at this?
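
For readers unfamiliar with width folding, here is a rough, self-contained
sketch of the kind of normalization involved. It is not the code from the pull
request and does not use the Lucene API; the class and method names
(WidthFoldSketch, foldFullWidthAscii) are hypothetical. It covers only the
full-width ASCII range, whereas CJKWidthFilter also folds half-width Katakana
into the equivalent kana, which is omitted here.

```java
public class WidthFoldSketch {

    // Full-width ASCII variants occupy U+FF01..U+FF5E and sit at a fixed
    // offset of 0xFEE0 from their Basic Latin counterparts (U+0021..U+007E).
    private static final int FULL_WIDTH_START = 0xFF01;
    private static final int FULL_WIDTH_END = 0xFF5E;
    private static final int OFFSET = 0xFEE0;

    // Hypothetical helper: returns a copy of the input with full-width ASCII
    // variants replaced by Basic Latin; all other characters are unchanged.
    static String foldFullWidthAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= FULL_WIDTH_START && c <= FULL_WIDTH_END) {
                out.append((char) (c - OFFSET));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Full-width "ABC123", written with escapes to avoid source-encoding issues.
        String fullWidth = "\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13";
        System.out.println(foldFullWidthAscii(fullWidth));
    }
}
```

Running this folding before tokenization means the tokenizer sees plain
"ABC123" rather than a run of full-width code points, which is the motivation
for a char filter as opposed to a token filter.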
> Add a char filter corresponding to CJKWidthFilter
> -------------------------------------------------
>
> Key: LUCENE-9413
> URL: https://issues.apache.org/jira/browse/LUCENE-9413
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Tomoko Uchida
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In association with issues in Elasticsearch
> ([https://github.com/elastic/elasticsearch/issues/58384] and
> [https://github.com/elastic/elasticsearch/issues/58385]), this char filter
> might be useful for the default Japanese analyzer.
> Although I don't think it's a bug to not normalize FULL and HALF width
> characters before tokenization, the behaviour sometimes confuses beginners or
> users who have limited knowledge about Japanese analysis (and Unicode).
> If we have a FULL and HALF width character normalization filter in
> {{analyzers-common}}, we can include it in JapaneseAnalyzer. (Currently,
> JapaneseAnalyzer contains CJKWidthFilter, but it is applied after
> tokenization, so some FULL width numbers or Latin letters are split apart by
> the tokenizer.)