[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

Jim Ferenczi (Jira) Mon, 29 Jun 2020 11:33:47 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148068#comment-17148068
 ]


Jim Ferenczi commented on LUCENE-9413:
--------------------------------------

+1, I like the idea. Currently we ask users to install the ICU normalizer, but 
it would be nice to have a simple char filter in core to apply the 
normalization. In essence, this is similar to 
https://issues.apache.org/jira/browse/LUCENE-8972 but with a more contained 
scope.

 

> The mecab-ipadic dictionary has entries which include FULL width characters, 
> so this naive approach - FULL / HALF width character normalization before 
> tokenizing - can break tokenization. :/

 

I think that's an acceptable trade-off; these entries with full-width 
characters don't seem to be high quality anyway ;).
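
To make the normalization concrete, here is a rough sketch in plain Python, not Lucene code. The function names are made up for illustration, and using NFKC as a stand-in is an assumption: it folds fullwidth ASCII and halfwidth katakana like CJKWidthFilter does, but it also applies other compatibility mappings, so it is not an exact equivalent of the proposed char filter.

```python
import unicodedata

# Offset between a fullwidth ASCII variant (U+FF01..U+FF5E) and its
# Basic Latin counterpart (U+0021..U+007E).
FULLWIDTH_OFFSET = 0xFEE0


def fold_fullwidth_ascii(text: str) -> str:
    """Hypothetical helper: map fullwidth ASCII variants to Basic Latin.

    Every other character is passed through unchanged.
    """
    return "".join(
        chr(ord(ch) - FULLWIDTH_OFFSET) if 0xFF01 <= ord(ch) <= 0xFF5E else ch
        for ch in text
    )


def normalize_width(text: str) -> str:
    """Hypothetical stand-in for the char filter, approximated with NFKC.

    NFKC folds fullwidth ASCII to Basic Latin and halfwidth katakana to
    regular katakana, plus other compatibility mappings.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Text as it might arrive before tokenization.
    print(fold_fullwidth_ascii("ＬＵＣＥＮＥ９４１３"))
    print(normalize_width("ｶﾀｶﾅ"))
```

Running this kind of step before the tokenizer means the tokenizer only ever sees the folded forms, which is also why dictionary entries that themselves contain full-width characters would no longer match.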

> Add a char filter corresponding to CJKWidthFilter
> -------------------------------------------------
>
>                 Key: LUCENE-9413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9413
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tomoko Uchida
>            Priority: Minor
>
> In association with issues in Elasticsearch 
> ([https://github.com/elastic/elasticsearch/issues/58384] and 
> [https://github.com/elastic/elasticsearch/issues/58385]), a char filter 
> corresponding to CJKWidthFilter might be useful for the default Japanese 
> analyzer.
> Although I don't think it's a bug to not normalize FULL and HALF width 
> characters before tokenization, the behaviour sometimes confuses beginners or 
> users who have limited knowledge about Japanese analysis (and Unicode).
> If we had a FULL and HALF width character normalization filter in 
> {{analyzers-common}}, we could include it in JapaneseAnalyzer (currently, 
> JapaneseAnalyzer contains CJKWidthFilter, but it is applied after 
> tokenization, so some FULL width numbers or Latin letters are split by the 
> tokenizer).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
