[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

Tomoko Uchida (Jira) Sat, 20 Jun 2020 01:27:50 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141009#comment-17141009
 ]


Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

The mecab-ipadic dictionary has entries that include FULL width characters, 
so the naive approach of normalizing FULL / HALF width characters before 
tokenizing can break tokenization. :/

Maybe we could instead concatenate sequences of "unknown" words that consist 
only of numbers or Latin letters, after tokenization?
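
A rough sketch of that post-tokenization idea, only to make the proposal concrete. The Token class, its "unknown" flag, and the sample token sequence are hypothetical and are not actual tokenizer output or Lucene API; offsets and whitespace between tokens are ignored here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    surface: str
    unknown: bool  # hypothetical flag: True when no dictionary entry matched


def is_latin_or_digit(text: str) -> bool:
    """True if text is non-empty and holds only ASCII or FULL width letters/digits."""
    def ok(ch: str) -> bool:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:  # FULL width ASCII variants
            cp -= 0xFEE0            # shift to the basic Latin code point
        return cp < 128 and chr(cp).isalnum()
    return bool(text) and all(ok(ch) for ch in text)


def merge_unknown_alnum(tokens: List[Token]) -> List[Token]:
    """Concatenate runs of adjacent unknown tokens made of letters/digits only."""
    merged: List[Token] = []
    for tok in tokens:
        mergeable = tok.unknown and is_latin_or_digit(tok.surface)
        prev = merged[-1] if merged else None
        if mergeable and prev and prev.unknown and is_latin_or_digit(prev.surface):
            merged[-1] = Token(prev.surface + tok.surface, True)
        else:
            merged.append(tok)
    return merged


# Hypothetical token sequence, for illustration only.
tokens = [
    Token("ＡＢ", True), Token("Ｃ", True),
    Token("を", False),
    Token("９", True), Token("９", True),
]
surfaces = [t.surface for t in merge_unknown_alnum(tokens)]
print(surfaces)
```

Known (dictionary) tokens are left untouched in this sketch, so entries such as the ones listed below would keep their dictionary segmentation.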

{code}
$ cut -d',' -f1 mecab-ipadic-all-utf8.csv | grep １
１２月
１番
１１月
１月
１０月
Ｇ７プラス１
小１
高１
１つ
Ｆ１
中１
１１０番
Ｇ１
１
ファスニング２１
Ｇ１０
インパクト２１
アルゴテクノス２１
セルヴィ２１
モクネット２１
Ｕ１９
どさんこワイド２１２
西１５線北
北１３線
西１４線北
北１４線
西１０号南
南１条
東１１号北
東１２線北
西１１号北
駒場北１条通
東１線南
第１安井牧場
西１０号北
東１１線北
美旗町中１番
南２１線西
南１７線西
西１０線北
岩内町第１基線
北１５線
南１２線西
東１３線南
西１３線北
西１線北
南１６線西
西１０線南
西１６線北
西１１線北
西１２号北
西１１線南
東１０線北
北１線
東１線北
南１３号
南１４線西
南１線
北１１線
西１２線南
西１４線南
南１３線西
浦臼第１
西１３線南
東１０号北
南１９線西
北１条
南１１線西
平泉外１２入会
東１０線南
東１０号南
南１８線西
南１５線西
東１１号南
東１２号北
北１０線
駒場南１条通
南１番通
南１０線西
北１２線
西１線南
太田１の通り
東１１線南
西１２線北
東１２線南
大泉１区南部
Ｍ４０Ａ１
Ｆ１５戦闘機
ＤＦ３１
Ｆ１５
Ｇ１
辞林２１
Ｒ１２
Ｏ１５７
ＤＦ４１
スーパー３０１
ＧＰ１２５
北１３条東
Ｍ１Ａ２
アポロ１１号
{code}
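
To illustrate why folding widths first is a problem, here is a small sketch that folds a few of the surface forms listed above and checks whether the result still matches the original entry. The fold_width helper is an assumption for illustration: it only shifts FULL width ASCII variants to basic Latin and is not CJKWidthFilter itself. The "dictionary" is just a set of strings copied from the output above.

```python
# Surface forms copied from the grep output above.
entries = ["１２月", "Ｆ１", "Ｇ７プラス１", "アポロ１１号"]
dictionary = set(entries)


def fold_width(text: str) -> str:
    """Fold FULL width ASCII variants (U+FF01..U+FF5E) to basic Latin.

    Illustrative helper only; other characters are passed through unchanged.
    """
    return "".join(
        chr(ord(ch) - 0xFEE0) if 0xFF01 <= ord(ch) <= 0xFF5E else ch
        for ch in text
    )


for surface in entries:
    folded = fold_width(surface)
    # After folding, the text no longer matches the FULL width dictionary entry.
    print(surface, "->", folded, "| still in dictionary:", folded in dictionary)
```

Once the input is folded before tokenization, none of these surface forms can be looked up as written in the dictionary, which is the breakage described in this comment.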

> Add a char filter corresponding to CJKWidthFilter
> -------------------------------------------------
>
>                 Key: LUCENE-9413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9413
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tomoko Uchida
>            Priority: Minor
>
> In association with issues in Elasticsearch 
> ([https://github.com/elastic/elasticsearch/issues/58384] and 
> [https://github.com/elastic/elasticsearch/issues/58385]), this might be useful 
> for the default Japanese analyzer.
> Although I don't think it's a bug to not normalize FULL and HALF width 
> characters before tokenization, the behaviour sometimes confuses beginners or 
> users who have limited knowledge about Japanese analysis (and Unicode).
> If we had a FULL and HALF width character normalization char filter in 
> {{analyzers-common}}, we could include it in JapaneseAnalyzer (currently, 
> JapaneseAnalyzer contains CJKWidthFilter, but it is applied after tokenization, 
> so some FULL width numbers or Latin letters are split apart by the 
> tokenizer).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
