[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

Namgyu Kim (JIRA) Thu, 23 May 2019 10:12:20 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846871#comment-16846871
 ]


Namgyu Kim commented on LUCENE-8784:
------------------------------------

Thank you for your reply, [~jim.ferenczi]!

Your approach looks awesome.
I developed KoreanNumberFilter by referring to JapaneseNumberFilter.
Please check my patch :D
(Use "git apply --whitespace=fix LUCENE-8784.patch", because the patch 
otherwise fails with a trailing whitespace error :()

I did not set KoreanNumberFilter as the default filter in KoreanAnalyzer.
By the way, wouldn't it be better to keep the constructors that do not take 
the discardPunctuation parameter?
(Otherwise, existing Nori users will have to modify their code after upgrading.)
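
A minimal sketch of what keeping the old constructors could look like. The
class SketchTokenizer and its accessor are hypothetical stand-ins for
illustration only, not the real Nori API or the code in the patch. The sketch
assumes the existing behavior (punctuation discarded) remains the default.

```java
// Hypothetical stand-in class; not the real Lucene KoreanTokenizer.
public class SketchTokenizer {
    private final boolean discardPunctuation;

    // Existing-style constructor, kept so current callers compile unchanged.
    // It delegates with discardPunctuation = true (assumed current default).
    public SketchTokenizer() {
        this(true);
    }

    // New constructor that exposes the extra option.
    public SketchTokenizer(boolean discardPunctuation) {
        this.discardPunctuation = discardPunctuation;
    }

    public boolean isDiscardPunctuation() {
        return discardPunctuation;
    }
}
```

With overloads like these, code written against the old constructor keeps
compiling, and only callers that want to keep punctuation need the new one.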


>  Nori(Korean) tokenizer removes the decimal point. 
> ---------------------------------------------------
>
>                 Key: LUCENE-8784
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8784
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Munkyu Im
>            Priority: Major
>         Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367].
> Unlike the standard analyzer, the nori analyzer removes the decimal point:
> the nori tokenizer discards the "." character by default.
> This makes it difficult to index keywords that include a decimal point.
> It would be nice to have an option that controls whether the decimal point 
> is kept. Like the Japanese tokenizer, Nori needs an option to preserve the 
> decimal point.
>  
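
The behavior described in the issue, and the general idea of keeping
punctuation so that a later step can rebuild numbers, can be illustrated with
a toy example. This is only a sketch under stated assumptions: the class
DecimalPointSketch and its methods are hypothetical, use plain Java, and do
not reproduce the Nori tokenizer, KoreanNumberFilter, or the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy code; it only mimics the effect described in the issue.
public class DecimalPointSketch {

    // Splits on whitespace and punctuation. When discardPunctuation is false,
    // each punctuation character is emitted as its own token.
    static List<String> tokenize(String text, boolean discardPunctuation) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
                continue;
            }
            if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (!discardPunctuation && !Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Merges each "digits", ".", "digits" run back into one number token.
    static List<String> mergeNumbers(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 2 < tokens.size()
                    && isDigits(tokens.get(i))
                    && tokens.get(i + 1).equals(".")
                    && isDigits(tokens.get(i + 2))) {
                out.add(tokens.get(i) + "." + tokens.get(i + 2));
                i += 3;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    static boolean isDigits(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isDigit);
    }
}
```

In this sketch, tokenizing "pi 3.14" with punctuation discarded yields the
tokens "pi", "3", "14", so the decimal point is lost. Keeping punctuation and
then calling mergeNumbers yields "pi", "3.14".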



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
