[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

Christian Moen (JIRA) Mon, 30 Jul 2012 20:37:40 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425488#comment-13425488
 ]


Christian Moen commented on LUCENE-3922:
----------------------------------------

I've attached a work-in-progress patch for {{trunk}} that implements a 
{{CharFilter}} that normalizes Japanese numbers.

These are some TODOs and implementation considerations I have that I'd be 
thankful to get feedback on:

* Buffering the entire input on the first read should be avoided.  The primary
reason this is done is that I was thinking of adding regexps before and
after kanji numeric strings to qualify their normalization, e.g. to only
normalize strings that start with ￥ or JPY, or end with 円, so that only
monetary amounts in Japanese yen are normalized.  However, this probably isn't
necessary, as we should be able to use {{Matcher.requireEnd()}} and
{{Matcher.hitEnd()}} to decide if we need to read more input. (Thanks, Robert!)

* Is qualifying the numbers to be normalized with prefix and suffix regexps 
useful, i.e. to only normalize monetary amounts?

* How do we deal with leading zeros?  Currently, "007" and "◯◯七" both become
"7".  Do we want an option to preserve leading zeros?

* How large are the numbers we care about supporting?  Some of the characters
for larger numbers are surrogate pairs, which complicates the implementation,
but supporting them is certainly possible.  If we don't care about really large
numbers, we can probably get by with {{long}} instead of {{BigInteger}}.

* Polite numbers and some other variants, e.g. 壱, 弐 and 参, aren't supported,
but they can easily be added.  We can also add the obsolete variants.  Are
these useful?  Do we want them available via an option?

* Number formats such as "１億２，３４５万６，７８９" aren't supported: we don't deal
with the comma today, but this can be added.  The same applies to "１２　３４５",
where a space separates thousands as in French.  Numbers like "2・2兆" aren't
supported either, but can be added.

* Only integers are supported today, so we can't parse "〇・一二三四", which becomes
"0" and "1234" as separate tokens instead of "0.1234".

There are probably other considerations, too, that don't immediately come
to mind.
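
To make the {{Matcher.hitEnd()}} idea from the first point concrete, here is a
minimal sketch of how a filter might decide whether it needs to read more input
before normalizing a numeric run.  It is not taken from the attached patch: the
class name, the {{needsMoreInput}} helper and both patterns are hypothetical,
and only {{region()}}, {{lookingAt()}}, {{hitEnd()}} and {{requireEnd()}} are
actual {{java.util.regex.Matcher}} methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HitEndSketch {
    // Hypothetical patterns, for illustration only.
    static final Pattern KANJI_NUMBER =
        Pattern.compile("[〇一二三四五六七八九十百千万億兆]+");
    static final Pattern YEN_AMOUNT =
        Pattern.compile("[〇一二三四五六七八九十百千万億兆]+円");

    /**
     * Tries to match the pattern at offset start of the text buffered so far
     * and reports whether text that has not been read yet could change the
     * outcome.
     */
    static boolean needsMoreInput(Pattern pattern, CharSequence buffered, int start) {
        Matcher m = pattern.matcher(buffered);
        m.region(start, buffered.length());
        boolean matched = m.lookingAt();
        // hitEnd(): the matcher ran into the end of the buffered text, so more
        // input could extend a match or turn a failure into a match.
        // requireEnd(): a match was found, but more input could invalidate it.
        return m.hitEnd() || (matched && m.requireEnd());
    }
}
```

With these patterns, a buffer ending in "三千" reports that the run may continue
in the next chunk, whereas for "三千円です" the amount is complete and could be
normalized without reading further.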
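
For the leading-zero and number-size questions, a sketch of the kind of parsing
involved may help frame the trade-offs.  It is an illustration under stated
assumptions rather than the patch's implementation: {{KanjiNumberSketch}},
{{parse}} and {{normalize}} are hypothetical names, only 万, 億, 兆 and 京 are
handled as large units, separators are not handled, and overflow of {{long}} is
reported via {{Math.multiplyExact}}/{{Math.addExact}} (Java 8+) instead of
falling back to {{BigInteger}}.

```java
final class KanjiNumberSketch {
    // The index of each character is its digit value (0 to 9).
    private static final String KANJI_DIGITS = "〇一二三四五六七八九";
    private static final String SMALL_UNITS = "十百千";
    private static final long[] SMALL_VALUES = {10L, 100L, 1000L};
    private static final String LARGE_UNITS = "万億兆京";
    private static final long[] LARGE_VALUES = {
        10_000L, 100_000_000L, 1_000_000_000_000L, 10_000_000_000_000_000L};

    /** Returns 0 to 9 for ASCII, full-width and kanji digits, or -1 otherwise. */
    static int digitValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= '０' && c <= '９') return c - '０';
        if (c == '◯') return 0; // large circle written in place of 〇
        return KANJI_DIGITS.indexOf(c);
    }

    /** Parses a non-negative integer; throws ArithmeticException on long overflow. */
    static long parse(String s) {
        long total = 0;   // completed 万/億/兆/京 groups
        long section = 0; // 十/百/千 part of the current group
        long digits = 0;  // current run of plain digits, read positionally
        boolean sawDigit = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int d = digitValue(c);
            int small = SMALL_UNITS.indexOf(c);
            int large = LARGE_UNITS.indexOf(c);
            if (d >= 0) {
                digits = Math.addExact(Math.multiplyExact(digits, 10L), (long) d);
                sawDigit = true;
            } else if (small >= 0) {
                // A bare unit such as 十 counts as 1 x 10.
                long factor = sawDigit ? digits : 1L;
                section = Math.addExact(section,
                    Math.multiplyExact(factor, SMALL_VALUES[small]));
                digits = 0;
                sawDigit = false;
            } else if (large >= 0) {
                long group = Math.addExact(section, digits);
                total = Math.addExact(total,
                    Math.multiplyExact(group, LARGE_VALUES[large]));
                section = 0;
                digits = 0;
                sawDigit = false;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        return Math.addExact(total, Math.addExact(section, digits));
    }

    /** Normalizes to ASCII digits, optionally keeping leading zeros. */
    static String normalize(String s, boolean preserveLeadingZeros) {
        String value = Long.toString(parse(s));
        if (!preserveLeadingZeros) return value;
        int zeros = 0;
        // Count leading zero digits, always leaving the last character to parse().
        while (zeros < s.length() - 1 && digitValue(s.charAt(zeros)) == 0) zeros++;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < zeros; i++) sb.append('0');
        return sb.append(value).toString();
    }
}
```

With {{preserveLeadingZeros}} set, "◯◯七" normalizes to "007" rather than "7".
Since {{Long.MAX_VALUE}} is roughly 9.2 × 10^18, 京 (10^16) is the largest
unit that still fits in a {{long}}; anything from 垓 (10^20) upwards would need
{{BigInteger}}.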

Numbers are fairly complicated and feedback on direction for further 
implementation is most appreciated.  Thanks.
                
> Add Japanese Kanji number normalization to Kuromoji
> ---------------------------------------------------
>
>                 Key: LUCENE-3922
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3922
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.0-ALPHA
>            Reporter: Kazuaki Hiraga
>              Labels: features
>         Attachments: LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
> 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
> numerals (I don't think we need to have a capability to normalize to Kanji 
> numerals).
>  

