[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425488#comment-13425488 ]
Christian Moen commented on LUCENE-3922:
----------------------------------------

I've attached a work-in-progress patch for {{trunk}} that implements a {{CharFilter}} that normalizes Japanese numbers.

These are some TODOs and implementation considerations that I'd be thankful to get feedback on:

* Buffering the entire input on the first read should be avoided. The primary reason it is done is that I was thinking of adding regexps before and after kanji numeric strings to qualify their normalization, e.g. to only normalize strings that start with ¥ or JPY, or end with 円, so that only monetary amounts in Japanese yen are normalized. However, this probably isn't necessary, as we can probably use {{Matcher.requireEnd()}} and {{Matcher.hitEnd()}} to decide if we need to read more input. (Thanks, Robert!)
* Is qualifying the numbers to be normalized with prefix and suffix regexps useful, e.g. to only normalize monetary amounts?
* How do we deal with leading zeros? Currently, "007" and "◯◯七" both become "7". Do we want an option to preserve leading zeros?
* How large are the numbers we care about supporting? Some of the larger numbers are surrogates, which complicates implementation, but they're certainly possible. If we don't care about really large numbers, we can probably be fine working with {{long}} instead of {{BigInteger}}.
* Polite numbers and some other variants aren't supported, e.g. 壱, 弐, 参, etc., but they can easily be added. We can also add the obsolete variants if that's useful somehow. Are these useful? Do we want them available via an option?
* Number formats such as "1億2,345万6,789" aren't supported, since we don't deal with the comma today, but this can be added. The same applies to "12 345", where a space separates thousands as in French. Numbers like "2・2兆" aren't supported either, but can be added.
* Only integers are supported today, so we can't parse "〇・一二三四", which becomes "0" and "1234" as separate tokens instead of "0.1234".

There are probably other considerations, too, that don't immediately come to mind. Numbers are fairly complicated, and feedback on direction for further implementation is most appreciated.

Thanks.

> Add Japanese Kanji number normalization to Kuromoji
> ---------------------------------------------------
>
>                 Key: LUCENE-3922
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3922
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.0-ALPHA
>            Reporter: Kazuaki Hiraga
>              Labels: features
>         Attachments: LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and
> 十二月 (December). So, we would like to normalize those Kanji numerals to Arabic
> numerals (I don't think we need to have a capability to normalize to Kanji
> numerals).
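
To make the integer-only behaviour discussed above concrete, here is a minimal sketch of kanji-to-Arabic conversion. It is not the code in the attached patch and is not a {{CharFilter}}; the class name {{KanjiNumberSketch}} and the method {{parse}} are hypothetical names chosen for illustration. It assumes plain {{long}} arithmetic, supports units only up to 兆, drops leading zeros (so "007" and "◯◯七" both give 7), and rejects commas, spaces and "・" rather than handling them. Non-ASCII characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
public final class KanjiNumberSketch {

    // Kanji digits zero through nine: U+3007, then U+4E00 (one) ... U+4E5D (nine).
    private static final String KANJI_DIGITS =
        "\u3007\u4E00\u4E8C\u4E09\u56DB\u4E94\u516D\u4E03\u516B\u4E5D";

    private KanjiNumberSketch() {}

    /** Returns 0-9 for ASCII, full-width or kanji digits, or -1 otherwise. */
    private static int digitValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= '\uFF10' && c <= '\uFF19') return c - '\uFF10'; // full-width digits
        if (c == '\u25EF') return 0;                             // large circle used as zero
        return KANJI_DIGITS.indexOf(c);
    }

    /** Units that multiply within a group of four digits; 0 if not such a unit. */
    private static long smallUnit(char c) {
        switch (c) {
            case '\u5341': return 10L;    // ten
            case '\u767E': return 100L;   // hundred
            case '\u5343': return 1000L;  // thousand
            default: return 0L;
        }
    }

    /** Units that close a group of four digits; 0 if not such a unit. */
    private static long largeUnit(char c) {
        switch (c) {
            case '\u4E07': return 10_000L;                // man, 10^4
            case '\u5104': return 100_000_000L;           // oku, 10^8
            case '\u5146': return 1_000_000_000_000L;     // cho, 10^12
            default: return 0L;
        }
    }

    /** Parses a non-negative integer written with digits and kanji units. */
    public static long parse(String text) {
        long total = 0;    // value of the groups already closed by a large unit
        long section = 0;  // value of the current group (below 10^4 when well-formed)
        long current = 0;  // run of digits not yet multiplied by a unit
        boolean sawDigit = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int d = digitValue(c);
            long small = smallUnit(c);
            long large = largeUnit(c);
            if (d >= 0) {
                current = current * 10 + d;
                sawDigit = true;
            } else if (small != 0) {
                // A bare unit such as "ten" with no preceding digit counts once.
                section += (sawDigit ? current : 1) * small;
                current = 0;
                sawDigit = false;
            } else if (large != 0) {
                section += current;
                total += section * large;
                section = 0;
                current = 0;
                sawDigit = false;
            } else {
                throw new IllegalArgumentException(
                    "Unsupported character at index " + i + ": " + c);
            }
        }
        return total + section + current;
    }
}
```

With this sketch, "十二" parses to 12, "12万4800" to 124800, and "二千三百四十五" to 2345. Supporting units beyond 10^18 would require {{BigInteger}}, and those characters being surrogate pairs means iterating by code point instead of by {{char}}.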