[jira] [Created] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
Kazuaki Hiraga created LUCENE-4056:
--

Summary: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
Key: LUCENE-4056
URL: https://issues.apache.org/jira/browse/LUCENE-4056
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Affects Versions: 3.6
Environment: Solr 3.6
             UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
Reporter: Kazuaki Hiraga

I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I think UniDic is a better dictionary than the IPA dictionary, so Kuromoji for Lucene/Solr should support the UniDic dictionary as standalone Kuromoji does.

My procedure was as follows: I modified build.xml under the lucene/contrib/analyzers/kuromoji directory and ran 'ant build-dict', which produced the error below.

build-dict:
     [java] dictionary builder
     [java]
     [java] dictionary format: UNIDIC
     [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
     [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
     [java] input encoding: utf-8
     [java] normalize entries: false
     [java]
     [java] building tokeninfo dict...
     [java]   parse...
     [java]   sort...
     [java] Exception in thread "main" java.lang.AssertionError
     [java]   encode...
     [java]     at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
     [java]     at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
     [java]     at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
     [java]     at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
     [java]     at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)

And the diff of build.xml:

===
--- build.xml (revision 1338023)
+++ build.xml (working copy)
@@ -28,19 +28,31 @@
   <property name="maven.dist.dir" location="../../../dist/maven" />
   <!-- default configuration: uses mecab-ipadic -->
+  <!--
   <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
   <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
   <property name="dict.url" value="http://mecab.googlecode.com/files/${dict.src.file}"/>
+  -->
   <!-- alternative configuration: uses mecab-naist-jdic
   <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
   <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
   <property name="dict.url" value="http://sourceforge.jp/frs/redir.php?m=iij&amp;f=/naist-jdic/53500/${dict.src.file}"/>
   -->
-
+
+  <!-- alternative configuration: uses UniDic -->
+  <property name="ipadic.version" value="unidic-mecab1312src" />
+  <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
+  <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
+  <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
+  <!--
   <property name="dict.encoding" value="euc-jp"/>
   <property name="dict.format" value="ipadic"/>
+  -->
+  <property name="dict.encoding" value="utf-8"/>
+  <property name="dict.format" value="unidic"/>
+  <property name="dict.normalize" value="false"/>
   <property name="dict.target.dir" location="./src/resources"/>
@@ -58,7 +70,8 @@
   <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
   <target name="download-dict" unless="dict.available">
-    <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
+    <!-- <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/> -->
+    <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
     <gunzip src="${build.dir}/${dict.src.file}"/>
     <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
   </target>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
[ https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276429#comment-13276429 ]

Kazuaki Hiraga commented on LUCENE-4056:

Hi Christian,

Thank you for your comment. I understand the situation. I didn't expect UniDic to be bundled and shipped with Kuromoji. For the time being, I just want to build it and use it with Kuromoji for Lucene/Solr.

We have just started evaluating UniDic and it is at a very early stage, so we have not yet concluded that we have to, or need to, use UniDic instead of the IPA dictionary. Although we haven't finished our evaluation, I like UniDic's concept and policy of strictly defining how tokens are specified, and I am satisfied with the tokenization results. I think it is better than the IPA dictionary for Katakana segmentation and compound segmentation.

On the other hand, I understand there is a license issue that we have to resolve if we decide to use it in our internal services. Thanks for reminding me.

Thanks.
[jira] [Created] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
Kazuaki Hiraga created SOLR-3524:

Summary: Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
Key: SOLR-3524
URL: https://issues.apache.org/jira/browse/SOLR-3524
Project: Solr
Issue Type: Improvement
Components: Schema and Analysis
Affects Versions: 3.6
Reporter: Kazuaki Hiraga
Priority: Minor

JapaneseTokenizer (Kuromoji) doesn't provide a configuration option to preserve punctuation in Japanese text, although it has a parameter to change this behavior. JapaneseTokenizerFactory always sets the third parameter, which controls this behavior, to true to remove punctuation. I would like to have an option to configure this behavior in the fieldtype definition in schema.xml.
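The request above amounts to reading one optional boolean from the factory's attributes instead of hard-coding true. The following is a minimal sketch of that idea only, not the attached patch: the attribute name `discardPunctuation` and the class `DiscardPunctuationOption` are assumptions made for illustration, and a real factory would pass the resolved value on as the tokenizer's third parameter mentioned in the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: reads a boolean option from a factory's attribute map. */
final class DiscardPunctuationOption {
    // Assumed attribute name; the issue only asks for "an option".
    static final String KEY = "discardPunctuation";

    /** Returns the configured value, or true (the current fixed behavior) when absent. */
    static boolean resolve(Map<String, String> attributes) {
        String raw = attributes.get(KEY);
        return raw == null ? true : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] unused) {
        Map<String, String> attributes = new HashMap<>();
        System.out.println(resolve(attributes)); // option absent: punctuation discarded
        attributes.put(KEY, "false");
        System.out.println(resolve(attributes)); // option set: punctuation preserved
    }
}
```

Defaulting to true keeps existing schemas behaving as before, so only field types that set the attribute would see a change.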
[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291836#comment-13291836 ]

Kazuaki Hiraga commented on SOLR-3524:

Thank you, guys!

Christian, some documents contain keywords that consist of letters and punctuation, such as c++ and c#, and we want to match those keywords in their unchanged form. Of course, we will discard punctuation in many cases, but in some cases, especially with short text, we want to preserve it. Therefore, I would like an option to control this behaviour.

Ohtani-san, thank you for your quick reply and patch!

Key: SOLR-3524
Attachments: SOLR-3524.patch, kuromoji_discard_punctuation.patch.txt
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471068#comment-13471068 ]

Kazuaki Hiraga commented on LUCENE-3922:

Sorry for the late reply. Although I have some requests to improve its capability, this is a very helpful and nice charfilter for me. Thank you, Christian!!

My requests are the following:

Is it difficult to support numbers with a decimal point, such as the following?
3.2兆円
5.2億円

On the other hand, I agree with Christian about not preserving leading zeros. So, ◯◯七 doesn't need to become 007.

I think it would be helpful if this charfilter supported old Kanji numeric characters (KYU-KANJI or DAIJI) such as 壱, 壹, 弌 (one), 弐, 貳, 弍 (two), 参, 參 (three), or if this were configurable.

Add Japanese Kanji number normalization to Kuromoji
---
Key: LUCENE-3922
URL: https://issues.apache.org/jira/browse/LUCENE-3922
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Labels: features
Attachments: LUCENE-3922.patch

Japanese people use Kanji numerals instead of Arabic numerals when writing prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 十二月 (December). So, we would like to normalize those Kanji numerals to Arabic numerals (I don't think we need a capability to normalize to Kanji numerals).
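To make the requests above concrete (decimal points before large units, dropped leading zeros, old-form digits), here is a rough standalone sketch of the arithmetic. It is not the charfilter from the attached patch; the class `KanjiNumberSketch` and its methods are hypothetical, and it accepts only strings that consist entirely of numerals. Characters are written as Unicode escapes so the file compiles regardless of source encoding: U+4E07 is 万, U+5104 is 億, U+5146 is 兆, and the old forms are the ones listed in the comment.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: converts a string made purely of numerals into a decimal value. */
final class KanjiNumberSketch {
    private static final Map<Character, Integer> DIGITS = new HashMap<>();
    private static final Map<Character, BigDecimal> SMALL_UNITS = new HashMap<>();
    private static final Map<Character, BigDecimal> LARGE_UNITS = new HashMap<>();

    static {
        // U+3007 (zero) followed by the Kanji digits one to nine, in order.
        String digits = "\u3007\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d";
        for (int i = 0; i < digits.length(); i++) {
            DIGITS.put(digits.charAt(i), i);
        }
        DIGITS.put('\u25ef', 0); // large circle, as written in the comment above
        for (char c : "\u58f1\u58f9\u5f0c".toCharArray()) DIGITS.put(c, 1); // old forms of one
        for (char c : "\u5f10\u8cb3\u5f0d".toCharArray()) DIGITS.put(c, 2); // old forms of two
        for (char c : "\u53c2\u53c3".toCharArray()) DIGITS.put(c, 3);       // old forms of three
        SMALL_UNITS.put('\u5341', new BigDecimal("10"));
        SMALL_UNITS.put('\u767e', new BigDecimal("100"));
        SMALL_UNITS.put('\u5343', new BigDecimal("1000"));
        LARGE_UNITS.put('\u4e07', new BigDecimal("10000"));         // 10^4
        LARGE_UNITS.put('\u5104', new BigDecimal("100000000"));     // 10^8
        LARGE_UNITS.put('\u5146', new BigDecimal("1000000000000")); // 10^12
    }

    static BigDecimal parse(String text) {
        BigDecimal total = BigDecimal.ZERO;   // sum of completed large-unit groups
        BigDecimal section = BigDecimal.ZERO; // value accumulated below the next large unit
        StringBuilder run = new StringBuilder(); // pending positional digits, e.g. "4800" or "3.2"
        for (char c : text.toCharArray()) {
            if ((c >= '0' && c <= '9') || c == '.') {
                run.append(c);
            } else if (DIGITS.containsKey(c)) {
                run.append(DIGITS.get(c).intValue());
            } else if (SMALL_UNITS.containsKey(c)) {
                // A bare unit such as "ten" counts as one times the unit.
                BigDecimal n = run.length() == 0 ? BigDecimal.ONE : new BigDecimal(run.toString());
                section = section.add(n.multiply(SMALL_UNITS.get(c)));
                run.setLength(0);
            } else if (LARGE_UNITS.containsKey(c)) {
                if (run.length() > 0) section = section.add(new BigDecimal(run.toString()));
                total = total.add(section.multiply(LARGE_UNITS.get(c)));
                section = BigDecimal.ZERO;
                run.setLength(0);
            } else {
                throw new IllegalArgumentException("not a numeral: " + c);
            }
        }
        if (run.length() > 0) section = section.add(new BigDecimal(run.toString()));
        return total.add(section);
    }

    /** Plain decimal string without leading zeros or a trailing ".0". */
    static String normalize(String text) {
        return parse(text).stripTrailingZeros().toPlainString();
    }
}
```

Using `BigDecimal` keeps fractional inputs such as 3.2 exact when they are scaled by a large unit, and parsing the digit run as a number is what drops leading zeros.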
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471123#comment-13471123 ]

Kazuaki Hiraga commented on LUCENE-3922:

Lance, you may be right. Although I have never seen Japanese people use Kanji numbers for James Bond movies :-), I can't say that we never use Kanji for that kind of expression.

Christian, is it possible to choose whether or not to preserve leading zeros?
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474210#comment-13474210 ]

Kazuaki Hiraga commented on LUCENE-3922:

The following examples are false positives:

姿三四郎 became 姿, 34, 郎
小林一茶 became 小林, 1, 茶
鈴木一郎 became 鈴木, 1, 郎

Can we prevent this behavior?
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474257#comment-13474257 ]

Kazuaki Hiraga commented on LUCENE-3922:

Hi Christian, That what I am thinking. I think TokenFilter would be a good choice to implement that feature. We can use POS tag to recognize what a token is. We can apply normalization if a token is a numeral prefix/suffix with numerals.
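The idea in this comment can be sketched as a filter step that rewrites a token only when its part-of-speech label marks it as a numeral, so a name the tokenizer keeps as a single proper noun is never touched. This is a simplified illustration under stated assumptions: the `Token` class, the label `NUMERAL`, and the class `PosGatedNormalizer` are all hypothetical, the labels are placeholders rather than real Kuromoji POS tags, and the prefix/suffix condition from the comment is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Illustrative only: normalizes tokens whose POS label says they are numerals. */
final class PosGatedNormalizer {
    /** Hypothetical token: surface text plus a part-of-speech label. */
    static final class Token {
        final String surface;
        final String pos;

        Token(String surface, String pos) {
            this.surface = surface;
            this.pos = pos;
        }
    }

    // Placeholder label, not an actual tag emitted by the tokenizer.
    static final String NUMERAL = "NUMERAL";

    /** Applies the given normalizer to numeral tokens and passes everything else through. */
    static List<Token> apply(List<Token> tokens, UnaryOperator<String> normalizer) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            if (NUMERAL.equals(t.pos)) {
                out.add(new Token(normalizer.apply(t.surface), t.pos));
            } else {
                out.add(t); // names and other words are left unchanged
            }
        }
        return out;
    }
}
```

Working on tokens rather than raw characters is what makes the gate possible: a character-level pass has no label to consult, which is how the false positives reported earlier in the thread arise.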
[jira] [Comment Edited] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474257#comment-13474257 ]

Kazuaki Hiraga edited comment on LUCENE-3922 at 10/12/12 2:16 PM:

Hi Christian, That's what I am thinking. I think TokenFilter would be a good choice to implement that feature. We can use POS tag to recognize what a token is. We can apply normalization if a token is a numeral prefix/suffix with numerals.

was (Author: h.kazuaki):
Hi Christian, That what I am thinking. I think TokenFilter would be a good choice to implement that feature. We can use POS tag to recognize what a token is. We can apply normalization if a token is a numeral prefix/suffix with numerals.
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475016#comment-13475016 ]

Kazuaki Hiraga commented on LUCENE-3922:

It would be nice if we could choose whether to expand them or normalize them. I am concerned that Solr's query-side synonym expansion doesn't work well when the number of tokens differs between the original tokens and the synonym tokens; phrase matching with query-side synonym expansion in particular would be a disaster. (Of course, reduction or index-side expansion would be better, but we sometimes need to use a TokenFilter that provides such a capability on the query side.) So, I would like to be able to configure whether Kanji numerals are normalized to Arabic numerals or Arabic numerals are stored along with the Kanji numerals.
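The two configurations asked for here can be sketched as an output mode: either the Arabic form replaces the Kanji numeral, or both forms are emitted with the second one stacked on the same position (a position increment of 0), so the number of positions stays the same in both modes. The names `NumeralOutputMode`, `Mode` and `Token` are hypothetical and only illustrate the choice; nothing in the thread specifies this interface.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: chooses between replacing a numeral and stacking both forms. */
final class NumeralOutputMode {
    enum Mode { NORMALIZE, EXPAND }

    /** Hypothetical token: term text plus its position increment. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    static List<Token> emit(String original, String arabic, Mode mode) {
        List<Token> out = new ArrayList<>();
        if (mode == Mode.EXPAND) {
            out.add(new Token(original, 1)); // keep the Kanji numeral as written
            out.add(new Token(arabic, 0));   // Arabic form at the same position
        } else {
            out.add(new Token(arabic, 1));   // Arabic form replaces the original
        }
        return out;
    }
}
```

Either way exactly one position is consumed per numeral, which is the property the comment is worried about for phrase matching.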
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426340#comment-13426340 ]

Kazuaki Hiraga commented on LUCENE-3922:

Hi Christian,

Great! I will test your patch and get back to you!!

Thanks,
Kazu
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285565#comment-14285565 ]

Kazuaki Hiraga commented on LUCENE-3922:

[~cm], sounds great! Can I test this feature? If yes, what version should I use?

Key: LUCENE-3922
Assignee: Christian Moen
Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuaki Hiraga resolved LUCENE-3922.
    Resolution: Fixed
    Lucene Fields: (was: New)

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 4.0-ALPHA
> Reporter: Kazuaki Hiraga
> Assignee: Christian Moen
> Priority: Major
> Labels: features
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694286#comment-16694286 ]

Kazuaki Hiraga commented on LUCENE-3922:

I have confirmed that there are still some normalization issues where Kanji numerals are normalized incorrectly. However, the implementation itself has been finished and merged into the main branch. Thus, I will close this ticket and file another ticket to report the normalization issues and send patches.
[jira] [Commented] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
[ https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16828033#comment-16828033 ]

Kazuaki Hiraga commented on LUCENE-4056:

[~Tomoko Uchida] I am going to prepare a patch. So, let's work together to fix the issue.

> Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
>
> Key: LUCENE-4056
> URL: https://issues.apache.org/jira/browse/LUCENE-4056
> Project: Lucene - Core
> Priority: Major
[jira] [Commented] (LUCENE-8777) Inconsistent behavior in JapaneseTokenizer search mode
[ https://issues.apache.org/jira/browse/LUCENE-8777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824940#comment-16824940 ]

Kazuaki Hiraga commented on LUCENE-8777:

I think the first one is the expected behavior from the perspective of the current user dictionary. If we want to change that behavior, the ticket title may be misleading; I think it should be *changing behavior of user dictionary*.

> Inconsistent behavior in JapaneseTokenizer search mode
> --
>
> Key: LUCENE-8777
> URL: https://issues.apache.org/jira/browse/LUCENE-8777
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Minor
>
> A user reported to me about inconsistent behaviour in JapaneseTokenizer's search mode.
> Without user dictionary, JapaneseTokenizer (mode=search) outputs "long token" and all of "short (custom segmented) token"s.
> e.g.:
> 関西国際空港 => 関西 / 関西国際空港 / 国際 / 空港
> With user dictionary, JapaneseTokenizer (mode=search) outputs all short tokens but not long token.
> e.g.:
> {code}
> $ cat config/userdict.txt
> 関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
> {code}
> 関西国際空港 => 関西 / 国際 / 空港
>
> This behaviour is confusing for users and would be better to be fixed. I am not sure which behaviour is correct, but in my perspective the first one (without user dictionary) is preferable.
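The userdict.txt line quoted in the issue has four comma-separated fields: the surface form, its space-separated segmentation, the space-separated readings, and a part-of-speech label. The sketch below only reads that layout as it appears in the example; the class `UserDictEntry` is hypothetical, it is not Lucene's own user dictionary loader, and the check that segments and readings have equal counts is an assumption drawn from the example rather than a documented rule. The sample in `main` uses ASCII stand-ins for the Japanese fields.

```java
/** Illustrative only: one parsed line of a user dictionary file as shown in the issue. */
final class UserDictEntry {
    final String surface;
    final String[] segments;
    final String[] readings;
    final String partOfSpeech;

    private UserDictEntry(String surface, String[] segments, String[] readings, String partOfSpeech) {
        this.surface = surface;
        this.segments = segments;
        this.readings = readings;
        this.partOfSpeech = partOfSpeech;
    }

    /** Parses "surface,seg1 seg2 ...,reading1 reading2 ...,pos". */
    static UserDictEntry parse(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 fields but found " + fields.length);
        }
        String[] segments = fields[1].trim().split("\\s+");
        String[] readings = fields[2].trim().split("\\s+");
        if (segments.length != readings.length) {
            // Assumed rule: the example pairs each segment with one reading.
            throw new IllegalArgumentException("segments and readings differ in count");
        }
        return new UserDictEntry(fields[0].trim(), segments, readings, fields[3].trim());
    }

    public static void main(String[] unused) {
        UserDictEntry entry = parse("abcdef,ab cd ef,AB CD EF,custom noun");
        System.out.println(entry.surface + " has " + entry.segments.length + " segments");
    }
}
```

Seen this way, the entry carries both the long surface form and its short segments, which is the information search mode would need in order to emit the long token as well.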
[jira] [Commented] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
[ https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825122#comment-16825122 ]

Kazuaki Hiraga commented on LUCENE-4056:

I agree with [~Tomoko Uchida] and I believe that UniDic is more suitable for Japanese full-text information retrieval, since the dictionary is well maintained by researchers at a Japanese government-funded institute and applies stricter rules than the IPA dictionary, which are intended to produce consistent tokenization results.
[jira] [Comment Edited] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
[ https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825122#comment-16825122 ]

Kazuaki Hiraga edited comment on LUCENE-4056 at 4/24/19 1:00 PM:
-----------------------------------------------------------------

I agree with [~Tomoko Uchida], and I believe that UniDic is more suitable for Japanese full-text information retrieval, since the dictionary is well maintained by researchers at a Japanese government-funded institute and it applies stricter rules than the IPA dictionary, rules that are intended to produce consistent tokenization results.

was (Author: h.kazuaki):
I agree with [~Tomoko Uchida] and I believe that UniDic is more suitable for Japanese full-text information retrieval since the dictionary is well maintained by researchers of Japanese government funded institute and applies stricter rules than IPAdictionary that intend to produce consistent tokenization results.