[jira] [Created] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary

2012-05-14 Thread Kazuaki Hiraga (JIRA)
Kazuaki Hiraga created LUCENE-4056:
--

 Summary: Japanese Tokenizer (Kuromoji) cannot build UniDic 
dictionary
 Key: LUCENE-4056
 URL: https://issues.apache.org/jira/browse/LUCENE-4056
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6
 Environment: Solr 3.6
UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
Reporter: Kazuaki Hiraga


I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I think 
UniDic is a better dictionary than the IPA dictionary, so Kuromoji for 
Lucene/Solr should support the UniDic dictionary as standalone Kuromoji does.

The following is my procedure:
I modified build.xml under the lucene/contrib/analyzers/kuromoji directory and 
ran 'ant build-dict', which failed with the error below.

build-dict:
 [java] dictionary builder
 [java] 
 [java] dictionary format: UNIDIC
 [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
 [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
 [java] input encoding: utf-8
 [java] normalize entries: false
 [java] 
 [java] building tokeninfo dict...
 [java]   parse...
 [java]   sort...
 [java] Exception in thread "main" java.lang.AssertionError
 [java]   encode...
 [java]   at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
 [java]   at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
 [java]   at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
 [java]   at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
 [java]   at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)



And the diff of build.xml:

===
--- build.xml   (revision 1338023)
+++ build.xml   (working copy)
@@ -28,19 +28,31 @@
   <property name="maven.dist.dir" location="../../../dist/maven" />
 
   <!-- default configuration: uses mecab-ipadic -->
+  <!--
   <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
   <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
   <property name="dict.url" value="http://mecab.googlecode.com/files/${dict.src.file}"/>
+  -->
 
   <!-- alternative configuration: uses mecab-naist-jdic
   <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
   <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
   <property name="dict.url" value="http://sourceforge.jp/frs/redir.php?m=iij&amp;f=/naist-jdic/53500/${dict.src.file}"/>
   -->
-  
+
+  <!-- alternative configuration: uses UniDic -->
+  <property name="ipadic.version" value="unidic-mecab1312src" />
+  <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
+  <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
+
   <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
+  <!--
   <property name="dict.encoding" value="euc-jp"/>
   <property name="dict.format" value="ipadic"/>
+  -->
+  <property name="dict.encoding" value="utf-8"/>
+  <property name="dict.format" value="unidic"/>
+
   <property name="dict.normalize" value="false"/>
   <property name="dict.target.dir" location="./src/resources"/>
 
@@ -58,7 +70,8 @@
 
   <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
   <target name="download-dict" unless="dict.available">
- <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
+ <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
+ <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
  <gunzip src="${build.dir}/${dict.src.file}"/>
  <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
   </target>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary

2012-05-15 Thread Kazuaki Hiraga (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276429#comment-13276429
 ] 

Kazuaki Hiraga commented on LUCENE-4056:


Hi Christian,

Thank you for your comment.

I understand the situation. I didn't expect UniDic to be bundled and shipped 
with Kuromoji. For the time being, I just want to build it and use it with 
Kuromoji for Lucene/Solr.

We have just started evaluating UniDic and it is at a very early stage, so we 
have not yet concluded whether we have to or need to use UniDic instead of the 
IPA dictionary. Although we haven't finished our evaluation, I like UniDic's 
concept and policy of strictly defining how tokens are specified, and I am 
satisfied with the tokenization results. I think it is better than the IPA 
dictionary for Katakana segmentation and compound segmentation.

On the other hand, I understand there is a license issue that we have to resolve 
if we decide to use it in our internal services. Thanks for reminding me.

Thanks.


[jira] [Created] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-06-07 Thread Kazuaki Hiraga (JIRA)
Kazuaki Hiraga created SOLR-3524:


 Summary: Make discard-punctuation feature in Kuromoji configurable 
from JapaneseTokenizerFactory
 Key: SOLR-3524
 URL: https://issues.apache.org/jira/browse/SOLR-3524
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6
Reporter: Kazuaki Hiraga
Priority: Minor


JapaneseTokenizer (Kuromoji) has a parameter that controls whether punctuation 
is discarded, but there is no configuration option to preserve punctuation in 
Japanese text. JapaneseTokenizerFactory always sets the third parameter, which 
controls this behavior, to true, so punctuation is removed.
I would like an option to configure this behavior in the fieldtype definition 
in schema.xml.




[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-06-08 Thread Kazuaki Hiraga (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291836#comment-13291836
 ] 

Kazuaki Hiraga commented on SOLR-3524:
--

Thank you guys!
Christian, some documents contain keywords made up of letters and punctuation, 
such as c++ and c#, and we want to match those keywords in their unchanged 
form. Of course, we will discard punctuation in many cases, but in some cases, 
especially for short text, we want to preserve it. Therefore, I would like an 
option to control this behaviour.
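
A toy illustration of why this matters for short keywords. This is only a sketch and not how Kuromoji segments text: the function `tokenize` and its `discard_punctuation` flag are hypothetical stand-ins for the tokenizer parameter discussed in this issue.

```python
import re


def tokenize(text, discard_punctuation=True):
    """Split text into runs of word characters and runs of punctuation.

    Hypothetical stand-in for a tokenizer with a discard-punctuation switch.
    """
    # \w+ matches letters and digits (including Japanese characters);
    # [^\w\s]+ matches runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    if discard_punctuation:
        # Keep only the tokens that start with a word character.
        tokens = [t for t in tokens if re.match(r"\w", t)]
    return tokens


print(tokenize("c++ と c#", discard_punctuation=True))   # ['c', 'と', 'c']
print(tokenize("c++ と c#", discard_punctuation=False))  # ['c', '++', 'と', 'c', '#']
```

With punctuation discarded, c++ and c# both collapse to the single token c, so the two keywords can no longer be told apart at query time.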

Ohtani-san, thank you for your early reply and patch! 

 Make discard-punctuation feature in Kuromoji configurable from 
 JapaneseTokenizerFactory
 ---

 Key: SOLR-3524
 URL: https://issues.apache.org/jira/browse/SOLR-3524
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6
Reporter: Kazuaki Hiraga
Priority: Minor
 Attachments: SOLR-3524.patch, kuromoji_discard_punctuation.patch.txt






[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-06 Thread Kazuaki Hiraga (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471068#comment-13471068
 ] 

Kazuaki Hiraga commented on LUCENE-3922:


Sorry for this late reply.

Although I have some requests for improvement, this is a very helpful and nice 
charfilter for me.
Thank you, Christian!!

My requests are the following:

Is it difficult to support numbers with a decimal point, such as the following?
3.2兆円
5.2億円

On the other hand, I agree with Christian about not preserving leading zeros. 
So, ◯◯七 doesn't need to become 007.

I think it would be helpful if this charfilter supported old Kanji numeric 
characters (KYU-KANJI or DAIJI) such as 壱, 壹, 弌 (one), 弐, 貳, 弍 (two), and 
参, 參 (three), or if this were configurable.
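
The three requests above can be sketched in a few lines. This is a minimal sketch under my own assumptions, not the code in LUCENE-3922.patch: the function name `kanji_to_arabic` and the tables are hypothetical, only the daiji characters listed above are covered, and both 〇 and ◯ are treated as zero.

```python
from decimal import Decimal

# Kanji digits; both 〇 (U+3007) and ◯ (U+25EF) are treated as zero here.
DIGITS = {c: i for i, c in enumerate("〇一二三四五六七八九")}
DIGITS["◯"] = 0
# Old-style (daiji) forms mentioned above, mapped to their values.
DIGITS.update({"壱": 1, "壹": 1, "弌": 1, "弐": 2, "貳": 2, "弍": 2, "参": 3, "參": 3})
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8, "兆": 10**12}


def kanji_to_arabic(text):
    """Convert one numeric expression (without the 円 suffix) to Arabic digits."""
    total = Decimal(0)    # sum of completed 万/億/兆 groups
    section = Decimal(0)  # value of the current group (below 10**4)
    pending = ""          # digits read so far, e.g. "3.2" or "007"
    for ch in text:
        if ch in DIGITS:
            pending += str(DIGITS[ch])
        elif ch in "0123456789.":
            pending += ch
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 counts as 1 * 10.
            section += (Decimal(pending) if pending else 1) * SMALL_UNITS[ch]
            pending = ""
        elif ch in LARGE_UNITS:
            section += Decimal(pending) if pending else 0
            total += section * LARGE_UNITS[ch]
            section, pending = Decimal(0), ""
        else:
            raise ValueError(f"not part of a number: {ch!r}")
    total += section + (Decimal(pending) if pending else 0)
    # Leading zeros disappear because the result is rendered from its value.
    return str(int(total)) if total == total.to_integral_value() else str(total)


print(kanji_to_arabic("3.2兆"))     # 3200000000000
print(kanji_to_arabic("5.2億"))     # 520000000
print(kanji_to_arabic("12万4800"))  # 124800
print(kanji_to_arabic("◯◯七"))     # 7
print(kanji_to_arabic("壱万"))      # 10000
```

Decimal is used instead of float so that 3.2兆 comes out as an exact integer.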

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
  Labels: features
 Attachments: LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 
 Nibancho) and 十二月 (December). So, we would like to normalize those Kanji 
 numerals to Arabic numerals (I don't think we need the capability to normalize 
 to Kanji numerals).
  




[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-06 Thread Kazuaki Hiraga (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471123#comment-13471123
 ] 

Kazuaki Hiraga commented on LUCENE-3922:


Lance, you may be right.  Although I have never seen Japanese people use 
Kanji numbers for James Bond movies :-), I can't say that we never use Kanji 
for that kind of expression.

Christian, is it possible to make preserving leading zeros optional?




[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-11 Thread Kazuaki Hiraga (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474210#comment-13474210
 ] 

Kazuaki Hiraga commented on LUCENE-3922:


The following examples are false positives:
姿三四郎 became 姿, 34, 郎
小林一茶 became 小林, 1, 茶
鈴木一郎 became 鈴木, 1, 郎

Can we prevent this behavior?
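
To show where such false positives come from, here is a minimal sketch that assumes normalization is applied to raw characters before tokenization, with no knowledge of word boundaries. It is not the code in the patch, and `naive_normalize` is a hypothetical name.

```python
import re

KANJI_DIGITS = {c: str(i) for i, c in enumerate("〇一二三四五六七八九")}
KANJI_DIGIT_RUN = re.compile("[〇一二三四五六七八九]+")


def naive_normalize(text):
    """Replace every run of Kanji digits, regardless of the surrounding word."""
    return KANJI_DIGIT_RUN.sub(
        lambda m: "".join(KANJI_DIGITS[c] for c in m.group()), text
    )


for name in ["姿三四郎", "小林一茶", "鈴木一郎"]:
    print(name, "->", naive_normalize(name))
# 姿三四郎 -> 姿34郎
# 小林一茶 -> 小林1茶
# 鈴木一郎 -> 鈴木1郎
```

Because the replacement happens before the tokenizer sees the text, the personal names are broken up and can no longer be matched as dictionary words.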




[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-11 Thread Kazuaki Hiraga (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474257#comment-13474257
 ] 

Kazuaki Hiraga commented on LUCENE-3922:


Hi Christian,

That what I am thinking. I think TokenFilter would be a good choice to 
implement that feature. We can use POS tag to recognize what a token is. We can 
apply normalization if a token is a numeral prefix/suffix with numerals. 
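
The idea can be sketched as follows. This is a rough illustration rather than a Lucene TokenFilter: tokens are plain (surface, part-of-speech) pairs, the POS strings follow IPADIC naming, and the tags in the examples are hand-written assumptions, not real tokenizer output.

```python
KANJI_DIGITS = {c: str(i) for i, c in enumerate("〇一二三四五六七八九")}

# IPADIC-style part-of-speech tags (assumed for this illustration).
NUMERAL = "名詞-数"
NUMERAL_PREFIX = "接頭詞-数接続"
COUNTER_SUFFIX = "名詞-接尾-助数詞"


def normalize_numerals(tokens):
    """Normalize a numeral token only when a numeral prefix precedes it
    or a counter suffix follows it."""
    out = []
    for i, (surface, pos) in enumerate(tokens):
        prev_pos = tokens[i - 1][1] if i > 0 else None
        next_pos = tokens[i + 1][1] if i + 1 < len(tokens) else None
        if (pos == NUMERAL
                and all(c in KANJI_DIGITS for c in surface)
                and (prev_pos == NUMERAL_PREFIX or next_pos == COUNTER_SUFFIX)):
            surface = "".join(KANJI_DIGITS[c] for c in surface)
        out.append((surface, pos))
    return out


name = [("姿", "名詞-固有名詞-人名-姓"), ("三四郎", "名詞-固有名詞-人名-名")]
count = [("三", NUMERAL), ("個", COUNTER_SUFFIX)]
print(normalize_numerals(name))   # unchanged: the tokens are tagged as a person name
print(normalize_numerals(count))  # [('3', '名詞-数'), ('個', '名詞-接尾-助数詞')]
```

Working on tokens rather than characters lets the tokenizer's dictionary decide first whether 三四郎 is a name or a number.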




[jira] [Comment Edited] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-12 Thread Kazuaki Hiraga (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474257#comment-13474257
 ] 

Kazuaki Hiraga edited comment on LUCENE-3922 at 10/12/12 2:16 PM:
--

Hi Christian,

That's what I am thinking. I think TokenFilter would be a good choice to 
implement that feature. We can use POS tag to recognize what a token is. We can 
apply normalization if a token is a numeral prefix/suffix with numerals. 


  was (Author: h.kazuaki):
Hi Christian,

That what I am thinking. I think TokenFilter would be a good choice to 
implement that feature. We can use POS tag to recognize what a token is. We can 
apply normalization if a token is a numeral prefix/suffix with numerals. 
  



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-12 Thread Kazuaki Hiraga (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475016#comment-13475016
 ] 

Kazuaki Hiraga commented on LUCENE-3922:


It would be nice if we could choose whether to expand them or normalize them.

I am concerned that Solr's query-side synonym expansion doesn't work well when 
the number of tokens differs between the original tokens and the synonym 
tokens. Phrase matching with query-side synonym expansion in particular will be 
a disaster. (Of course, reduction or index-side expansion would be better, but 
we sometimes need to use a TokenFilter that provides such a capability on the 
query side.) So, I would like to be able to configure whether Kanji numerals 
are normalized to Arabic numerals or Arabic numerals are stored along with the 
Kanji numerals. 
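
The two configurations could look roughly like this. It is a sketch only: tokens are modelled as (term, position increment) pairs in the way Lucene token streams describe positions, and the function name `apply_numeral_filter` and its `mode` values are hypothetical.

```python
KANJI_DIGITS = {c: str(i) for i, c in enumerate("〇一二三四五六七八九")}


def apply_numeral_filter(tokens, mode="normalize"):
    """tokens is a list of (term, position_increment) pairs.

    mode="normalize": replace a Kanji numeral with its Arabic form.
    mode="expand":    keep the Kanji numeral and add the Arabic form at the
                      same position (position increment 0).
    """
    if mode not in ("normalize", "expand"):
        raise ValueError(f"unknown mode: {mode!r}")
    out = []
    for term, pos_inc in tokens:
        if term and all(c in KANJI_DIGITS for c in term):
            arabic = "".join(KANJI_DIGITS[c] for c in term)
            if mode == "normalize":
                out.append((arabic, pos_inc))
            else:
                out.append((term, pos_inc))
                out.append((arabic, 0))  # stacked on the original token
        else:
            out.append((term, pos_inc))
    return out


tokens = [("三", 1), ("個", 1)]
print(apply_numeral_filter(tokens, "normalize"))  # [('3', 1), ('個', 1)]
print(apply_numeral_filter(tokens, "expand"))     # [('三', 1), ('3', 0), ('個', 1)]
```

In both modes each numeral still occupies exactly one position, so the positions of the following tokens do not shift, which is the property phrase matching depends on.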




[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-07-31 Thread Kazuaki Hiraga (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426340#comment-13426340
 ] 

Kazuaki Hiraga commented on LUCENE-3922:


Hi Christian,

Great! I will test your patch and get back to you!!

Thanks,
Kazu




[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-01-21 Thread Kazuaki Hiraga (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285565#comment-14285565
 ] 

Kazuaki Hiraga commented on LUCENE-3922:


[~cm] , sounds great! Can I test this feature? If yes, what version should I 
use?

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
  Labels: features
 Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
 LUCENE-3922.patch, LUCENE-3922.patch





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Resolved] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2018-11-20 Thread Kazuaki Hiraga (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Hiraga resolved LUCENE-3922.

   Resolution: Fixed
Lucene Fields:   (was: New)

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>Assignee: Christian Moen
>Priority: Major
>  Labels: features
> Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
> LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
> LUCENE-3922.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2018-11-20 Thread Kazuaki Hiraga (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694286#comment-16694286
 ] 

Kazuaki Hiraga commented on LUCENE-3922:


I have confirmed that there are still some issues that cause Kanji numerals to 
be normalized incorrectly. However, the implementation itself has been 
finished and merged into the main branch. Thus, I will close this ticket and 
file another ticket to report the normalization issues and send patches. 

 




[jira] [Commented] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary

2019-04-28 Thread Kazuaki Hiraga (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16828033#comment-16828033
 ] 

Kazuaki Hiraga commented on LUCENE-4056:


[~Tomoko Uchida] I am going to prepare a patch. So, let's work together to fix 
the issue.

> Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
> 
>
> Key: LUCENE-4056
> URL: https://issues.apache.org/jira/browse/LUCENE-4056
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 3.6
> Environment: Solr 3.6
> UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
>Reporter: Kazuaki Hiraga
>Priority: Major
>



[jira] [Commented] (LUCENE-8777) Inconsistent behavior in JapaneseTokenizer search mode

2019-04-24 Thread Kazuaki Hiraga (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824940#comment-16824940
 ] 

Kazuaki Hiraga commented on LUCENE-8777:


I think the first one is the expected behavior from the perspective of the 
current user dictionary. If we want to change that behavior, the ticket as 
titled may misrepresent the issue. I think it should be *changing behavior of 
user dictionary*. 

 

> Inconsistent behavior in JapaneseTokenizer search mode
> --
>
> Key: LUCENE-8777
> URL: https://issues.apache.org/jira/browse/LUCENE-8777
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Minor
>
> A user reported inconsistent behaviour in JapaneseTokenizer's search mode 
> to me.
> Without a user dictionary, JapaneseTokenizer (mode=search) outputs the "long 
> token" and all of the "short (custom segmented) tokens".
> e.g.:
> 関西国際空港 => 関西 / 関西国際空港 / 国際 / 空港
> With a user dictionary, JapaneseTokenizer (mode=search) outputs all of the 
> short tokens but not the long token.
> e.g.:
> {code}
> $ cat config/userdict.txt 
> 関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
> {code}
> 関西国際空港 => 関西 / 国際 / 空港
>  
> This behaviour is confusing for users and should be fixed. I am not sure 
> which behaviour is correct, but from my perspective the first one 
> (without a user dictionary) is preferable.
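
The two outputs described in the issue can be put side by side with a small toy model. This is only a sketch and does not call Lucene: the class `UserDictSketch` and its methods are hypothetical names, and the four-field layout (surface form, space-separated segments, space-separated readings, part-of-speech tag) is assumed from the single example line in the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the two reported outputs; it does not call Lucene. */
final class UserDictSketch {

    /** One user dictionary line, split into its four comma-separated fields. */
    static final class Entry {
        final String surface;         // e.g. 関西国際空港
        final List<String> segments;  // e.g. 関西, 国際, 空港
        final List<String> readings;  // one reading per segment
        final String partOfSpeech;    // e.g. カスタム名詞

        Entry(String surface, List<String> segments, List<String> readings, String partOfSpeech) {
            this.surface = surface;
            this.segments = segments;
            this.readings = readings;
            this.partOfSpeech = partOfSpeech;
        }
    }

    /** Parses a line laid out like the example in the report (assumed format). */
    static Entry parse(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 fields: " + line);
        }
        List<String> segments = Arrays.asList(fields[1].trim().split("\\s+"));
        List<String> readings = Arrays.asList(fields[2].trim().split("\\s+"));
        if (segments.size() != readings.size()) {
            throw new IllegalArgumentException("segments and readings differ in count: " + line);
        }
        return new Entry(fields[0].trim(), segments, readings, fields[3].trim());
    }

    /** Output reported WITH a user dictionary: the short tokens only. */
    static List<String> shortTokensOnly(Entry e) {
        return new ArrayList<>(e.segments);
    }

    /** Output reported WITHOUT a user dictionary: the long token follows the first short token. */
    static List<String> shortAndLongTokens(Entry e) {
        List<String> out = new ArrayList<>(e.segments);
        out.add(1, e.surface);
        return out;
    }

    public static void main(String[] args) {
        Entry e = parse("関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞");
        System.out.println(String.join(" / ", shortTokensOnly(e)));
        System.out.println(String.join(" / ", shortAndLongTokens(e)));
    }
}
```

Running `main` prints the two token sequences from the report, one per line. The model only makes the difference explicit (whether the surface form is kept alongside its segments); it says nothing about how JapaneseTokenizer decides this internally.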






[jira] [Commented] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary

2019-04-24 Thread Kazuaki Hiraga (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825122#comment-16825122
 ] 

Kazuaki Hiraga commented on LUCENE-4056:


I agree with [~Tomoko Uchida], and I believe that UniDic is more suitable for 
Japanese full-text information retrieval, since the dictionary is well 
maintained by researchers at a Japanese government-funded institute and applies 
stricter rules than the IPA dictionary does, with the aim of producing 
consistent tokenization results. 

> Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
> 
>
> Key: LUCENE-4056
> URL: https://issues.apache.org/jira/browse/LUCENE-4056
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 3.6
> Environment: Solr 3.6
> UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
>Reporter: Kazuaki Hiraga
>Priority: Major
>
> I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I 
> think UniDic is a better dictionary than the IPA dictionary, so Kuromoji for 
> Lucene/Solr should support the UniDic dictionary as standalone Kuromoji does.
> The following is my procedure:
> I modified build.xml under the lucene/contrib/analyzers/kuromoji directory 
> and ran 'ant build-dict', which produced the error below.
> build-dict:
>  [java] dictionary builder
>  [java] 
>  [java] dictionary format: UNIDIC
>  [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
>  [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
>  [java] input encoding: utf-8
>  [java] normalize entries: false
>  [java] 
>  [java] building tokeninfo dict...
>  [java]   parse...
>  [java]   sort...
>  [java] Exception in thread "main" java.lang.AssertionError
>  [java]   encode...
>  [java]   at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
>  [java]   at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
>  [java]   at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
>  [java]   at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
>  [java]   at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
> And the diff of build.xml:
> ===
> --- build.xml (revision 1338023)
> +++ build.xml (working copy)
> @@ -28,19 +28,31 @@
>
>  
>
> +  
>  
>
> -  
> +
> +  
> +  
> +  
> +   value="/home/kazu/Work/src/nlp/unidic/_archive"/>
> +
>
> +  
> +  
> +  
> +
>
>
>  
> @@ -58,7 +70,8 @@
>  
>
>
> - 
> + 
> +  tofile="${build.dir}/${dict.src.file}"/>
>   
>   
>






[jira] [Comment Edited] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary

2019-04-24 Thread Kazuaki Hiraga (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825122#comment-16825122
 ] 

Kazuaki Hiraga edited comment on LUCENE-4056 at 4/24/19 1:00 PM:
-

I agree with [~Tomoko Uchida], and I believe that UniDic is more suitable for 
Japanese full-text information retrieval, since the dictionary is well 
maintained by researchers at a Japanese government-funded institute, and it 
applies stricter rules than the IPA dictionary does, with the aim of producing 
consistent tokenization results. 


was (Author: h.kazuaki):
I agree with [~Tomoko Uchida] and I believe that UniDic is more suitable for 
Japanese full-text information retrieval since the dictionary is well 
maintained by researchers of Japanese government funded institute and applies 
stricter rules than IPAdictionary that intend to produce consistent 
tokenization results. 



