[jira] [Commented] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary

Christian Moen (JIRA) Mon, 14 May 2012 00:30:19 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274485#comment-13274485
 ]


Christian Moen commented on LUCENE-4056:
----------------------------------------

Hello Kazu,

Only the IPA dictionary is currently supported.

Adding support for UniDic shouldn't be a big technical issue, but I think we 
perhaps should introduce some notion of dictionary concept in Kuromoji if we 
add support for other dictionaries since weights for the search mode heuristic 
needs to be tuned on a per-dictionary basis.

Also note that the stop tag set used by JapanesePartOfSpeechStopFilter also 
needs to be aligned with the relevant POS tag set of UniDic and we might also 
want to update the stop words list based on UniDic segmentation in order to 
properly support it end-to-end.  (I'd expect it to be the same or nearly the 
same as with IPA, though.)

Hence, adding UniDic support end-to-end creates a cascade of changes.

Even though UniDic is indeed a good dictionary, its license is quite 
restrictive and doesn't allow redistribution, requires a license for commercial 
use, only permits personal use, etc.  This is from {{license.txt}} (in 
Japanese):

{noformat}
UniDic ver.1.3.12 利用条件

1. UniDic ver.1.3.12 の著作権は，伝康晴・山田篤・小椋秀樹・小磯花絵・小木曽智信が保持する。
2. UniDic ver.1.3.12 を複製又は改変することは，個人的な利用に限り認める。
3. UniDic ver.1.3.12 及びこれを改変したものを再配布してはならない。
4. UniDic ver.1.3.12 を利用して行った研究等の成果を公表する場合は，UniDic ver.1.3.12 を利用したことを明記すること。
5. 営利を目的として，UniDic ver.1.3.12 を利用する場合は，事前に著作権者と協議すること。
6. UniDic ver.1.3.12 を利用することによって，直接的・間接的に生じたいかなる損害についても，著作権者は賠償する責任を負わない。
7. 本文書に定めのない事項については，著作権者と協議すること。
{noformat}

As a result, we won't be able to redistribute it and support it out-of-the-box 
with Lucene/Solr, and using it will most likely require a custom dictionary 
build that you have attempted.

Would fixing the dictionary builder for UniDic be a useful starting point in 
your case?

Do you have any information you can share regarding what sort of improvements 
you expect to see with UniDic?  Does it relate to compound segmentation in 
general or katakana compounds?  If you can also share some information on how 
you think we should support UniDic and how big the demand for such support is, 
that would be very useful.  Thanks.
                
> Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
> ------------------------------------------------------------
>
>                 Key: LUCENE-4056
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4056
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6
>         Environment: Solr 3.6
> UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
>            Reporter: Kazuaki Hiraga
>
> I tried to build a UniDic dictionary for using it along with Kuromoji on Solr 
> 3.6. I think UniDic is a good dictionary than IPA dictionary, so Kuromoji for 
> Lucene/Solr should support UniDic dictionary as standalone Kuromoji does.
> The following is my procedure:
> Modified build.xml under lucene/contrib/analyzers/kuromoji directory and run 
> 'ant build-dict', I got the error as the below.
> build-dict:
>      [java] dictionary builder
>      [java] 
>      [java] dictionary format: UNIDIC
>      [java] input directory: 
> /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
>      [java] output directory: 
> /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
>      [java] input encoding: utf-8
>      [java] normalize entries: false
>      [java] 
>      [java] building tokeninfo dict...
>      [java]   parse...
>      [java]   sort...
>      [java] Exception in thread "main" java.lang.AssertionError
>      [java]   encode...
>      [java]   at 
> org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
>      [java]   at 
> org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
>      [java]   at 
> org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
>      [java]   at 
> org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
>      [java]   at 
> org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
> And the diff of build.xml:
> ===================================================================
> --- build.xml (revision 1338023)
> +++ build.xml (working copy)
> @@ -28,19 +28,31 @@
>    <property name="maven.dist.dir" location="../../../dist/maven" />
>  
>    <!-- default configuration: uses mecab-ipadic -->
> +  <!--
>    <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
>    <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
>    <property name="dict.url" 
> value="http://mecab.googlecode.com/files/${dict.src.file}"/>
> +  -->
>  
>    <!-- alternative configuration: uses mecab-naist-jdic
>    <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
>    <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
>    <property name="dict.url" 
> value="http://sourceforge.jp/frs/redir.php?m=iij&amp;f=/naist-jdic/53500/${dict.src.file}"/>
>    -->
> -  
> +
> +  <!-- alternative configuration: uses UniDic -->
> +  <property name="ipadic.version" value="unidic-mecab1312src" />
> +  <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
> +  <property name="dict.loc.dir" 
> value="/home/kazu/Work/src/nlp/unidic/_archive"/>
> +
>    <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
> +  <!--
>    <property name="dict.encoding" value="euc-jp"/>
>    <property name="dict.format" value="ipadic"/>
> +  -->
> +  <property name="dict.encoding" value="utf-8"/>
> +  <property name="dict.format" value="unidic"/>
> +
>    <property name="dict.normalize" value="false"/>
>    <property name="dict.target.dir" location="./src/resources"/>
>  
> @@ -58,7 +70,8 @@
>  
>    <target name="compile-core" depends="jar-analyzers-common, 
> common.compile-core" />
>    <target name="download-dict" unless="dict.available">
> -     <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
> +     <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
> +     <copy file="${dict.loc.dir}/${dict.src.file}" 
> tofile="${build.dir}/${dict.src.file}"/>
>       <gunzip src="${build.dir}/${dict.src.file}"/>
>       <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
>    </target>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary

Reply via email to