[
https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825122#comment-16825122
]
Kazuaki Hiraga edited comment on LUCENE-4056 at 4/24/19 1:00 PM:
-----------------------------------------------------------------
I agree with [~Tomoko Uchida]. I believe UniDic is more suitable for Japanese
full-text information retrieval: the dictionary is well maintained by
researchers at a Japanese government-funded institute, and it applies stricter
rules than the IPA dictionary, with the aim of producing consistent
tokenization results.
> Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
> ------------------------------------------------------------
>
> Key: LUCENE-4056
> URL: https://issues.apache.org/jira/browse/LUCENE-4056
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 3.6
> Environment: Solr 3.6
> UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
> Reporter: Kazuaki Hiraga
> Priority: Major
>
> I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I think
> UniDic is a better dictionary than the IPA dictionary, so Kuromoji for
> Lucene/Solr should support the UniDic dictionary as standalone Kuromoji does.
> The following is my procedure:
> I modified build.xml under the lucene/contrib/analyzers/kuromoji directory and
> ran 'ant build-dict', which failed with the error below.
> build-dict:
> [java] dictionary builder
> [java]
> [java] dictionary format: UNIDIC
> [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
> [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
> [java] input encoding: utf-8
> [java] normalize entries: false
> [java]
> [java] building tokeninfo dict...
> [java] parse...
> [java] sort...
> [java] Exception in thread "main" java.lang.AssertionError
> [java] encode...
> [java] at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
> [java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
> [java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
> [java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
> [java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
> And the diff of build.xml:
> ===================================================================
> --- build.xml (revision 1338023)
> +++ build.xml (working copy)
> @@ -28,19 +28,31 @@
> <property name="maven.dist.dir" location="../../../dist/maven" />
>
> <!-- default configuration: uses mecab-ipadic -->
> + <!--
> <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
> <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
> <property name="dict.url"
> value="http://mecab.googlecode.com/files/${dict.src.file}"/>
> + -->
>
> <!-- alternative configuration: uses mecab-naist-jdic
> <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
> <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
> <property name="dict.url"
> value="http://sourceforge.jp/frs/redir.php?m=iij&f=/naist-jdic/53500/${dict.src.file}"/>
> -->
> -
> +
> + <!-- alternative configuration: uses UniDic -->
> + <property name="ipadic.version" value="unidic-mecab1312src" />
> + <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
> + <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
> +
> <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
> + <!--
> <property name="dict.encoding" value="euc-jp"/>
> <property name="dict.format" value="ipadic"/>
> + -->
> + <property name="dict.encoding" value="utf-8"/>
> + <property name="dict.format" value="unidic"/>
> +
> <property name="dict.normalize" value="false"/>
> <property name="dict.target.dir" location="./src/resources"/>
>
> @@ -58,7 +70,8 @@
>
> <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
> <target name="download-dict" unless="dict.available">
> - <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
> + <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
> + <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
> <gunzip src="${build.dir}/${dict.src.file}"/>
> <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
> </target>
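
The log above shows only a bare java.lang.AssertionError raised from
BinaryDictionaryWriter.put, with no message, so the report does not say which
condition failed. One way to narrow it down is to inspect the dictionary source
itself. The sketch below is a hypothetical helper (ContextIdScan is not part of
Lucene or Kuromoji) that reads MeCab-style CSV entries, where the second and
third columns are the left and right context IDs, and reports the largest ID
and the number of entries whose two IDs differ. Which of these properties, if
any, the Kuromoji builder asserts on is an assumption to check against
BinaryDictionaryWriter.java; the quoted-surface handling is also simplified.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper; not part of Lucene or Kuromoji. */
public class ContextIdScan {

    /**
     * Extracts {leftId, rightId} from one MeCab CSV entry
     * (surface,left-id,right-id,cost,features...).
     * Returns null for lines that do not look like an entry.
     */
    public static int[] contextIds(String line) {
        int start;
        if (line.startsWith("\"")) {
            // Simplified handling of a quoted surface form such as ",".
            int close = line.indexOf("\",", 1);
            if (close < 0) return null;
            start = close + 2;
        } else {
            int comma = line.indexOf(',');
            if (comma < 0) return null;
            start = comma + 1;
        }
        String[] fields = line.substring(start).split(",", 3);
        if (fields.length < 3) return null;
        try {
            return new int[] {
                Integer.parseInt(fields[0].trim()),
                Integer.parseInt(fields[1].trim())
            };
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Largest left or right context ID in the given lines, or -1 if none parse. */
    public static int maxContextId(List<String> lines) {
        int max = -1;
        for (String line : lines) {
            int[] ids = contextIds(line);
            if (ids != null) max = Math.max(max, Math.max(ids[0], ids[1]));
        }
        return max;
    }

    /** Number of entries whose left and right context IDs differ. */
    public static int countMismatched(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            int[] ids = contextIds(line);
            if (ids != null && ids[0] != ids[1]) count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ContextIdScan <dictionary-source-dir>");
            return;
        }
        List<String> all = new ArrayList<>();
        // The build log above reports utf-8 as the input encoding.
        try (DirectoryStream<Path> csvs =
                 Files.newDirectoryStream(Paths.get(args[0]), "*.csv")) {
            for (Path csv : csvs) {
                all.addAll(Files.readAllLines(csv, StandardCharsets.UTF_8));
            }
        }
        System.out.println("max context id: " + maxContextId(all));
        System.out.println("entries with left != right: " + countMismatched(all));
    }
}
```

Running it against the unpacked unidic-mecab1312src directory and against the
IPA dictionary source would show whether the two dictionaries differ in either
property, which may help explain why the same builder accepts one and not the
other.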