[
https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276449#comment-13276449
]
Robert Muir commented on LUCENE-4056:
-------------------------------------
{quote}
Would fixing the dictionary builder for UniDic be a useful starting point in
your case?
{quote}
That assert from the stacktrace would probably be pretty tricky. It's an
optimization that works for ipadic and naist-jdic, and I knew I was making an
assumption doing it, but it saves a bunch because it exploits a redundancy in
the model (see LUCENE-3699).
To fix it, this optimization would have to either be made conditional or
pulled into a subclass for ipadic and naist-jdic, and unidic would have to be
encoded with a different strategy.
Still, unidic support seems pretty tricky to maintain because if we want to
share any code at all, there is always the possibility it will break in the
future (and due to the license, it is not possible to integrate it into
automatic tests).
Anyway, that's the background for that particular assert. It's my fault, but I
don't have an easy fix!
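A rough Java sketch of the "conditionalize it or use a different strategy per
dictionary" idea follows. It is not the actual BinaryDictionaryWriter code. The
comment above does not say what the redundancy is, so the sketch ASSUMES, purely
for illustration, that the compact form relies on an entry's left and right
context IDs being equal. All type and method names (EntryEncoder,
SharedIdEncoder, SeparateIdEncoder, EntryEncoders) are hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical per-dictionary encoding strategy; not a Lucene API. */
interface EntryEncoder {
  /** Returns the values that would be stored for one dictionary entry. */
  int[] encode(int leftId, int rightId, int wordCost);
}

/**
 * Compact form for dictionaries where the ASSUMED redundancy holds
 * (leftId == rightId), so only one ID is stored.
 */
final class SharedIdEncoder implements EntryEncoder {
  @Override
  public int[] encode(int leftId, int rightId, int wordCost) {
    // An explicit check instead of a bare assert, so a violating
    // dictionary fails with a readable message.
    if (leftId != rightId) {
      throw new IllegalArgumentException(
          "compact encoding requires leftId == rightId, got "
              + leftId + " and " + rightId);
    }
    return new int[] {leftId, wordCost};
  }
}

/** Fallback form that stores both IDs and assumes nothing about them. */
final class SeparateIdEncoder implements EntryEncoder {
  @Override
  public int[] encode(int leftId, int rightId, int wordCost) {
    return new int[] {leftId, rightId, wordCost};
  }
}

/** Picks a strategy from the dictionary format name. */
final class EntryEncoders {
  private EntryEncoders() {}

  static EntryEncoder forFormat(String format) {
    switch (format.toLowerCase(Locale.ROOT)) {
      case "ipadic":
      case "naist-jdic": // hypothetical format name for this sketch
        return new SharedIdEncoder();
      case "unidic":
        return new SeparateIdEncoder();
      default:
        throw new IllegalArgumentException("unknown dictionary format: " + format);
    }
  }
}

class EncoderDemo {
  public static void main(String[] args) {
    System.out.println(Arrays.toString(EntryEncoders.forFormat("ipadic").encode(5, 5, 100)));
    System.out.println(Arrays.toString(EntryEncoders.forFormat("unidic").encode(5, 6, 100)));
  }
}
```

The trade-off is the one described above: the compact path keeps its savings
for ipadic and naist-jdic, while the separate path costs more space per entry
and would also need a matching reader, which this sketch leaves out.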
> Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
> ------------------------------------------------------------
>
> Key: LUCENE-4056
> URL: https://issues.apache.org/jira/browse/LUCENE-4056
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 3.6
> Environment: Solr 3.6
> UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
> Reporter: Kazuaki Hiraga
>
> I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I
> think UniDic is a better dictionary than the IPA dictionary, so Kuromoji for
> Lucene/Solr should support the UniDic dictionary as standalone Kuromoji does.
> The following is my procedure:
> I modified build.xml under the lucene/contrib/analyzers/kuromoji directory
> and ran 'ant build-dict', and got the error below.
> build-dict:
> [java] dictionary builder
> [java]
> [java] dictionary format: UNIDIC
> [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
> [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
> [java] input encoding: utf-8
> [java] normalize entries: false
> [java]
> [java] building tokeninfo dict...
> [java] parse...
> [java] sort...
> [java] Exception in thread "main" java.lang.AssertionError
> [java] encode...
> [java] at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
> [java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
> [java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
> [java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
> [java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
> And the diff of build.xml:
> ===================================================================
> --- build.xml (revision 1338023)
> +++ build.xml (working copy)
> @@ -28,19 +28,31 @@
> <property name="maven.dist.dir" location="../../../dist/maven" />
>
> <!-- default configuration: uses mecab-ipadic -->
> + <!--
> <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
> <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
> <property name="dict.url"
> value="http://mecab.googlecode.com/files/${dict.src.file}"/>
> + -->
>
> <!-- alternative configuration: uses mecab-naist-jdic
> <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
> <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
> <property name="dict.url"
> value="http://sourceforge.jp/frs/redir.php?m=iij&f=/naist-jdic/53500/${dict.src.file}"/>
> -->
> -
> +
> + <!-- alternative configuration: uses UniDic -->
> + <property name="ipadic.version" value="unidic-mecab1312src" />
> + <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
> + <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
> +
> <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
> + <!--
> <property name="dict.encoding" value="euc-jp"/>
> <property name="dict.format" value="ipadic"/>
> + -->
> + <property name="dict.encoding" value="utf-8"/>
> + <property name="dict.format" value="unidic"/>
> +
> <property name="dict.normalize" value="false"/>
> <property name="dict.target.dir" location="./src/resources"/>
>
> @@ -58,7 +70,8 @@
>
> <target name="compile-core" depends="jar-analyzers-common,
> common.compile-core" />
> <target name="download-dict" unless="dict.available">
> - <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
> + <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
> + <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
> <gunzip src="${build.dir}/${dict.src.file}"/>
> <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
> </target>