[ https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276449#comment-13276449 ]

Robert Muir commented on LUCENE-4056:
-------------------------------------

{quote}
Would fixing the dictionary builder for UniDic be a useful starting point in 
your case?
{quote}

That assert from the stacktrace would probably be pretty tricky to fix.
It's an optimization that works for ipadic and naist-jdic, and I knew I was
making an assumption doing it, but it saves a bunch because it exploits a
redundancy in the model (see LUCENE-3699).

To fix it, this optimization would have to either be made conditional or
pulled into a subclass for ipadic and naist-jdic, and unidic would have to
be encoded with a different strategy.
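
As a rough illustration of the "pulled into a subclass" option, the sketch
below separates the compact encoding from a general one. Everything in it is
hypothetical: the class names (EntryEncoder, SharedIdEncoder,
SeparateIdEncoder) do not exist in Lucene, and it assumes, purely for
illustration, that the redundancy being exploited is that each entry's left
and right context ids are equal. The message itself does not say what the
redundancy is, and the real BinaryDictionaryWriter packs its data
differently.

```java
// Hypothetical sketch; none of these classes exist in Lucene.
abstract class EntryEncoder {
    /** Returns the encoded form of one entry's context ids and word cost. */
    abstract short[] encode(short leftId, short rightId, short wordCost);

    /** Picks an encoding strategy from the dictionary format name. */
    static EntryEncoder forFormat(String format) {
        if ("unidic".equalsIgnoreCase(format)) {
            return new SeparateIdEncoder();
        }
        return new SharedIdEncoder(); // ipadic, naist-jdic
    }
}

/** Compact form: assumes leftId == rightId (an assumption) and stores the id once. */
final class SharedIdEncoder extends EntryEncoder {
    @Override
    short[] encode(short leftId, short rightId, short wordCost) {
        // Fail with a descriptive error instead of a bare assert.
        if (leftId != rightId) {
            throw new IllegalArgumentException(
                "shared-id encoding requires leftId == rightId, got "
                    + leftId + " and " + rightId);
        }
        return new short[] { leftId, wordCost };
    }
}

/** General form: stores both ids, at the cost of one extra short per entry. */
final class SeparateIdEncoder extends EntryEncoder {
    @Override
    short[] encode(short leftId, short rightId, short wordCost) {
        return new short[] { leftId, rightId, wordCost };
    }
}
```

With a split like this, the ipadic and naist-jdic path keeps its smaller
output, while a unidic build would pay for the general form. The reader side
would need a matching switch, which is part of the maintenance cost.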

Still, unidic support seems pretty tricky to maintain because if we want to
share any code at all, there is always the possibility it will break in the
future (and because of its license, it can't be integrated into automatic
tests).

Anyway, that's the background for that particular assert. It's my fault, but
I don't have an easy fix!
                
> Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
> ------------------------------------------------------------
>
>                 Key: LUCENE-4056
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4056
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6
>         Environment: Solr 3.6
> UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
>            Reporter: Kazuaki Hiraga
>
> I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I
> think UniDic is a better dictionary than the IPA dictionary, so Kuromoji for
> Lucene/Solr should support the UniDic dictionary as standalone Kuromoji does.
> The following is my procedure:
> I modified build.xml under the lucene/contrib/analyzers/kuromoji directory
> and ran 'ant build-dict', and got the error below.
> build-dict:
>      [java] dictionary builder
>      [java] 
>      [java] dictionary format: UNIDIC
>      [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
>      [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
>      [java] input encoding: utf-8
>      [java] normalize entries: false
>      [java] 
>      [java] building tokeninfo dict...
>      [java]   parse...
>      [java]   sort...
>      [java] Exception in thread "main" java.lang.AssertionError
>      [java]   encode...
>      [java]   at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
>      [java]   at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
>      [java]   at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
>      [java]   at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
>      [java]   at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
> And the diff of build.xml:
> ===================================================================
> --- build.xml (revision 1338023)
> +++ build.xml (working copy)
> @@ -28,19 +28,31 @@
>    <property name="maven.dist.dir" location="../../../dist/maven" />
>  
>    <!-- default configuration: uses mecab-ipadic -->
> +  <!--
>    <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
>    <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
>    <property name="dict.url" value="http://mecab.googlecode.com/files/${dict.src.file}"/>
> +  -->
>  
>    <!-- alternative configuration: uses mecab-naist-jdic
>    <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
>    <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
>    <property name="dict.url" value="http://sourceforge.jp/frs/redir.php?m=iij&amp;f=/naist-jdic/53500/${dict.src.file}"/>
>    -->
> -  
> +
> +  <!-- alternative configuration: uses UniDic -->
> +  <property name="ipadic.version" value="unidic-mecab1312src" />
> +  <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
> +  <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
> +
>    <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
> +  <!--
>    <property name="dict.encoding" value="euc-jp"/>
>    <property name="dict.format" value="ipadic"/>
> +  -->
> +  <property name="dict.encoding" value="utf-8"/>
> +  <property name="dict.format" value="unidic"/>
> +
>    <property name="dict.normalize" value="false"/>
>    <property name="dict.target.dir" location="./src/resources"/>
>  
> @@ -58,7 +70,8 @@
>  
>    <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
>    <target name="download-dict" unless="dict.available">
> -     <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
> +     <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
> +     <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
>       <gunzip src="${build.dir}/${dict.src.file}"/>
>       <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
>    </target>

