[ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867628#comment-16867628
 ] 

Tomoko Uchida commented on LUCENE-8863:
---------------------------------------

Thanks [~sokolov],
I haven't tried the branch yet, but I think I understand the intention of your 
commit. 

A bit belatedly, I have just started learning the kuromoji code and the source 
tree structure. I will open an issue soon to make it possible to build the 
dictionary as a separate jar, which can be loaded by the constructor newly 
added here. (The patch may include a new ant task and some refactoring of 
the source tree structure.)



> Improve handling of edge cases in Kuromoji's DictionaryBuilder
> --------------------------------------------------------------
>
>                 Key: LUCENE-8863
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8863
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Assignee: Mike Sokolov
>            Priority: Major
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> in practice only supports 12 bits, since that was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is that we can 
> easily add 13-bit support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but assertions aren't 
> enabled by default, so an unsuspecting new user gets no benefit from them. 
> We should upgrade them to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!).
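
For readers skimming the thread, the three points in the description can be 
sketched roughly as follows. This is only an illustration under stated 
assumptions, not the code in the commit: the class and method names 
(DictionaryEntryChecks, checkId, pack, normalizeBaseForm, stem) are 
hypothetical, and the bit layout shown is not Kuromoji's actual on-disk 
encoding. The only facts taken from the issue are the 13-bit id width, the 
switch from assertions to exceptions, and treating empty base forms like "*".

```java
final class DictionaryEntryChecks {
    // Width of the left/right id fields as stated in the issue description.
    static final int ID_BITS = 13;
    static final int ID_LIMIT = 1 << ID_BITS; // 8192

    /** Rejects out-of-range ids with a real exception instead of an assert. */
    static int checkId(String name, int id) {
        if (id < 0 || id >= ID_LIMIT) {
            throw new IllegalArgumentException(
                name + " out of range [0, " + ID_LIMIT + "): " + id);
        }
        return id;
    }

    /** Packs two ids into one int (illustrative layout, not Lucene's format). */
    static int pack(int leftId, int rightId) {
        return (checkId("leftId", leftId) << ID_BITS) | checkId("rightId", rightId);
    }

    static int unpackLeft(int packed) {
        return packed >>> ID_BITS;
    }

    static int unpackRight(int packed) {
        // The mask must cover all 13 bits; a 12-bit mask would silently drop the top bit.
        return packed & (ID_LIMIT - 1);
    }

    /** Treats "*" and the empty string alike: both mean "no base form" (null). */
    static String normalizeBaseForm(String baseForm) {
        if (baseForm == null || baseForm.isEmpty() || baseForm.equals("*")) {
            return null;
        }
        return baseForm;
    }

    /** Substitutes the base form only when a real one exists; otherwise keeps the surface form. */
    static String stem(String surface, String baseForm) {
        String normalized = normalizeBaseForm(baseForm);
        return normalized == null ? surface : normalized;
    }
}
```

With this sketch, an id such as 4096 (which needs the 13th bit) survives a 
pack/unpack round trip, an id of 8192 fails fast even when the JVM runs 
without -ea, and an entry with an empty base form keeps its surface form 
instead of being replaced by an empty token.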



