Mike Sokolov created LUCENE-8863:
------------------------------------

             Summary: Improve handling of edge cases in Kuromoji's 
DictionaryBuilder
                 Key: LUCENE-8863
                 URL: https://issues.apache.org/jira/browse/LUCENE-8863
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Mike Sokolov
            Assignee: Mike Sokolov


While building a custom Kuromoji system dictionary, I discovered a few issues.

First, the dictionary encoding has room for 13-bit left and right ids, but in 
practice only 12 bits work, since that was all the IPADIC dictionary that ships 
with Kuromoji needed. The good news is that we can easily support the full 13 
bits by fixing the bit-twiddling math.
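
To illustrate how a 13-bit field can silently behave like a 12-bit one, here 
is a minimal sketch. It assumes a hypothetical layout in which an id shares a 
16-bit short with three low-order flag bits; the class name, the constants, and 
the layout itself are illustrative assumptions and may not match the actual 
Kuromoji encoding. The point is only that decoding a signed short without 
masking sign-extends the top bit, so ids of 4096 and above come back wrong.

```java
// Hypothetical sketch, not the Kuromoji source: a 13-bit id packed with
// 3 flag bits into one 16-bit short.
final class IdPacking {
    static final int FLAG_BITS = 3;                     // assumed flag width
    static final int ID_LIMIT = 1 << (16 - FLAG_BITS);  // 8192, i.e. 13 bits

    // Pack an id and flags into a short; the top id bit lands in the sign bit.
    static short pack(int id, int flags) {
        if (id < 0 || id >= ID_LIMIT) {
            throw new IllegalArgumentException("id out of range: " + id);
        }
        if (flags < 0 || flags >= (1 << FLAG_BITS)) {
            throw new IllegalArgumentException("flags out of range: " + flags);
        }
        return (short) ((id << FLAG_BITS) | flags);
    }

    // Naive decode: the short is sign-extended to int before shifting,
    // so this is only correct for ids below 4096 (12 bits).
    static int unpackSigned(short packed) {
        return packed >> FLAG_BITS;
    }

    // Masked decode: treat the short as unsigned first, which recovers
    // the full 13-bit range.
    static int unpackUnsigned(short packed) {
        return (packed & 0xFFFF) >>> FLAG_BITS;
    }
}
```

With this layout, ids up to 4095 round-trip through either decoder, while 
larger ids round-trip only through the masked version.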

Second, the dictionary builder has a number of assertions that help uncover 
problems in the input (like these overlarge ids), but assertions aren't 
enabled by default, so an unsuspecting new user gets no benefit from them. We 
should upgrade them to "real" exceptions.
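
The difference between the two kinds of check can be sketched as follows. The 
class and method names are hypothetical and the limit of 8192 simply reflects 
the 13-bit id range discussed above; this is not the builder's actual code. A 
Java `assert` is skipped unless the JVM is started with `-ea`, whereas an 
explicit check always runs.

```java
// Hypothetical sketch of an input check in a dictionary builder.
final class IdChecks {
    static final int ID_LIMIT = 8192; // assumed: ids must fit in 13 bits

    // Only effective when assertions are enabled (java -ea ...);
    // otherwise bad input passes through silently.
    static void assertValidId(int id) {
        assert id >= 0 && id < ID_LIMIT : "id out of range: " + id;
    }

    // Always enforced, regardless of JVM flags.
    static void requireValidId(int id) {
        if (id < 0 || id >= ID_LIMIT) {
            throw new IllegalArgumentException(
                "id out of range: " + id + " (must be in [0, " + ID_LIMIT + "))");
        }
    }
}
```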

Finally, we want to handle the case of empty base forms differently. Kuromoji 
does stemming by substituting a base form for a word when there is a base form 
in the dictionary. Missing base forms are expected to be supplied as {{*}}, but 
if a dictionary provides an empty string as the base form, we end up stripping 
that token completely. Since an empty base form has no possible meaning (and 
the dictionary builder already treats {{*}} and empty strings as equivalent in 
a number of other cases), I think we should simply ignore empty base forms 
rather than replacing words with empty strings when tokenizing.
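
A minimal sketch of the proposed behavior, assuming a hypothetical helper that 
is not part of Kuromoji: both `*` and the empty string are treated as "no base 
form", so the token keeps its surface form instead of being replaced by an 
empty string.

```java
// Hypothetical sketch of base-form handling; names are illustrative only.
final class BaseForms {
    // Returns null when the dictionary entry carries no usable base form.
    static String normalize(String baseForm) {
        if (baseForm == null || baseForm.isEmpty() || "*".equals(baseForm)) {
            return null;
        }
        return baseForm;
    }

    // Substitutes the base form for the surface form only when one exists.
    static String stem(String surface, String baseForm) {
        String normalized = normalize(baseForm);
        return normalized == null ? surface : normalized;
    }
}
```

Under these assumptions, `stem("走っ", "走る")` yields the base form, while 
`stem("走っ", "")` and `stem("走っ", "*")` both leave the token unchanged.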



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
