[ https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867628#comment-16867628 ]
Tomoko Uchida commented on LUCENE-8863:
---------------------------------------

Thanks [~sokolov], I haven't tried the branch yet, but I think I understand the intention of your commit.

After some delay, I have just started learning the kuromoji code and the source tree structure. I will open an issue soon to make it possible to build the dictionary as a separate jar, which can be loaded by the newly added constructor here. (The patch may include a new ant task and some refactoring of the source tree structure.)

> Improve handling of edge cases in Kuromoji's DictionaryBuilder
> --------------------------------------------------------------
>
>                 Key: LUCENE-8863
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8863
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Assignee: Mike Sokolov
>            Priority: Major
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but
> really only supports 12 bits since this was all that was needed for the
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily
> add support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover
> problems in the input (like these overlarge ids), but the assertions aren't
> enabled by default, so an unsuspecting new user doesn't get any benefit from
> them, so we should upgrade to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji
> does stemming by substituting a base form for a word when there is a base
> form in the dictionary. Missing base forms are expected to be supplied as
> {{*}}, but if a dictionary provides an empty string base form, we would end
> up stripping that token completely.
> Since there is no possible meaning for an
> empty base form (and the dictionary builder already treats {{*}} and empty
> strings as equivalent in a number of other cases), I think we should simply
> ignore empty base forms (rather than replacing words with empty strings when
> tokenizing!)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
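For readers following the three points in the issue description, here is a minimal Java sketch of how they could look in code. It is not the Kuromoji implementation. The class name `DictionaryEntrySketch`, the layout (a 13-bit id sharing a 16-bit `short` with 3 flag bits), and every method name are assumptions made for illustration. The sketch shows one plausible way a 13-bit field ends up supporting only 12 bits (sign extension when the `short` is read back), a range check that throws a real exception instead of relying on `assert`, and a base-form lookup that treats `*` and the empty string as "no base form".

```java
public class DictionaryEntrySketch {
    // Hypothetical layout: 13 id bits and 3 flag bits packed into one 16-bit short.
    static final int ID_BITS = 13;
    static final int FLAG_BITS = 3;
    static final int MAX_ID = (1 << ID_BITS) - 1;

    // Pack an id and flags. The checks throw real exceptions, so they fire
    // even when the JVM runs without -ea (assertions disabled).
    static short pack(int id, int flags) {
        if (id < 0 || id > MAX_ID) {
            throw new IllegalArgumentException(
                "id out of range [0, " + MAX_ID + "]: " + id);
        }
        if (flags < 0 || flags >= (1 << FLAG_BITS)) {
            throw new IllegalArgumentException("flags out of range: " + flags);
        }
        return (short) (id << FLAG_BITS | flags);
    }

    // Naive read: the short is sign-extended to int before the shift, so any
    // id that sets the top bit (4096 and above) comes back wrong.
    static int unpackIdNaive(short packed) {
        return packed >>> FLAG_BITS;
    }

    // Masking to 16 bits first removes the sign extension and recovers
    // the full 13-bit range.
    static int unpackId(short packed) {
        return (packed & 0xFFFF) >>> FLAG_BITS;
    }

    // Treat null, "" and "*" alike: all three mean "no base form".
    static String normalizeBaseForm(String baseForm) {
        if (baseForm == null || baseForm.isEmpty() || baseForm.equals("*")) {
            return null;
        }
        return baseForm;
    }

    // Substitute the base form only when one is actually present;
    // otherwise keep the surface form instead of emitting an empty token.
    static String stem(String surface, String baseForm) {
        String normalized = normalizeBaseForm(baseForm);
        return normalized == null ? surface : normalized;
    }

    public static void main(String[] args) {
        short packed = pack(5000, 0);
        System.out.println("masked read: " + unpackId(packed));
        System.out.println("naive read:  " + unpackIdNaive(packed));
        System.out.println("empty base form keeps surface: " + stem("itta", ""));
    }
}
```

In this sketch, ids up to 4095 survive both reads, which is why a dictionary that never needs more than 12 bits would not expose the difference. Calling `pack(8192, 0)` throws `IllegalArgumentException` whether or not assertions are enabled.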