[ https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868999#comment-16868999 ]
ASF subversion and git services commented on LUCENE-8863:
---------------------------------------------------------
Commit 4502065f03654af204f23d7c90ee95c28d97f987 in lucene-solr's branch
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4502065 ]
LUCENE-8863: enhance Kuromoji DictionaryBuilder tool
added tests
enabled ids up to 8191
support loading custom system dictionary from filesystem or classpath
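
For reference, 8191 is 2^13 - 1, the largest value that fits in 13 bits. The following is only a minimal sketch of why widening a packed id from 12 to 13 bits can need a change to the bit math. It assumes a hypothetical layout in which a 16-bit value holds the id in its upper 13 bits and 3 flag bits below it. The class name, method names, and layout are illustrative assumptions, not the actual Kuromoji code.

```java
// Hypothetical sketch, not the Kuromoji implementation.
// Assumed layout: 16-bit value = 13-bit id (high bits) + 3 flag bits (low bits).
final class IdPackingSketch {
    static final int FLAG_BITS = 3;
    static final int ID_LIMIT = 1 << 13; // 8192, so the largest valid id is 8191

    // Packs an id and flags into a short, rejecting out-of-range input with a
    // real exception rather than an assert (asserts are off unless -ea is set).
    static short pack(int id, int flags) {
        if (id < 0 || id >= ID_LIMIT) {
            throw new IllegalArgumentException(
                "id out of range [0, " + (ID_LIMIT - 1) + "]: " + id);
        }
        if (flags < 0 || flags >= (1 << FLAG_BITS)) {
            throw new IllegalArgumentException("flags out of range: " + flags);
        }
        return (short) (id << FLAG_BITS | flags);
    }

    // Naive decode: the short is sign-extended to int before the shift, so this
    // is only correct while the top bit is clear, i.e. for ids below 4096.
    static int unpackNaive(short packed) {
        return packed >>> FLAG_BITS;
    }

    // Masking to 16 bits first treats the short as unsigned, so all 13 id bits
    // survive the shift.
    static int unpack(short packed) {
        return (packed & 0xffff) >>> FLAG_BITS;
    }
}
```

Under these assumptions, `unpackNaive` and `unpack` agree for ids up to 4095 and diverge from 4096 on, which is the kind of 12-bit ceiling a sign-extension slip would produce.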
> Improve Kuromoji DictionaryBuilder error handling, and enable loading
> external dictionary for testing
> ------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-8863
> URL: https://issues.apache.org/jira/browse/LUCENE-8863
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mike Sokolov
> Assignee: Mike Sokolov
> Priority: Major
> Time Spent: 3h
> Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but
> in practice only supports 12 bits, since that was all that was needed for the
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily
> add support for the full 13 bits by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover
> problems in the input (like these overlarge ids), but assertions aren't
> enabled by default, so an unsuspecting new user gets no benefit from them.
> We should upgrade them to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji
> does stemming by substituting a base form for a word when there is a base
> form in the dictionary. Missing base forms are expected to be supplied as
> {{*}}, but if a dictionary provides an empty string base form, we would end
> up stripping that token completely. Since there is no possible meaning for an
> empty base form (and the dictionary builder already treats {{*}} and empty
> strings as equivalent in a number of other cases), I think we should simply
> ignore empty base forms (rather than replacing words with empty strings when
> tokenizing!)
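
The proposed handling of empty base forms can be sketched roughly as follows. This is an illustration only: `BaseFormSketch`, `normalizeBaseForm`, and `stem` are hypothetical names, not part of the Kuromoji API, and the sketch assumes the base form arrives as a plain string field from a dictionary entry.

```java
// Hypothetical sketch, not the Kuromoji implementation.
final class BaseFormSketch {
    // Treats "*", the empty string, and null as "no base form" and returns null
    // for all of them. Any other value is returned unchanged.
    static String normalizeBaseForm(String baseForm) {
        if (baseForm == null || baseForm.isEmpty() || baseForm.equals("*")) {
            return null;
        }
        return baseForm;
    }

    // Substitutes the base form for the surface form only when a usable base
    // form exists, so an empty base form never replaces a token with "".
    static String stem(String surface, String baseForm) {
        String normalized = normalizeBaseForm(baseForm);
        return normalized == null ? surface : normalized;
    }
}
```

With this sketch, `stem("walked", "walk")` yields the base form, while `stem("walked", "")` and `stem("walked", "*")` both leave the surface form untouched.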