[jira] [Comment Edited] (LUCENE-8863) Improve handling of edge cases in Kuromoji's DictionaryBuilder

Mike Sokolov (JIRA) Sat, 15 Jun 2019 12:57:24 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16864841#comment-16864841
 ]


Mike Sokolov edited comment on LUCENE-8863 at 6/15/19 7:56 PM:
---------------------------------------------------------------

{quote}Can we just throw an exception on empty base form? It sounds like a 
missing check in the code. I don't think it's good to try to support N different 
ways of doing things when only one is tested (the ipadic way)
{quote}
I agree with the sentiment: be strict, and keep it simple. One thing is that we 
already handle empty POS fields by ignoring them. E.g., in the section where it 
says "build up the POS string" we concatenate the various POS tokens with "-" 
as a separator, skipping empty ones so that we never emit adjacent separator 
chars. The other thing is that I don't know what dictionaries may already 
exist. Is there an externally defined standard we should accept? I can certainly 
modify the dictionary I have to have "*," but what about Unidic, or dictionaries 
people might get from Sudachi or other neologd providers? If there is some 
common usage that expects empty strings, I think we should support it; it 
really is kind of natural to express a missing value with an empty string. Are 
there people here who have looked at those?
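
To make the POS handling concrete, here is a minimal sketch of the "skip empty fields when joining" behavior described above. It is not the actual DictionaryBuilder code: the class and method names are hypothetical, and treating "*" the same as an empty string is an assumption based on the issue description.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from the Kuromoji sources.
final class PosStringSketch {

  /**
   * Joins POS fields with "-" as the separator. Missing fields (null, empty,
   * or "*", the last being an assumption) are skipped, so the result never
   * contains leading, trailing, or adjacent separators.
   */
  static String buildPos(List<String> fields) {
    StringBuilder sb = new StringBuilder();
    for (String field : fields) {
      if (field == null || field.isEmpty() || field.equals("*")) {
        continue; // a missing value contributes nothing, not even a separator
      }
      if (sb.length() > 0) {
        sb.append('-');
      }
      sb.append(field);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // The empty second field and the trailing "*" are both ignored.
    System.out.println(buildPos(Arrays.asList("noun", "", "general", "*")));
  }
}
```

With this approach an empty field and a "*" field are indistinguishable after the build, which is the same leniency being proposed for base forms.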

 

[Here's a link to a preliminary patch|https://github.com/apache/lucene-solr/pull/722] (no tests yet)



> Improve handling of edge cases in Kuromoji's DictionaryBuilder
> --------------------------------------------------------------
>
>                 Key: LUCENE-8863
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8863
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Assignee: Mike Sokolov
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support for the full 13 bits by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but assertions aren't 
> enabled by default, so an unsuspecting new user gets no benefit from 
> them. We should upgrade them to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing).
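
The three points in the description can be sketched together as follows. This is an illustration under stated assumptions, not the Kuromoji implementation: the class and method names are hypothetical, and the layout (a 16-bit value holding a 13-bit id above 3 flag bits) is assumed for the example. It shows one way a 13-bit id can silently degrade to 12 bits in Java (sign extension of a `short` before an unsigned shift), a range check that throws a real exception instead of relying on `assert`, and a normalization that treats an empty base form like `*`.

```java
// Hypothetical illustration only; names and bit layout are assumptions.
final class DictionaryEntrySketch {
  static final int FLAG_BITS = 3;                 // assumed: low 3 bits hold flags
  static final int MAX_ID = (1 << 13) - 1;        // largest 13-bit id: 8191

  /** Packs an id and flags into 16 bits, rejecting bad input with an exception. */
  static short pack(int id, int flags) {
    // A real exception fires even when the JVM runs without -ea.
    if (id < 0 || id > MAX_ID) {
      throw new IllegalArgumentException("id out of range [0, " + MAX_ID + "]: " + id);
    }
    if (flags < 0 || flags >= (1 << FLAG_BITS)) {
      throw new IllegalArgumentException("flags out of range: " + flags);
    }
    return (short) (id << FLAG_BITS | flags);
  }

  /** Correct decode: mask to 16 bits first so the top bit is not sign-extended. */
  static int unpackId(short packed) {
    return (packed & 0xFFFF) >>> FLAG_BITS;
  }

  /** Naive decode: the short is promoted to a negative int when its top bit is set. */
  static int unpackIdNaive(short packed) {
    return packed >>> FLAG_BITS;
  }

  /** Treats null, "" and "*" alike as "no base form"; returns null in that case. */
  static String normalizeBaseForm(String baseForm) {
    if (baseForm == null || baseForm.isEmpty() || baseForm.equals("*")) {
      return null;
    }
    return baseForm;
  }

  public static void main(String[] args) {
    short packed = pack(4097, 0); // needs the 13th bit
    System.out.println("masked decode: " + unpackId(packed));
    System.out.println("naive decode:  " + unpackIdNaive(packed));
    System.out.println("base form:     " + normalizeBaseForm(""));
  }
}
```

Under these assumptions, ids up to 4095 round-trip through either decoder, which is why a dictionary the size of IPADIC would never expose the problem; only ids that set the top bit of the 16-bit value need the mask.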


