[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

Shawn Heisey (JIRA) Tue, 08 May 2018 19:14:24 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16468227#comment-16468227
 ]


Shawn Heisey commented on LUCENE-7960:
--------------------------------------

All good points.

I made the min/max parameters required on the factory because the constructor 
without any size parameters is deprecated.  Is this something you don't like at 
all, or something you would only want to see in master?  You're right that 
there is some backcompat confusion in that.  I wasn't completely sure it was a 
good idea, decided to proceed, with the probability of not making that change 
in the backport to 7x.

If we do need to keep default values, the current defaults (1 and 2 for ngram, 
1 and 1 for edgengram) really kinda suck.  But changing them is another 
backcompat break.  Seems better to completely get rid of defaults in master, 
and unless I misunderstood you, that seems to be your position as well.

I would need to review the patch to be sure, but I think that the cosmetic 
style changes you mentioned were made by the contributor before I started 
working on it.  The changes to the inner workings of the filter looked like 
they were doing more than the issue requires, but I don't understand token 
handling code well enough to grasp what the changes were actually doing.  Tests 
pass, so if there's a problem there, it's not being caught by test coverage.  
Please feel free to adjust or make suggestions!

I'm done in for the night, and will poke at it again tomorrow.  If you could 
summarize everything that you'd like to see in a new patch, that would help me.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7960
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7960
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Shawn Heisey
>            Priority: Major
>         Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

Reply via email to