[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

Robert Muir (JIRA) Tue, 08 May 2018 17:30:28 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16468156#comment-16468156
 ]


Robert Muir commented on LUCENE-7960:
-------------------------------------

The patch has a little confusion about back compat (e.g. breaks back compat 
with the factories by requiring parameters that were optional before, but 
leaves back compat in the tokenfilters), so I'm not sure if its geared at the 
master branch or not. Sometimes its easiest to make the patch with all the 
back-compat, commit it to master and merge it back, then make a separate commit 
to just master to remove the cruft, maybe its good in this case.

There are some cosmetic style changes such as moving attribute initialization 
into the ctor instead of inline, that is different than the style of all our 
other tokenfilters. It makes it hard to review the logic changes (have not 
looked at this, just the apis and docs).

As far as docs, I think there are easy wins. Lets take EdgeNGramTokenFilter 
just as an example.

For the ctor with all the parameters, it doesn't need to have documentation on 
what the other ctors do: they can have their own. It should only document the 
behavior and parameters like it does, so we can just remove its last line about 
that.

For the other ctors which are shortcuts/sugar, we can add a line such as this:
{code}
   * <p>
   * Behaves the same as {@link #EdgeNGramTokenFilter(TokenStream, int, int, 
boolean) 
   *                             EdgeNGramTokenFilter(input, minGram, maxGram, 
false)}
{code}

This helps make it clear what the shortcut/sugar is really doing with a 
clickable link, and it also helps the deprecated case, if someone has to 
transition their code.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7960
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7960
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Shawn Heisey
>            Priority: Major
>         Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

Reply via email to