subject:"\[jira\] \[Commented\] \(LUCENE\-7960\) NGram filters \-\- preserve the original token when it is outside the min\/max size range"

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-06-05 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502061#comment-16502061
 ] 

ASF subversion and git services commented on LUCENE-7960:
-

Commit 3694bbdaaf9f9b094db364c892cd707facb0d680 in lucene-solr's branch 
refs/heads/branch_7x from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3694bbd ]

LUCENE-7960: fix Solr test to include mandatory args

(cherry picked from commit c587598)


> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Assignee: Robert Muir
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-06-05 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502057#comment-16502057
 ] 

ASF subversion and git services commented on LUCENE-7960:
-

Commit c587598096cde769c299594fb26d0a23b7bd5930 in lucene-solr's branch 
refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c587598 ]

LUCENE-7960: fix Solr test to include mandatory args


> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Assignee: Robert Muir
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-06-04 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501211#comment-16501211
 ] 

ASF subversion and git services commented on LUCENE-7960:
-

Commit 5c6a49b13f47789c828995f747ec541810bdd0b4 in lucene-solr's branch 
refs/heads/master from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5c6a49b ]

LUCENE-7960: remove deprecations


> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Assignee: Robert Muir
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-06-04 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501163#comment-16501163
 ] 

ASF subversion and git services commented on LUCENE-7960:
-

Commit 98bf43b3da5131f0d27c747ac8bfbe28945cc922 in lucene-solr's branch 
refs/heads/branch_7x from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=98bf43b ]

LUCENE-7960: Add preserveOriginal option to the NGram and EdgeNGram filters

(this is a correction of the issue number in both the CHANGES.txt and the 
commit message, sorry for the noise).


> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Assignee: Robert Muir
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-06-04 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501130#comment-16501130
 ] 

ASF subversion and git services commented on LUCENE-7960:
-

Commit 208d4a9c346ab0dca6c4ae659d55b9446b7d8c87 in lucene-solr's branch 
refs/heads/master from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=208d4a9 ]

LUCENE-7960: Add preserveOriginal option to the NGram and EdgeNGram filters

(this is a correction of the issue number in both the CHANGES.txt and the 
commit message, sorry for the noise).


> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Assignee: Robert Muir
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-06-04 Thread Ingomar Wesp (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500742#comment-16500742
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

So ... anyone willing to merge this into master?

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-22 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483721#comment-16483721
 ] 

Robert Muir commented on LUCENE-7960:
-

looks good. thank you for making the updates.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-20 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482180#comment-16482180
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

Just updated the patch. In the end, I decided against simplifying the filter 
logic, because that would mean that the sequence generated tokens would be a 
bit odd: It would have meant that the original term would always be generated 
first, unless it is in the minGram / maxGram range, which makes the tests a bit 
hard to read. The way it is now, the term is either output before the shortest 
(if it is shorter) or after the longest n-gram (if it is longer), which makes 
more sense IMHO.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-16 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477950#comment-16477950
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

Thanks for looking into this. No, I didn't get anything, even when I was 
mentioned directly. Sadly, I'm not adminstrating my own mail server, so I can't 
really rule out that there's some mail transport issue on my end. I'll try 
alternative mail addresses see if that helps.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-15 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476817#comment-16476817
 ] 

Robert Muir commented on LUCENE-7960:
-

{quote}
*) Even though I'm watching this issue, I'm not getting mails from Jira. Is 
this intentional for non-commiters?
{quote}

As far as I know, JIRA doesn't consider any roles. This is what the 
configuration says:

|Issue Commented| * All Watchers
 * Current Assignee
 * Reporter
 * Single Email Address (dev at lucene.apache.org)|

I added you to Contributors group so you can assign issues: maybe it helps. But 
it could be something SMTP-related or some other problem. Did you get any 
notifications when Shawn mentioned you on this issue?

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-15 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476383#comment-16476383
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

Sorry for not responding earlier*. Looks good to me; however, the logic can be 
simplified further, now that we don't care about differentiating between the 
two cases anymore. Unless someone else wants  to address this and Robert's 
other suggestions, I would like to further refine the patch and submit a new 
one this weekend.

 

*) Even though I'm watching this issue, I'm not getting mails from Jira. Is 
this intentional for non-commiters?

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468292#comment-16468292
 ] 

Robert Muir commented on LUCENE-7960:
-

{quote}
I made the min/max parameters required on the factory because the constructor 
without any size parameters is deprecated. Is this something you don't like at 
all, or something you would only want to see in master?
{quote}

what does it mean "not making that change in the backport to 7x" ?
As i suggested above: consider making the patch against master fully backwards 
compatible. We can review it, then it can be committed, merged cleanly and 
safely back to 7.x. After that, remove the deprecations in master in a separate 
dedicated commit.

It seems like more work, but I think its less work than trying to do a 
shortcut, because you can have confidence you don't break stuff. "Making 
changes during backports" seems like trouble, and having a confusing patch 
makes the code review hard. The current one is confusing because it isn't 
really appropriate for either master (it has deprecations) nor 7x (it breaks 
backwards)


> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Shawn Heisey (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468227#comment-16468227
 ] 

Shawn Heisey commented on LUCENE-7960:
--

All good points.

I made the min/max parameters required on the factory because the constructor 
without any size parameters is deprecated.  Is this something you don't like at 
all, or something you would only want to see in master?  You're right that 
there is some backcompat confusion in that.  I wasn't completely sure it was a 
good idea, decided to proceed, with the probability of not making that change 
in the backport to 7x.

If we do need to keep default values, the current defaults (1 and 2 for ngram, 
1 and 1 for edgengram) really kinda suck.  But changing them is another 
backcompat break.  Seems better to completely get rid of defaults in master, 
and unless I misunderstood you, that seems to be your position as well.

I would need to review the patch to be sure, but I think that the cosmetic 
style changes you mentioned were made by the contributor before I started 
working on it.  The changes to the inner workings of the filter looked like 
they were doing more than the issue requires, but I don't understand token 
handling code well enough to grasp what the changes were actually doing.  Tests 
pass, so if there's a problem there, it's not being caught by test coverage.  
Please feel free to adjust or make suggestions!

I'm done in for the night, and will poke at it again tomorrow.  If you could 
summarize everything that you'd like to see in a new patch, that would help me.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468161#comment-16468161
 ] 

Robert Muir commented on LUCENE-7960:
-

Also for the full ctor that allows a range, i really still think it needs some 
wording, a warning of sorts, that a big range is really the same 
(space/time-wise) as indexing the content N different ways. It may be also good 
to include the fact that if you pass {{true}} for preserveOriginal, its like 
indexing the content yet another time.

The ctor that just takes a fixed "n" for the n-gram doesn't need such warnings, 
its pretty safe.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468156#comment-16468156
 ] 

Robert Muir commented on LUCENE-7960:
-

The patch has a little confusion about back compat (e.g. breaks back compat 
with the factories by requiring parameters that were optional before, but 
leaves back compat in the tokenfilters), so I'm not sure if its geared at the 
master branch or not. Sometimes its easiest to make the patch with all the 
back-compat, commit it to master and merge it back, then make a separate commit 
to just master to remove the cruft, maybe its good in this case.

There are some cosmetic style changes such as moving attribute initialization 
into the ctor instead of inline, that is different than the style of all our 
other tokenfilters. It makes it hard to review the logic changes (have not 
looked at this, just the apis and docs).

As far as docs, I think there are easy wins. Lets take EdgeNGramTokenFilter 
just as an example.

For the ctor with all the parameters, it doesn't need to have documentation on 
what the other ctors do: they can have their own. It should only document the 
behavior and parameters like it does, so we can just remove its last line about 
that.

For the other ctors which are shortcuts/sugar, we can add a line such as this:
{code}
   * 
   * Behaves the same as {@link #EdgeNGramTokenFilter(TokenStream, int, int, 
boolean) 
   * EdgeNGramTokenFilter(input, minGram, maxGram, 
false)}
{code}

This helps make it clear what the shortcut/sugar is really doing with a 
clickable link, and it also helps the deprecated case, if someone has to 
transition their code.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Shawn Heisey (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468129#comment-16468129
 ] 

Shawn Heisey commented on LUCENE-7960:
--

Uploading new patch.  [~iwesp], please make sure I haven't broken anything, 
including test coverage.  Tests and precommit do pass.  [~rcmuir], can you look 
it over too?


> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467968#comment-16467968
 ] 

Robert Muir commented on LUCENE-7960:
-

then it would behave like you expect an n-gram filter to behave? min=max=4 or 
whatever. The two ints is really crazy/expert and doesn't match anyone's 
expectations about what n-grams are. Its also mega trappy as i mentioned above, 
it needs javadoc warnings.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Shawn Heisey (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467964#comment-16467964
 ] 

Shawn Heisey commented on LUCENE-7960:
--

What would happen when there's only one int in the constructor?

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467894#comment-16467894
 ] 

Robert Muir commented on LUCENE-7960:
-

Yes, I think we should deprecate. It helps ppl upgrade and shouldn't be too bad 
in this case.

If we currently have 1-arg (TokenStream) and 3-arg (TokenStream, int, int), and 
we want to end up at 2-arg (TokenStream, int) and 4-arg (TokenStream, int, int, 
boolean) then 7.x can temporarily have 4 constructors: the existing two of 
which are deprecated and forward to the new ones. Their javadoc can even 
explain what the forwarding is doing. master would just have the two new ones 
with no cruft.


> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467599#comment-16467599
 ] 

David Smiley commented on LUCENE-7960:
--

A new preserveOriginal will be nice, and also consistent with other filters 
that have this.
I agree with Rob about the defaults; users should explicitly state the gram 
length(s).

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-06 Thread Shawn Heisey (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465233#comment-16465233
 ] 

Shawn Heisey commented on LUCENE-7960:
--

Thanks for the clarification.  Should the no-arg constructor go through 
deprecation in 7.x?


> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-06 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465221#comment-16465221
 ] 

Robert Muir commented on LUCENE-7960:
-

There is no need to have only one constructor: two many parameters for the 
simple use case.

I already explained my preference as to what they should be:
* NgramWhateverFilter(TokenStream, int)
* NgramWhateverFilter(TokenStream, int, int, boolean)

So remove the no-arg constructor, which means there is no need for any default 
min/max.
It is also important that the factory match this. Whatever parameters are 
mandatory for the tokenfilter also needs to be mandatory in the factory, too. I 
will insist on it.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-06 Thread Shawn Heisey (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465199#comment-16465199
 ] 

Shawn Heisey commented on LUCENE-7960:
--

OK, so we re-engineer to only add a preserveOriginal parameter.  That parameter 
will keep original term when it is outside the min/max range.

For addressing the traps: Is that just removing the no-arg constructor, 
changing the default min/max, both, or was there something else you had in mind?

In master, what constructors do you think should be there?  My bias is to only 
have one, but I don't live and breathe Lucene code like you do, so I trust your 
judgement more than mine.


> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

23 matches

Site Navigation

Mail list logo

Footer information