from:"Ingomar Wesp \(JIRA\)"

[jira] [Commented] (SOLR-5332) Add "preserve original" setting to the EdgeNGramFilterFactory

2018-06-12 Thread Ingomar Wesp (JIRA)



[ 
https://issues.apache.org/jira/browse/SOLR-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510060#comment-16510060
 ] 

Ingomar Wesp commented on SOLR-5332:


Given that LUCENE-7960 has been closed, I think this issue can be marked as 
fixed, too.

> Add "preserve original" setting to the EdgeNGramFilterFactory
> -
>
> Key: SOLR-5332
> URL: https://issues.apache.org/jira/browse/SOLR-5332
> Project: Solr
>  Issue Type: Wish
>Affects Versions: 4.4, 4.5, 4.5.1, 4.6
>Reporter: Alexander S.
>Priority: Major
> Fix For: 5.2, 6.0
>
>
> Hi, as described here: 
> http://lucene.472066.n3.nabble.com/Help-to-figure-out-why-query-does-not-match-td4086967.html
>  the problem is in that if you have these 2 strings to index:
> 1. facebook.com/someuser.1
> 2. facebook.com/someveryandverylongusername
> and the edge ngram filter factory with min and max gram size settings 2 and 
> 25, search requests for these urls will fail.
> But search requests for:
> 1. facebook.com/someuser
> 2. facebook.com/someveryandverylonguserna
> will work properly.
> It's because first url has "1" at the end, which is lover than the allowed 
> min gram size. In the second url the user name is longer than the max gram 
> size (27 characters).
> Would be good to have a "preserve original" option, that will add the 
> original string to the index if it does not fit the allowed gram size, so 
> that "1" and "someveryandverylongusername" tokens will also be added to the 
> index.
> Best,
> Alex



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-5152) EdgeNGramFilterFactory deletes token

2018-06-12 Thread Ingomar Wesp (JIRA)



[ 
https://issues.apache.org/jira/browse/SOLR-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510056#comment-16510056
 ] 

Ingomar Wesp commented on SOLR-5152:


Given that LUCENE-7960 has been closed, I think this issue can be marked as 
fixed as well.

> EdgeNGramFilterFactory deletes token
> 
>
> Key: SOLR-5152
> URL: https://issues.apache.org/jira/browse/SOLR-5152
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 4.4
>Reporter: Christoph Lingg
>Priority: Major
> Attachments: SOLR-5152-v5.0.0.patch, SOLR-5152.patch
>
>
> I am using EdgeNGramFilterFactory in my schema.xml
> {code:xml} positionIncrementGap="100">
>   
> 
>  maxGramSize="10" side="front" />
>   
> {code}
> Some tokens in my index only consist of one character, let's say {{R}}. 
> minGramSize is set to 2 and is bigger than the length of the token. I 
> expected the NGramFilter to left {{R}} unchanged but in fact it is deleting 
> the token.
> For my use case this interpretation is undesirable, and probably for most use 
> cases too!?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-06-04 Thread Ingomar Wesp (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500742#comment-16500742
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

So ... anyone willing to merge this into master?

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-20 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482180#comment-16482180
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

Just updated the patch. In the end, I decided against simplifying the filter 
logic, because that would mean that the sequence generated tokens would be a 
bit odd: It would have meant that the original term would always be generated 
first, unless it is in the minGram / maxGram range, which makes the tests a bit 
hard to read. The way it is now, the term is either output before the shortest 
(if it is shorter) or after the longest n-gram (if it is longer), which makes 
more sense IMHO.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-20 Thread Ingomar Wesp (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ingomar Wesp updated LUCENE-7960:
-
Attachment: LUCENE-7960.patch

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-16 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477950#comment-16477950
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

Thanks for looking into this. No, I didn't get anything, even when I was 
mentioned directly. Sadly, I'm not adminstrating my own mail server, so I can't 
really rule out that there's some mail transport issue on my end. I'll try 
alternative mail addresses see if that helps.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Comment Edited] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-15 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476383#comment-16476383
 ] 

Ingomar Wesp edited comment on LUCENE-7960 at 5/15/18 7:40 PM:
---

Sorry for not responding earlier*. Looks good to me; however, the logic can be 
simplified further, now that we don't care about differentiating between the 
two cases anymore. Unless someone else wants  to address this and Robert's 
other suggestions, I would like to further refine the patch and submit a new 
one this weekend.

*) Even though I'm watching this issue, I'm not getting mails from Jira. Is 
this intentional for non-commiters?


was (Author: iwesp):
Sorry for not responding earlier*. Looks good to me; however, the logic can be 
simplified further, now that we don't care about differentiating between the 
two cases anymore. Unless someone else wants  to address this and Robert's 
other suggestions, I would like to further refine the patch and submit a new 
one this weekend.

 

*) Even though I'm watching this issue, I'm not getting mails from Jira. Is 
this intentional for non-commiters?

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-15 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476383#comment-16476383
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

Sorry for not responding earlier*. Looks good to me; however, the logic can be 
simplified further, now that we don't care about differentiating between the 
two cases anymore. Unless someone else wants  to address this and Robert's 
other suggestions, I would like to further refine the patch and submit a new 
one this weekend.

 

*) Even though I'm watching this issue, I'm not getting mails from Jira. Is 
this intentional for non-commiters?

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-03 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461976#comment-16461976
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

I understand your concern. As far as I can tell, there are two options:

1) Replace the booleans with an enum that covers the four possible 
combinations. Maybe "keepMode" with values "DROP", "KEEP_SHORT_TERM", 
"KEEP_LONG_TERM", "KEEP_ALL". I'm not really happy with the names, though - 
advice welcome.

2) Fold the two booleans into one - maybe "preserveOriginal", akin to how the 
corresponding attribute in other filters is called.

I personally prefer 1), but I'd happily adapt the patch to implement the other 
if it makes things easier from your perspective.



> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461325#comment-16461325
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

Thanks a lot for your support. I don't quite understand your comment regarding 
the constructors: Unless I'm missing something, I think I _did_ preserve the 
original ones, which now delegate to the new ctors using defaullt values.

Is there anything left that I can or should do to get this into master?

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-04-25 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453005#comment-16453005
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

I've just updated the patch in PR #362. I now also have a working patch for 
NGramTokenizer and EdgeNGramTokenizer in a separate branch, but it's still 
pretty messy and thus not yet ready for a PR.

Since this issue deals with the filters specifically, could someone could have 
a look at PR #362 and merge it if it's acceptable? Once this is done, I would 
then open another issue for the tokenizers.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-04-04 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426221#comment-16426221
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

Ok, I just added the same paramteters to the NGramTokenFilter and updated the 
pull request. In the long term, it probably makes sense to move all the logic 
into the NGramTokenFilter and turn EdgeNGramTokenFilter into a simple wrapper. 
EdgeNGramTokenizer is already implemented this way.

I presume it also makes sense to extend NGramTokenizer and EdgeNGramTokenizer 
accordingly?

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-04-01 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421621#comment-16421621
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

Thanks for your feedback! Yes, this is exactly the behavior I tried to 
implement. I'll rename the parameters according to your suggestion.

I will also have a look at the other ngram components as soon as I have time.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-03-31 Thread Ingomar Wesp (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421475#comment-16421475
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

I'd like to propose a patch (see attached pull request #349) that adds two 
options to the EdgeNGramFilter:
 * keepShortTerms: Causes the filter to pass through input terms that are 
shorter than the minimum gram size.
 * keepLongTerms: Causes the filter to pass through input terms that are longer 
than the maximum gram size.

I'm not entirely sure about the usefulness of keepLongTerms, but enabling the 
ability pass through short terms would certainly be neat for queries where 
you'd like to match ALL tokens as either prefixes or exact terms, but some 
query tokens are shorter than the minimum gram size. As far is I understand, a 
second field containing the exact terms isn't really a viable alternative 
there, because you can easily run into situations where only a subset of query 
tokens matches for either field.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-5332) Add "preserve original" setting to the EdgeNGramFilterFactory

[jira] [Commented] (SOLR-5152) EdgeNGramFilterFactory deletes token

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Updated] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Comment Edited] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

14 matches

Site Navigation

Mail list logo

Footer information