[jira] [Commented] (LUCENE-8036) ShingleFilter should have an option to skip filler tokens (e.g. stop words)

Adrien Grand (JIRA) Mon, 20 Nov 2017 09:51:26 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-8036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259555#comment-16259555 ]


Adrien Grand commented on LUCENE-8036:
--------------------------------------

I am not keen on this option, because it would mean that term queries on 
shingles no longer match the same documents as a phrase query with a slop of 
zero.

If you want to do this, ideally you should use a tokenizer that does not emit 
stop words. We used to have options in StopFilter to remove positions of stop 
words, but this turned out to break token streams (e.g. in the case of 
multi-word synonyms where one of the sub-words is a stop word).
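
To make the first concern concrete, here is a minimal plain-Java sketch. It is 
a simplified model, not Lucene code: the class and method names are 
hypothetical, a null entry stands for a position whose stop word was removed, 
and the token positions are taken from the example sentence in the issue 
description. It compares a slop-0 phrase match (terms at consecutive 
positions) with a term match on a two-word shingle built after skipping 
removed positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this does not use or mirror Lucene's classes.
final class PhraseVsShingleSketch {

    /** Slop-0 phrase match: both terms must sit at consecutive positions. */
    static boolean exactPhrase(List<String> positions, String first, String second) {
        for (int i = 0; i + 1 < positions.size(); i++) {
            if (first.equals(positions.get(i)) && second.equals(positions.get(i + 1))) {
                return true;
            }
        }
        return false;
    }

    /** Term match on a two-word shingle built after dropping removed positions. */
    static boolean skipFillerShingle(List<String> positions, String first, String second) {
        List<String> kept = new ArrayList<>();
        for (String token : positions) {
            if (token != null) {
                kept.add(token);
            }
        }
        return exactPhrase(kept, first, second);
    }

    public static void main(String[] args) {
        // "A brown fox quickly jumps over the lazy dog"; null = removed stop word.
        List<String> positions = Arrays.asList(
                null, "brown", "fox", "quickly", "jump", null, null, "lazy", "dog");

        System.out.println(exactPhrase(positions, "jump", "lazy"));
        System.out.println(skipFillerShingle(positions, "jump", "lazy"));
    }
}
```

In this model the two checks disagree for "jump lazy": the positions are not 
adjacent, yet the skip-filler shingle exists. That disagreement is the 
mismatch described above.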

> ShingleFilter should have an option to skip filler tokens (e.g. stop words)
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-8036
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8036
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 7.1
>            Reporter: Edans Sandes
>            Priority: Trivial
>              Labels: ShingleFilter, StopFilter, StopWords
>         Attachments: SOLR-11604.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> ShingleFilterFactory should have an option to ignore filler tokens in the 
> total shingle size. 
> For instance (adapted from 
> [https://stackoverflow.com/questions/33193144/solr-stemming-stop-words-and-shingles-not-giving-expected-outputs]),
>  consider the text "A brown fox quickly jumps over the lazy dog". When we 
> remove stopwords and execute the ShingleFilter (shingle size = 3), it gives 
> us the following result:
> 1. _ brown fox
> 2. brown fox quickly
> 3. fox quickly jump
> 4. quickly jump _
> 5. jump _ _
> 6. _ _ lazy
> 7. _ lazy dog
> We can clearly see that the filler token "_" occupies one token in the 
> shingle.
> I suppose the returned shingles should be:
> 1. brown fox quickly
> 2. fox quickly jump
> 3. quickly jump lazy
> 4. jump lazy dog
> To maintain backward compatibility, I suggest adding an option called 
> "skipFillerTokens" to implement this behavior (note that this is different 
> from using fillerToken="", since the empty string still occupies one token 
> in the shingle).
> I've attached a patch for the ShingleFilter class (getNextToken() method), 
> ShingleFilterFactory and ShingleFilterTest classes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
