[
https://issues.apache.org/jira/browse/LUCENE-8036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270188#comment-16270188
]
David Smiley commented on LUCENE-8036:
--------------------------------------
It isn't clear to me why the setPreservePositionIncrements=false option was
removed (in LUCENE-4963 without a reason) but that seems to be water under the
bridge now. I remember folks complaining. The ramification of its absence is
that any down-stream consumer needs an option to toggle it, like
org.apache.lucene.util.QueryBuilder#setEnablePositionIncrements and
org.apache.lucene.search.suggest.analyzing.FuzzySuggester's constructor and
org.apache.lucene.search.suggest.document.CompletionAnalyzer's constructor, and
perhaps elsewhere. Now apparently shingle could use it too :-/. _Perhaps a
compromise to the absence of the old boolean on StopFilter might be a new
filter that sets posInc to 1?_ But even that raises the question of why such a
thing wouldn't be native to StopFilter. Your suggestion of "a tokenizer that
does not emit stop words" seems inflexible as it requires a custom tokenizer
and wouldn't allow the flexibility of putting the StopFilter at the right spot
in the chain (e.g. after WordDelimiterFilter).
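
To make the "filter that sets posInc to 1" idea concrete, here is a minimal
sketch in plain Java. It does not use Lucene's API: the Tok class and the
flattenPositions and shingles methods are hypothetical stand-ins that model
only a term and its position increment. The input is the example sentence from
the issue description quoted below, after stop-word removal and stemming.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {
    /** Hypothetical token model: a term plus its position increment. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** The proposed compromise: reset every position increment to 1. */
    static List<Tok> flattenPositions(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            out.add(new Tok(t.term, 1));
        }
        return out;
    }

    /** Fixed-size shingles; each position gap becomes one "_" filler slot. */
    static List<String> shingles(List<Tok> in, int size) {
        List<String> slots = new ArrayList<>();
        for (Tok t : in) {
            for (int i = 1; i < t.posInc; i++) {
                slots.add("_");
            }
            slots.add(t.term);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= slots.size(); i++) {
            out.add(String.join(" ", slots.subList(i, i + size)));
        }
        return out;
    }

    /** "A brown fox quickly jumps over the lazy dog" after stop-word removal. */
    static List<Tok> example() {
        return Arrays.asList(
            new Tok("brown", 2),   // gap left by "A"
            new Tok("fox", 1),
            new Tok("quickly", 1),
            new Tok("jump", 1),
            new Tok("lazy", 3),    // gaps left by "over" and "the"
            new Tok("dog", 1));
    }

    public static void main(String[] args) {
        // With gaps preserved, fillers occupy shingle slots.
        shingles(example(), 3).forEach(System.out::println);
        System.out.println("--");
        // With increments flattened, no fillers are emitted.
        shingles(flattenPositions(example()), 3).forEach(System.out::println);
    }
}
```

In Lucene itself, such a step would presumably be a TokenFilter that rewrites
PositionIncrementAttribute, placed after StopFilter and before ShingleFilter;
the sketch above only models the effect on the shingles.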
> ShingleFilter should have an option to skip filler tokens (e.g. stop words)
> ---------------------------------------------------------------------------
>
> Key: LUCENE-8036
> URL: https://issues.apache.org/jira/browse/LUCENE-8036
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 7.1
> Reporter: Edans Sandes
> Priority: Trivial
> Labels: ShingleFilter, StopFilter, StopWords
> Attachments: SOLR-11604.patch
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> ShingleFilterFactory should have an option to ignore filler tokens in the
> total shingle size.
> For instance (adapted from
> [https://stackoverflow.com/questions/33193144/solr-stemming-stop-words-and-shingles-not-giving-expected-outputs]),
> consider the text "A brown fox quickly jumps over the lazy dog". When we
> remove stopwords and execute the ShingleFilter (shingle size = 3), it gives
> us the following result:
> 1. _ brown fox
> 2. brown fox quickly
> 3. fox quickly jump
> 4. quickly jump _
> 5. jump _ _
> 6. _ _ lazy
> 7. _ lazy dog
> We can clearly see that the filler token "_" occupies one token in the
> shingle.
> I suppose the returned shingles should be:
> 1. brown fox quickly
> 2. fox quickly jump
> 3. quickly jump lazy
> 4. jump lazy dog
> To maintain backward compatibility, I suggest the creation of an option
> called "skipFillerTokens" to implement this behavior (note that this is
> different from using fillerTokens="", since the empty string occupies one
> token in the shingle).
> I've attached a patch for the ShingleFilter class (getNextToken() method),
> ShingleFilterFactory and ShingleFilterTest classes.