[
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373703#comment-16373703
]
Steve Rowe commented on LUCENE-4065:
------------------------------------
Based on [Jim Ferenczi's comment on
SOLR-11968|https://issues.apache.org/jira/browse/SOLR-11968?focusedCommentId=16373554&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16373554],
I created a failing test for StopFilter that shows that StopFilter can (still)
corrupt the token stream - the failure message says that "walking" gets a
posinc of 1 instead of 2, which means that the only way to interpret the "twd"
token's poslen of 3 is as a trailing gap, which is misplaced:
{code:java|title=TestStopFilterFactory.java}
public void testLeadingStopwordSynonymGraph() throws Exception {
  // Map "twd" to the multi-token synonym "the walking dead", keeping the original token.
  SynonymMap.Builder builder = new SynonymMap.Builder(true);
  builder.add(new CharsRef("twd"), new CharsRef("the\u0000walking\u0000dead"), true);
  final SynonymMap synonymMap = builder.build();
  Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      MockTokenizer tokenizer = new MockTokenizer();
      TokenStream stream = new SynonymGraphFilter(tokenizer, synonymMap, true);
      // Remove the leading stopword "the" that the synonym expansion introduced.
      stream = new StopFilter(stream, CharArraySet.copy(Collections.singleton("the")));
      return new TokenStreamComponents(tokenizer, stream);
    }
  };
  TokenStream tokenStream = analyzer.tokenStream("field", "twd");
  assertTokenStreamContents(tokenStream,
      new String[] { "twd", "walking", "dead" },
      null, null,
      new int[] { 1, 2, 1 }, // posinc
      new int[] { 3, 1, 1 }, // poslen
      null);
}
{code}
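As a reading aid for the posinc/poslen arrays in the test, here is a minimal plain-Java sketch (no Lucene dependency) that turns position increments and position lengths into start/end positions. The class name {{PositionSketch}} and the {{spans}} helper are hypothetical. The second call assumes that only the posinc of "walking" differs in the failing run, since the failure message mentions nothing else.

```java
import java.util.Arrays;

// Illustrative helper only; not part of Lucene or of the test above.
final class PositionSketch {

    // For each token returns {start, end}, where
    // start = (sum of position increments so far) - 1 and end = start + position length.
    static int[][] spans(int[] posIncs, int[] posLens) {
        int[][] out = new int[posIncs.length][];
        int pos = -1; // positions are counted from -1, so a first posinc of 1 lands on 0
        for (int i = 0; i < posIncs.length; i++) {
            pos += posIncs[i];
            out[i] = new int[] { pos, pos + posLens[i] };
        }
        return out;
    }

    public static void main(String[] args) {
        // Values asserted by the test: twd, walking, dead.
        System.out.println(Arrays.deepToString(
            spans(new int[] { 1, 2, 1 }, new int[] { 3, 1, 1 })));
        // Assumed values for the failing run: only the posinc of "walking" changes to 1.
        System.out.println(Arrays.deepToString(
            spans(new int[] { 1, 1, 1 }, new int[] { 3, 1, 1 })));
    }
}
```

Each printed pair is the start and end position of one token, in the order twd, walking, dead.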
> FilteringTokenFilter should never corrupt the tokenstream graph
> ---------------------------------------------------------------
>
> Key: LUCENE-4065
> URL: https://issues.apache.org/jira/browse/LUCENE-4065
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Reporter: Robert Muir
> Priority: Major
> Attachments: LUCENE-4065_test.patch
>
>
> Currently, removers like StopFilter have an option (true/false) to enable
> position increments.
> If it's true: it both inserts gaps where necessary AND propagates gaps down
> the stream.
> If it's false: it does neither, which can totally mess up the tokenstream
> graph (e.g. move synonyms to another word).
> There are totally valid natural use cases for false, where you don't want gaps
> because you want phrase queries to act as if the word was never actually there.
> But 'not inserting gaps' is separate from proper propagation of existing gaps.
> So I think we should provide an option (either fix 'false' or make it an
> enum), where you still get a legit tokenstream and don't totally screw it up,
> but you simply omit gaps.
> See LUCENE-3848 for more information (where we at least fixed this case to
> not begin the tokenstream with posinc=0).
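
To make the two modes in the description concrete, here is a minimal plain-Java model of a token-removing filter. It is a sketch and not Lucene code: the {{Token}} class, the method names, and the sample stream (with "teh" stacked on the stopword "the" at posinc 0) are all hypothetical. It does not attempt the proposed third behaviour of omitting gaps while keeping the graph intact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Plain-Java model of the two behaviours described above; names are illustrative.
final class GapPropagationSketch {

    // Hypothetical stand-in for a token: its term text and position increment.
    static final class Token {
        final String term;
        final int posInc;
        Token(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "(+" + posInc + ")"; }
    }

    // Option "true": increments of removed tokens are added to the next kept token.
    static List<Token> removeKeepingGaps(List<Token> in, Set<String> stop) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (Token t : in) {
            if (stop.contains(t.term)) {
                skipped += t.posInc;
            } else {
                out.add(new Token(t.term, t.posInc + skipped));
                skipped = 0;
            }
        }
        return out;
    }

    // Option "false": removed tokens simply vanish and kept tokens keep their own
    // increment, except that a first token with posinc=0 is bumped to 1.
    static List<Token> removeDroppingIncrements(List<Token> in, Set<String> stop) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (!stop.contains(t.term)) {
                int inc = (out.isEmpty() && t.posInc == 0) ? 1 : t.posInc;
                out.add(new Token(t.term, inc));
            }
        }
        return out;
    }

    // Sample stream: "teh" shares a position with the stopword "the".
    static List<Token> sample() {
        return Arrays.asList(new Token("quick", 1), new Token("the", 1),
                             new Token("teh", 0), new Token("fox", 1));
    }

    public static void main(String[] args) {
        Set<String> stop = Collections.singleton("the");
        // "teh" stays at the position that "the" occupied.
        System.out.println(removeKeepingGaps(sample(), stop));        // [quick(+1), teh(+1), fox(+1)]
        // "teh" is now stacked on "quick", i.e. the synonym moved to another word.
        System.out.println(removeDroppingIncrements(sample(), stop)); // [quick(+1), teh(+0), fox(+1)]
    }
}
```

In the second output "teh" ends up at the same position as "quick", which is the kind of graph corruption the description warns about.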