Steve Rowe commented on LUCENE-4065:

Based on [Jim Ferenczi's comment on 
I created a failing test for StopFilter showing that it can (still) corrupt 
the token stream. The failure message says that "walking" gets a posinc of 1 
instead of 2, which means that the only way to interpret the "twd" token's 
poslen of 3 is as a trailing gap, which is misplaced:

  public void testLeadingStopwordSynonymGraph() throws Exception {
    SynonymMap.Builder builder = new SynonymMap.Builder(true);
    builder.add(new CharsRef("twd"), new CharsRef("the\u0000walking\u0000dead"), true);
    final SynonymMap synonymMap = builder.build();

    Analyzer analyzer = new Analyzer() {
      protected TokenStreamComponents createComponents(String fieldName) {
        MockTokenizer tokenizer = new MockTokenizer();
        TokenStream stream = new SynonymGraphFilter(tokenizer, synonymMap, 
        stream = new StopFilter(stream, 
        return new TokenStreamComponents(tokenizer, stream);
      }
    };

    TokenStream tokenStream = analyzer.tokenStream("field", "twd");
        new String[] { "twd", "walking", "dead" },
        null, null, 
        new int[]    { 1,     2,         1      },  // posinc
        new int[]    { 3,     1,         1      },  // poslen

> FilteringTokenFilter should never corrupt the tokenstream graph
> ---------------------------------------------------------------
>                 Key: LUCENE-4065
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4065
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Robert Muir
>            Priority: Major
>         Attachments: LUCENE-4065_test.patch
> Currently removers like StopFilter have an option (true/false) to enable 
> position increments.
> If it's true: it both inserts gaps where necessary AND propagates gaps down 
> the stream.
> If it's false: it does neither, which can totally mess up the tokenstream 
> graph (e.g. move synonyms to another word).
> There are totally valid natural use cases for false, where you don't want gaps 
> because you want phrase queries to act as if the word was never actually there.
> But 'not inserting gaps' is separate from proper propagation of existing gaps.
> So I think we should provide an option (either fix 'false' or make it an 
> enum), where you still get a legit tokenstream and don't totally screw it up, 
> but you simply omit gaps.
> See LUCENE-3848 for more information (where we at least fixed this case to 
> not begin the tokenstream with posinc=0).
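
To make the distinction in the quoted description concrete, here is a small plain-Java model of a token remover. It is only a sketch under stated assumptions: it uses no Lucene classes, the Tok class and the three mode names are hypothetical, and the third mode is just one possible reading of "omit gaps but keep a legit tokenstream", not the behavior of any committed patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model only: no Lucene classes are used, and the mode names are hypothetical.
class GapModes {
    enum Mode { INSERT_GAPS, IGNORE_REMOVED, OMIT_GAPS_KEEP_GRAPH }

    /** Minimal token: term text plus position increment. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "/" + posInc; }
    }

    static List<Tok> filter(List<Tok> in, Set<String> stop, Mode mode) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0; // sum of posInc of tokens removed since the last kept token
        for (Tok t : in) {
            if (stop.contains(t.term)) {
                skipped += t.posInc;
                continue;
            }
            int posInc = t.posInc;
            if (mode == Mode.INSERT_GAPS) {
                // Carry the removed tokens' increments forward, leaving a gap.
                posInc += skipped;
            } else if (mode == Mode.OMIT_GAPS_KEEP_GRAPH && posInc == 0 && skipped > 0) {
                // Assumed rule: a token that was stacked on a removed token
                // must not end up stacked on the previous kept token.
                posInc = 1;
            }
            // IGNORE_REMOVED passes the increment through unchanged.
            out.add(new Tok(t.term, posInc));
            skipped = 0;
        }
        return out;
    }

    /** Runs the filter on a fixed sample where "an" is stacked on the stopword "a". */
    static String demo(Mode mode) {
        List<Tok> in = Arrays.asList(
            new Tok("quick", 1), new Tok("a", 1), new Tok("an", 0),
            new Tok("of", 1), new Tok("fox", 1));
        Set<String> stop = new HashSet<>(Arrays.asList("a", "of"));
        return filter(in, stop, mode).toString();
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            System.out.println(m + " " + demo(m));
        }
    }
}
```

On this sample, INSERT_GAPS gives [quick/1, an/1, fox/2], IGNORE_REMOVED gives [quick/1, an/0, fox/1], and OMIT_GAPS_KEEP_GRAPH gives [quick/1, an/1, fox/1]. The middle result is the kind of corruption described above: "an" ends up stacked on "quick", a word it was never a synonym of.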

This message was sent by Atlassian JIRA
