[ https://issues.apache.org/jira/browse/LUCENE-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicolás Lichtmaier updated LUCENE-8723:
---------------------------------------
    Description:
I was debugging an issue (missing tokens after analysis), and when I enabled Java assertions I uncovered a bug when using WordDelimiterGraphFilter + StopFilter + FlattenGraphFilter.

I could reproduce the issue in a small piece of code. This code gives an assertion failure when assertions are enabled (-ea java option):

{code:java}
// Chain: StandardTokenizer -> WordDelimiterGraphFilter -> StopFilter -> FlattenGraphFilter
Builder builder = CustomAnalyzer.builder();
builder.withTokenizer(StandardTokenizerFactory.class);
builder.addTokenFilter(WordDelimiterGraphFilterFactory.class, "preserveOriginal", "1");
builder.addTokenFilter(StopFilterFactory.class);
builder.addTokenFilter(FlattenGraphFilterFactory.class);
Analyzer analyzer = builder.build();

TokenStream ts = analyzer.tokenStream("*", new StringReader("x7in"));
ts.reset();
// Consume the stream; the assertion fires inside incrementToken()
while(ts.incrementToken())
    ;
{code}

This gives:

{code}
Exception in thread "main" java.lang.AssertionError: 2
    at org.apache.lucene.analysis.core.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:195)
    at org.apache.lucene.analysis.core.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:258)
    at com.wolfram.textsearch.AnalyzerError.main(AnalyzerError.java:32)
{code}

Maybe removing stop words after WordDelimiterGraphFilter is wrong, I don't know. However, it is the only way to process stop words generated by that filter. In any case, it should not eat tokens or trigger assertion failures.

> Bad interaction between WordDelimiterGraphFilter, StopFilter and FlattenGraphFilter
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-8723
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8723
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 7.7.1
>            Reporter: Nicolás Lichtmaier
>            Priority: Major
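
For readers who want to picture what happens when a stop word is removed from a token graph, here is a minimal sketch in plain Java. It does not call Lucene and it is not the FlattenGraphFilter logic. The `Tok` class, the token list assumed for "x7in" (original token plus the parts "x", "7", "in"), and the choice of "in" as the removed stop word are all illustrative assumptions, not taken from the report. The sketch only checks whether some node in the graph is left without an outgoing token once the stop word is dropped. Whether this is what trips the assertion is not established by the report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Stdlib-only model of a token graph; hypothetical, not Lucene code. */
class TokenGraphSketch {

    /** A token with its position increment and position length (hypothetical model type). */
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Assumed result of splitting "x7in" while preserving the original token. */
    static List<Tok> sampleTokens() {
        List<Tok> toks = new ArrayList<>();
        toks.add(new Tok("x7in", 1, 3)); // original token spans all three parts
        toks.add(new Tok("x", 0, 1));    // starts at the same position as the original
        toks.add(new Tok("7", 1, 1));
        toks.add(new Tok("in", 1, 1));
        return toks;
    }

    /** Mimics stop-word removal: drop the term and carry its increment to the next kept token. */
    static List<Tok> dropTerm(List<Tok> in, String stopWord) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (Tok t : in) {
            if (t.term.equals(stopWord)) {
                skipped += t.posInc;
            } else {
                out.add(new Tok(t.term, t.posInc + skipped, t.posLen));
                skipped = 0;
            }
        }
        return out;
    }

    /** Returns nodes where some token ends but none starts, excluding the final node. */
    static List<Integer> deadEnds(List<Tok> toks) {
        TreeSet<Integer> starts = new TreeSet<>();
        TreeSet<Integer> ends = new TreeSet<>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posInc;          // node the token leaves from
            starts.add(pos);
            ends.add(pos + t.posLen); // node the token arrives at
        }
        List<Integer> dead = new ArrayList<>();
        if (ends.isEmpty()) {
            return dead;
        }
        int last = ends.last();
        for (int node : ends) {
            if (node != last && !starts.contains(node)) {
                dead.add(node);
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        List<Tok> full = sampleTokens();
        System.out.println("dead ends before removal: " + deadEnds(full));
        System.out.println("dead ends after removal:  " + deadEnds(dropTerm(full, "in")));
    }
}
```

In this model the full graph has no dead ends, while dropping "in" leaves the original token "x7in" still pointing at the last node and the part "7" ending on a node from which nothing continues.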