Hello - I noticed something peculiar running Lucene/Solr 6.3.0. The plural vaccinatieprogramma's should keep a startOffset of 0 and an endOffset of 21 when passed through WordDelimiterFilter and/or stemmers, but it doesn't, which slightly messes up highlighted terms.
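To make the offsets concrete, here is a small plain-Java sketch. It does not use Lucene at all: the class OffsetSketch and its highlight helper are made up for illustration, and it assumes a highlighter simply wraps the [startOffset, endOffset) range of the original text.

```java
class OffsetSketch {
    // Hypothetical helper: wrap text[start, end) in <em> tags,
    // roughly what a highlighter does with a token's offsets.
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
            + "<em>" + text.substring(start, end) + "</em>"
            + text.substring(end);
    }

    public static void main(String[] args) {
        String original = "vaccinatieprogramma's";

        // The original term is 21 characters long; without the "'s" it is 19.
        System.out.println(original.length()); // 21

        // endOffset 19 (what the test reports): the "'s" is left outside the highlight.
        System.out.println(highlight(original, 0, 19)); // <em>vaccinatieprogramma</em>'s

        // endOffset 21 (what I expected): the whole original term is highlighted.
        System.out.println(highlight(original, 0, 21)); // <em>vaccinatieprogramma's</em>
    }
}
```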

    wdf = new WordDelimiterFilter(
        new CannedTokenStream(new Token("vaccinatieprogramma's", 0, 21)),
        DEFAULT_WORD_DELIM_TABLE, flags, null);
    assertTokenStreamContents(wdf,
        new String[] { "vaccinatieprogramma" },
        new int[] { 0 },
        new int[] { 21 });

    [junit4] Suite: org.apache.lucene.analysis.miscellaneous.TestWordDelimiterFilter
    [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=TestWordDelimiterFilter -Dtests.method=testOffsets -Dtests.seed=21AB10650E10CEB9 -Dtests.slow=true -Dtests.locale=bg-BG -Dtests.timezone=Etc/GMT+10 -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
    [junit4] FAILURE 0.06s | TestWordDelimiterFilter.testOffsets <<<
    [junit4]    > Throwable #1: java.lang.AssertionError: endOffset 0 expected:<21> but was:<19>

I would expect the same behaviour as with stemmers, where the offsets always cover the full length of the original term. That way, if a user queries for the singular term, the whole plural (the original term) is highlighted.

Am I missing something, or is this a bug?

Thanks,
Markus