You provided a list of TokenFilters that you use in your Analyzer, but you didn't mention anything about what Tokenizer you are using.
You also mentioned seeing a difference in the "tokenization result", and the example output you gave does in fact seem to be the output of the tokenizer -- not the output of the TokenFilters you mentioned -- since ShingleFilter would be producing more output tokens than you listed.  All of which suggests that the discrepancy you are seeing is in your tokenizer.

Generally speaking: the best way to ensure folks on the mailing list can make sense of your situation and offer assistance is if you can provide reproducible snippets of code with hardcoded input (a la unit tests) that demonstrate what you're seeing.  (See the p.s. at the bottom of this mail for a bare-bones sketch of that kind of snippet.)

: Our current code is based on Lucene7.
: In some analyzer testcase, give a string "Google's biologist’s", the
: tokenization result is, ["google", "biologist"]
:
: But after I migrating the codebase to Lucene9,
: the result becomes, ["googles", "biologist’s"]
:
: The analyzer uses the following three Lucene libraries:
:
: org.apache.lucene.analysis.core.FlattenGraphFilter;
: org.apache.lucene.analysis.shingle.ShingleFilter;
: org.apache.lucene.analysis.synonym.SynonymGraphFilter;

-Hoss
http://www.lucidworks.com/
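
p.s. Purely as an illustration of the kind of snippet that helps -- this is NOT your actual code: it assumes StandardTokenizer and leaves out the SynonymGraphFilter / FlattenGraphFilter / ShingleFilter chain, since those details weren't provided -- something along these lines, with your real Tokenizer + TokenFilter setup and your real hardcoded input, makes the difference trivial for anyone to reproduce against both Lucene 7 and Lucene 9:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerRepro {
  public static void main(String[] args) throws IOException {
    // Minimal analyzer: swap in the exact Tokenizer + TokenFilter
    // chain your real Analyzer builds (this sketch stops at the
    // tokenizer on purpose, since that's where the difference seems to be).
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        return new TokenStreamComponents(source);
      }
    };

    // Hardcoded input copied from the original mail (note the mix of a
    // straight apostrophe and a curly one); print every term produced.
    try (TokenStream ts = analyzer.tokenStream("f", "Google's biologist’s")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}

If you have the lucene test-framework on your classpath, BaseTokenStreamTestCase.assertAnalyzesTo(analyzer, "Google's biologist’s", new String[] { ... }) turns the same check into a proper unit test assertion you can run against both versions.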