You provided a list of TokenFilters that you use in your Analyzer, but you didn't mention anything about what Tokenizer you are using.
You also mentioned seeing a difference in the "tokenization result", and the example output you gave does in fact seem to be the output of the tokenizer -- not the output of the TokenFilters you mentioned -- since ShingleFilter would be producing more output tokens than you listed.  All of which suggests that the discrepancy you are seeing is in your tokenizer.

Generally speaking: the best way to ensure folks on the mailing list can make sense of your situation and offer assistance is if you can provide reproducible snippets of code with hardcoded input (a la unit tests) that demonstrate what you're seeing.  (See the p.s. at the bottom of this mail for a bare-bones sketch of that kind of snippet.)

: Our current code is based on Lucene7.
: In some analyzer testcase, give a string "Google's biologist’s", the
: tokenization result is, ["google", "biologist"]
:
: But after I migrating the codebase to Lucene9,
: the result becomes, ["googles", "biologist’s"]
:
: The analyzer uses the following three Lucene libraries:
:
: org.apache.lucene.analysis.core.FlattenGraphFilter;
: org.apache.lucene.analysis.shingle.ShingleFilter;
: org.apache.lucene.analysis.synonym.SynonymGraphFilter;

-Hoss
http://www.lucidworks.com/
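
p.s. Purely as an illustration of the kind of snippet that helps -- this is NOT your actual code: it assumes StandardTokenizer and leaves out the SynonymGraphFilter / FlattenGraphFilter / ShingleFilter chain, since those details weren't provided -- something along these lines, with your real Tokenizer + TokenFilter setup and your real hardcoded input, makes the difference trivial for anyone to reproduce against both Lucene 7 and Lucene 9:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerRepro {
  public static void main(String[] args) throws IOException {
    // Minimal analyzer: swap in the exact Tokenizer + TokenFilter
    // chain your real Analyzer builds (this sketch stops at the
    // tokenizer on purpose, since that's where the difference seems to be).
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        return new TokenStreamComponents(source);
      }
    };

    // Hardcoded input copied from the original mail (note the mix of a
    // straight apostrophe and a curly one); print every term produced.
    try (TokenStream ts = analyzer.tokenStream("f", "Google's biologist’s")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}

If you have the lucene test-framework on your classpath, BaseTokenStreamTestCase.assertAnalyzesTo(analyzer, "Google's biologist’s", new String[] { ... }) turns the same check into a proper unit test assertion you can run against both versions.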