Thanks to Chris and Patrick for their help.

Chris is right. I don't have much knowledge of the analyzers, so I missed
many details.
Following Chris's advice, I dug deeper into the related tokenizer code and
fixed the problem.
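For anyone hitting something similar: the quickest way I found to see what the tokenizer actually emits is to dump the token stream directly. A minimal sketch (StandardAnalyzer and the field name here are stand-ins, not our real analysis chain; assumes Lucene 9.x on the classpath):

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DumpTokens {

    // Print every token the Analyzer emits for the given text, so the
    // output can be compared side by side across Lucene versions.
    static void dumpTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("field", new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }

    public static void main(String[] args) throws IOException {
        // Swap in your own Analyzer subclass to inspect its tokenizer output.
        dumpTokens(new StandardAnalyzer(), "Google's biologist's");
    }
}
```

Running this against the same hardcoded input on both Lucene versions made the tokenizer difference obvious.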

Thanks again for the help.

On Fri, Apr 28, 2023 at 4:44 PM Chris Hostetter <hossman_luc...@fucit.org>
wrote:

>
> You provided a list of TokenFilters that you use in your Analyzer,
> but you didn't mention anything about what Tokenizer you are using.
>
> You also mentioned seeing a difference in the "tokenization result" and
> the example output you gave does in fact seem to be the output of the
> tokenizer -- not the output of the TokenFilters you mentioned -- since
> ShingleFilter would be producing more output tokens than you listed.
>
> All of which suggests that the discrepancy you are seeing is in your
> tokenizer.
>
> Generally speaking: the best way to ensure folks on the mailing list can
> make sense of your situation and offer assistance is if you can provide
> reproducible snippets of code w/hardcoded input (ala unit tests) that
> demonstrate what you're seeing.
>
> : Our current code is based on Lucene7.
> : In some analyzer testcase, give a string "*Google's biologist’s*", the
> : tokenization result is, *["google", "biologist"]*
> :
> : But after migrating the codebase to Lucene9,
> : the result becomes, *["googles", "**biologist’s**"]*
>
>
> : The analyzer uses the following three Lucene libraries:
> :
> : org.apache.lucene.analysis.core.FlattenGraphFilter;
> :
> : org.apache.lucene.analysis.shingle.ShingleFilter;
> :
> : org.apache.lucene.analysis.synonym.SynonymGraphFilter;
>
>
> -Hoss
> http://www.lucidworks.com/
>
