Thanks Mikhail! I have tried all the other tokenizers from Lucene 4.4. In the case of WhitespaceTokenizer, it loses the handling of special chars like '-', etc.
On Wed, Sep 20, 2023 at 16:39 Mikhail Khludnev <m...@apache.org> wrote:

> Hello,
> Check the whitespace tokenizer.
>
> On Wed, Sep 20, 2023 at 7:46 PM Amitesh Kumar <amiteshk...@gmail.com> wrote:
>
> > Hi,
> >
> > I am facing a requirement change to get the % sign retained in searches, e.g.:
> >
> > Sample search docs:
> > 1. Number of boys 50
> > 2. My score was 50%
> > 3. 40-50% for pass score
> >
> > Search query: 50%
> > Expected results: Doc-2 and Doc-3, i.e.:
> > 2. My score was 50%
> > 3. 40-50% for pass score
> >
> > Actual result: all 3 documents (because the tokenizer strips off the % both
> > during indexing and searching, and hence matches every doc with 50 in it).
> >
> > On the implementation front, I am using a set of filters like
> > LowerCaseFilter and EnglishPossessiveFilter in addition to the base
> > tokenizer, StandardTokenizer.
> >
> > Per my analysis, StandardTokenizer strips off the % sign, hence the
> > behavior. Has someone faced a similar requirement? Any help/guidance is
> > highly appreciated.
>
> --
> Sincerely yours
> Mikhail Khludnev
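[Editor's note: a rough, stdlib-only sketch of the tradeoff discussed above. This is not actual Lucene code; the class and method names are hypothetical stand-ins. It assumes StandardTokenizer-style tokenization keeps only runs of letters/digits (dropping '%' and splitting on '-'), while whitespace-style tokenization splits on whitespace only, so "50%" and "40-50%" survive as single tokens.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerSketch {

    // Rough stand-in for StandardTokenizer-style behavior:
    // keep only runs of letters/digits, so "50%" becomes "50"
    // and "40-50%" becomes "40" and "50".
    static List<String> standardLike(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Rough stand-in for WhitespaceTokenizer-style behavior:
    // split on whitespace only, so "50%" and "40-50%" survive
    // intact, but '-' is no longer a token boundary.
    static List<String> whitespaceLike(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Query "50%" reduces to "50" and matches all three docs.
        System.out.println(standardLike("My score was 50%"));
        // Whitespace splitting keeps the % sign on the token.
        System.out.println(whitespaceLike("My score was 50%"));
        // ...but "40-50%" stays one token, which is the '-' tradeoff
        // mentioned in the reply at the top of the thread.
        System.out.println(whitespaceLike("40-50% for pass score"));
    }
}
```

This illustrates why the choice is not just StandardTokenizer vs. WhitespaceTokenizer: retaining '%' while still splitting on '-' needs something like a pattern-based tokenizer or an additional token filter, rather than either tokenizer alone.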