Thanks Mikhail! I have tried all the other tokenizers from Lucene 4.4. In the case of WhitespaceTokenizer, it loses the handling of special chars like '-', etc.
On Wed, Sep 20, 2023 at 16:39 Mikhail Khludnev <m...@apache.org> wrote:

> Hello,
> Check the whitespace tokenizer.
>
> On Wed, Sep 20, 2023 at 7:46 PM Amitesh Kumar <amiteshk...@gmail.com> wrote:
>
> > Hi,
> >
> > I am facing a requirement change to get the % sign retained in searches, e.g.:
> >
> > Sample search docs:
> > 1. Number of boys 50
> > 2. My score was 50%
> > 3. 40-50% for pass score
> >
> > Search query: 50%
> > Expected results: Doc-2 and Doc-3, i.e.:
> > 2. My score was 50%
> > 3. 40-50% for pass score
> >
> > Actual result: all 3 documents (because the tokenizer strips off the % both
> > during indexing and searching, and hence matches every doc with 50 in it).
> >
> > On the implementation front, I am using a set of filters like
> > LowerCaseFilter and EnglishPossessiveFilter in addition to the base
> > tokenizer, StandardTokenizer.
> >
> > Per my analysis, StandardTokenizer strips off the % sign, hence the
> > behavior. Has someone faced a similar requirement? Any help/guidance is
> > highly appreciated.
>
> --
> Sincerely yours
> Mikhail Khludnev
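[Editor's note: a rough, stdlib-only sketch of the tradeoff discussed above. This is not actual Lucene code; the class and method names are hypothetical stand-ins. It assumes StandardTokenizer-style tokenization keeps only runs of letters/digits (dropping '%' and splitting on '-'), while whitespace-style tokenization splits on whitespace only, so "50%" and "40-50%" survive as single tokens.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerSketch {

    // Rough stand-in for StandardTokenizer-style behavior:
    // keep only runs of letters/digits, so "50%" becomes "50"
    // and "40-50%" becomes "40" and "50".
    static List<String> standardLike(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Rough stand-in for WhitespaceTokenizer-style behavior:
    // split on whitespace only, so "50%" and "40-50%" survive
    // intact, but '-' is no longer a token boundary.
    static List<String> whitespaceLike(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Query "50%" reduces to "50" and matches all three docs.
        System.out.println(standardLike("My score was 50%"));
        // Whitespace splitting keeps the % sign on the token.
        System.out.println(whitespaceLike("My score was 50%"));
        // ...but "40-50%" stays one token, which is the '-' tradeoff
        // mentioned in the reply at the top of the thread.
        System.out.println(whitespaceLike("40-50% for pass score"));
    }
}
```

This illustrates why the choice is not just StandardTokenizer vs. WhitespaceTokenizer: retaining '%' while still splitting on '-' needs something like a pattern-based tokenizer or an additional token filter, rather than either tokenizer alone.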