Thank you! I will give it a try and share my findings with you all.

Regards,
Amitesh
On Thu, Sep 21, 2023 at 08:18 Uwe Schindler <u...@thetaphi.de> wrote:

> The problem with WhitespaceTokenizer is that it splits only on
> whitespace. If you have text like "This is, was some test." then you get
> tokens like "is," and "test." including the punctuation.
>
> This is the reason why StandardTokenizer is normally used for
> human-readable text. WhitespaceTokenizer is normally only used for special
> stuff like token lists (like tags) or unique identifiers,...
>
> As a quick workaround while still keeping the %, you can add a CharFilter
> like MappingCharFilter before the Tokenizer that replaces the "%" char
> with something else which is not stripped off. As this is done for both
> indexing and searching, this does not hurt you. How about a "percent
> emoji"? :-)
>
> Another common "workaround" is also shown in some Solr default
> configurations typically used for product search: those use
> WhitespaceTokenizer, followed by WordDelimiterFilter. WDF is then able
> to remove accents and handle stuff like product numbers correctly. There
> you can possibly make sure that "%" survives.
>
> Uwe
>
> On 20.09.2023 at 22:42, Amitesh Kumar wrote:
> > Thanks Mikhail!
> >
> > I have tried all other tokenizers from Lucene 4.4. In the case of
> > WhitespaceTokenizer, it loses the normalizing of special chars like "-" etc.
> >
> > On Wed, Sep 20, 2023 at 16:39 Mikhail Khludnev <m...@apache.org> wrote:
> >
> >> Hello,
> >> Check the whitespace tokenizer.
> >>
> >> On Wed, Sep 20, 2023 at 7:46 PM Amitesh Kumar <amiteshk...@gmail.com>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> I am facing a requirement change to get the % sign retained in searches, e.g.
> >>>
> >>> Sample search docs:
> >>> 1. Number of boys 50
> >>> 2. My score was 50%
> >>> 3. 40-50% for pass score
> >>>
> >>> Search query: 50%
> >>> Expected results: Doc-2, Doc-3, i.e.
> >>> My score was 50%
> >>> 40-50% for pass score
> >>>
> >>> Actual result: all 3 documents (because the tokenizer strips off the %
> >>> both during indexing and searching, and hence matches all docs with 50
> >>> in them).
> >>>
> >>> On the implementation front, I am using a set of filters like
> >>> LowerCaseFilter, EnglishPossessiveFilter etc. in addition to the base
> >>> tokenizer, StandardTokenizer.
> >>>
> >>> Per my analysis, StandardTokenizer strips off the % sign and hence the
> >>> behavior. Has someone faced a similar requirement? Any help/guidance is
> >>> highly appreciated.
> >>>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
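For anyone landing on this thread later, here is a minimal sketch of the MappingCharFilter workaround Uwe describes, written against a recent Lucene API (on Lucene 4.x the StandardTokenizer and LowerCaseFilter constructors additionally take Version/Reader arguments, so the exact signatures differ). The analyzer name PercentPreservingAnalyzer and the "_pct_" placeholder are made up for illustration; any replacement string that StandardTokenizer keeps attached to the preceding digits should work.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class PercentPreservingAnalyzer extends Analyzer {

    // Rewrite "%" to a placeholder that StandardTokenizer will not strip.
    // "_pct_" is an arbitrary choice; "50%" becomes "50_pct_" and should
    // survive tokenization as a single token.
    private static final NormalizeCharMap PERCENT_MAP;
    static {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("%", "_pct_");
        PERCENT_MAP = builder.build();
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Char filters run before the tokenizer, during both indexing and
        // query analysis, so the same rewrite happens on both sides.
        return new MappingCharFilter(PERCENT_MAP, reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}

With this in place on both the index and query side, "My score was 50%" and "40-50% for pass score" produce the token 50_pct_ (the hyphenated "40-50%" still splits at the hyphen), the query "50%" analyzes to the same token, and "Number of boys 50" no longer matches. The WhitespaceTokenizer + WordDelimiterFilter route Uwe mentions is an alternative, but it needs a custom character-type table to mark "%" as token text, which is more involved than the char-filter approach shown here.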