Re: Best practice - preparing search term for Lucene

Michael Sokolov Fri, 23 Sep 2022 09:02:02 -0700

I think it depends on how precise you want to make the search. If you
want to enable diacritic-sensitive search, to avoid confusion when
users actually are able to enter the diacritics, you can index both
ways (ASCII-folded and not folded) and not normalize the query terms.
Or you can just fold everything and not worry about it. In French I
know there are confusable words like "cote", which has at least a few
different meanings depending on the accents. Not sure how it is in
Croatian.
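
To make the two options concrete, here is a rough sketch using only the JDK (no Lucene or Hibernate Search classes). The class and method names (CroatianFolding, fold, tokensBothWays) are made up for illustration, and it maps đ to d as in Hrvoje's example rather than to dj. Non-ASCII letters are written as \u escapes so the file compiles whatever the source encoding is. It does no stop-word removal. In a real analyzer chain, Lucene's ASCIIFoldingFilter would normally do the folding; this only shows the idea.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class CroatianFolding {

    // Fold diacritics to plain ASCII letters (c-caron -> c, s-caron -> s, ...).
    static String fold(String text) {
        // d-with-stroke (U+0111 / U+0110) has no Unicode decomposition,
        // so it is mapped by hand before normalizing.
        String s = text.replace('\u0111', 'd').replace('\u0110', 'D');
        // NFD splits e.g. U+010D into 'c' + combining caron (U+030C).
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Drop the combining marks, leaving the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // "Both ways": keep each lowercased original token and add its folded
    // form. The set removes duplicates when a token has no diacritics.
    static Set<String> tokensBothWays(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            tokens.add(word);
            tokens.add(fold(word));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "juha s cesnjakom i dumbirom", written with its diacritics
        String title = "juha s \u010De\u0161njakom i \u0111umbirom";
        System.out.println(fold(title));           // option 1: fold everything
        System.out.println(tokensBothWays(title)); // option 2: index both forms
    }
}
```

With option 1 the query has to go through fold() as well. With option 2 the raw query terms can be used as typed, since both the accented and the unaccented spelling are in the index.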


On Fri, Sep 23, 2022 at 5:30 AM Hrvoje Lončar <horv...@gmail.com> wrote:
>
> Hi Stephane!
>
> Actually, I have exactly that kind of conversion, but I didn't mention it as my
> mail was long enough without it :)
> My main concern is whether I should let Lucene index the original keywords or not.
> Considering what you wrote, I guess your answer would be to store only
> converted values without exotic characters.
>
> Thanks a lot for your reply!
>
> BR,
> Hrvoje
>
> On Thu, Sep 22, 2022 at 7:53 PM Stephane Passignat <passig...@hotmail.com> 
> wrote:
>>
>> Hello,
>>
>> The way I did it took me some time, and I'm almost sure it's applicable to all
>> languages.
>>
>> I normalized the words, replacing letters or groups of letters with a
>> similar-sounding one.
>>
>> In French, e é è ê ai ei sound a bit the same, and for someone who makes
>> spelling mistakes, having to use the right letters is very frustrating. So I
>> transformed all of them into e...
>>
>> Hope it helps
>>
>> On 22 Sep 2022, at 16:37, "Hrvoje Lončar" <horv...@gmail.com> wrote:
>>
>> Hi!
>>
>> I'm using Hibernate Search / Lucene to index my entities in a Spring Boot
>> application.
>>
>> One thing I'm not sure about is how to handle Croatian-specific letters.
>> The Croatian language has a few additional letters: "*č* *Č* *ć* *Ć* *đ* *Đ* *š*
>> *Š* *ž* *Ž*".
>> The letters "*đ* *Đ*" are commonly replaced with "*dj* *DJ*" when no Croatian
>> letters are available.
>>
>> In my custom Hibernate bridge there is a step that replaces all Croatian
>> characters with appropriate ASCII replacements, which means "*č*" becomes
>> "*c*", "*š*" becomes "*s*" and so on.
>> Later, when the user enters search text, the same process is applied so that
>> it matches the values in the index.
>> There is one more good thing about it: some older users, who used
>> computers in the early days when no Croatian letters were available,
>> type words without Croatian letters, automatically replacing "*č*" with
>> "*c*", and that fits my logic for getting good search results.
>>
>> For example, the title of my entity is: "*juha s češnjakom i đumbirom*".
>> My custom Hibernate String bridge converts it to "*juha cesnjakom dumbirom*".
>> Then the user enters "*juha s češnjakom*".
>> Before issuing a search, the same conversion is applied to the user's query,
>> and the text sent to Lucene is "*juha cesnjakom*".
>> This is how I implemented it, and it's working fine.
>>
>> The other way would be to index the original text, then find the words with
>> Croatian characters, convert them to ASCII and append them to the original.
>> The title "*juha s češnjakom i đumbirom*" would become "*juha češnjakom
>> đumbirom cesnjakom dumbirom*".
>> In that case there is no need to convert users' search terms, because
>> both "*juha s češnjakom*" and "*juha s cesnjakom*" would return the same result.
>>
>> My question is:
>> Is there any reason to switch to this alternative logic and have original
>> keywords indexed in parallel with those converted to ASCII?
>>
>> Thanks!
>>
>> BR,
>> Hrvoje
>
>
>
> --
> {{ Horvoje.net ~~ VegCook.net ~~ TheVegCat.com ~~ Cuspajz.com ~~ 
> VintageZagreb.net ~~ Sterilizacija.org ~~ SmijSe.com ~~ HTMLutil.net ~~ 
> HTTPinfo.net }}
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
