Re: Best practice - preparing search term for Lucene

Hrvoje Lončar Fri, 23 Sep 2022 02:29:59 -0700

Hi Stephane!

Actually, I have exactly that kind of conversion, but I didn't mention it as
my mail was long enough without it :)
My main concern is whether I should let Lucene index the original keywords
or not. Considering what you wrote, I guess your answer would be to store
only converted values without exotic characters.
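
To make the "converted values only" option concrete, here is a minimal
sketch in plain Java. It is not the actual bridge code from this thread:
the class name CroatianFolding and the method fold are made up for
illustration, and it assumes the same function is called both when indexing
and when preparing the user's query. Note that đ/Đ have no Unicode
decomposition, so they need an explicit mapping before the accents on
č, ć, š, ž are stripped.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for illustration only, not the bridge from this thread.
final class CroatianFolding {

    // Folds Croatian letters to ASCII and lowercases the result.
    // Unicode escapes keep this source ASCII-only, so it compiles the same
    // way regardless of the compiler's file encoding.
    static String fold(String text) {
        // U+0111 / U+0110 (d with stroke) have no canonical decomposition,
        // so NFD would leave them untouched; map them by hand.
        String s = text.replace('\u0111', 'd').replace('\u0110', 'D');
        // NFD splits c-caron, c-acute, s-caron, z-caron into base letter
        // plus a combining mark.
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Remove the combining marks (Unicode category Mn).
        s = s.replaceAll("\\p{Mn}+", "");
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "juha s cesnjakom i dumbirom" written with Croatian letters
        String title = "juha s \u010de\u0161njakom i \u0111umbirom";
        String query = "juha s cesnjakom";

        System.out.println(fold(title)); // juha s cesnjakom i dumbirom
        // Both sides go through fold(), so the accented title and the
        // plain-ASCII query end up with the same terms.
        System.out.println(fold(title).startsWith(fold(query))); // true
    }
}
```

The sketch maps đ to "d", as in the example in this thread. Users who type
"dj" for đ would need an extra rule, and the dropping of short words such
as "s" and "i" is not covered here.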


Thanks a lot for your reply!

BR,
Hrvoje

On Thu, Sep 22, 2022 at 7:53 PM Stephane Passignat <passig...@hotmail.com>
wrote:

> Hello,
>
> The way I did it took me some time, and I'm almost sure it's applicable
> to all languages.
>
> I normalized the words, replacing letters or groups of letters with a
> similar-sounding one.
>
> In French, e é è ê ai ei sound somewhat alike, and for someone who makes
> spelling mistakes, having to use the right letters is very frustrating.
> So I transformed all of them into e...
>
> Hope it helps
>
> On 22 Sept 2022, at 16:37, "Hrvoje Lončar" <horv...@gmail.com> wrote:
>
> Hi!
>
> I'm using Hibernate Search / Lucene to index my entities in a Spring Boot
> application.
>
> One thing I'm not sure about is how to handle Croatian-specific letters.
> The Croatian language has a few additional letters: "*č* *Č* *ć* *Ć* *đ*
> *Đ* *š* *Š* *ž* *Ž*".
> The letters "*đ* *Đ*" are commonly replaced with "*dj* *DJ*" when no
> Croatian letters are available.
>
> In my custom Hibernate bridge there is a step that replaces all Croatian
> characters with appropriate ASCII replacements, which means "*č*" becomes
> "*c*", "*š*" becomes "*s*" and so on.
> Later, when the user enters search text, the same process is applied so
> that the query matches values from the index.
> There is one more good thing about it: some older users, who used
> computers in the early days when no Croatian letters were available,
> type words without Croatian letters, automatically replacing "*č*" with
> "*c*", and that fits my logic and still gives good search results.
>
> For example, the title of my entity is: "*juha s češnjakom i đumbirom*".
> My custom Hibernate String bridge converts it to
> "*juha cesnjakom dumbirom*".
> Then the user enters "*juha s češnjakom*".
> Before issuing a search, the same conversion is applied to the user's
> query, and the text sent to Lucene is "*juha cesnjakom*".
> This is how I implemented it and it's working fine.
>
> The other way would be to index the original text and then find words
> with Croatian characters, convert them to ASCII and add them to the
> original.
> The title "*juha s češnjakom i đumbirom*" would become
> "*juha češnjakom đumbirom cesnjakom dumbirom*".
> In that case there is no need to convert users' search terms because
> both "*juha s češnjakom*" and "*juha s cesnjakom*" would return the same
> result.
>
> My question is:
> Is there any reason to switch to this alternative logic and have original
> keywords indexed in parallel with those converted to ASCII?
>
> Thanks!
>
> BR,
> Hrvoje
>
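
For comparison, the alternative from the quoted mail (original keywords
indexed in parallel with the ASCII ones) can be sketched the same way. This
is again plain Java with made-up names (DualTermIndexing, indexTerms), not
Hibernate Search or Lucene code, and splitting on whitespace is only a
stand-in for a real tokenizer. Lucene's ASCIIFoldingFilter has a
preserveOriginal constructor argument that is meant for this kind of setup,
so it may be worth checking before hand-rolling it.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: index every token together with its ASCII variant.
final class DualTermIndexing {

    // Hand-rolled folding; U+0111 / U+0110 (d with stroke) need an explicit
    // mapping because they have no canonical decomposition.
    static String fold(String token) {
        String s = token.replace('\u0111', 'd').replace('\u0110', 'D');
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        return s.replaceAll("\\p{Mn}+", "");
    }

    // Returns each lowercased token once, followed by its folded form when
    // that differs (the set silently drops the duplicate otherwise).
    static Set<String> indexTerms(String title) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : title.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue; // leading whitespace yields an empty first token
            }
            terms.add(token);
            terms.add(fold(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        // "juha s cesnjakom i dumbirom" written with Croatian letters
        Set<String> terms = indexTerms("juha s \u010de\u0161njakom i \u0111umbirom");
        // The unconverted query term and the ASCII one are both present,
        // so the query side needs no conversion.
        System.out.println(terms.contains("\u010de\u0161njakom")); // true
        System.out.println(terms.contains("cesnjakom"));           // true
    }
}
```

The trade-off in this sketch is a larger set of indexed terms in exchange
for keeping the exact spelling available, for example if exact matches
should later be ranked above folded ones. If that distinction is never
needed, indexing only the converted values is the simpler choice.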


