Hi Stephane!

Actually, I have exactly that kind of conversion, but I didn't mention it as my mail was long enough without it :)
My main concern is whether I should let Lucene index the original keywords or not.
Considering what you wrote, I guess your answer would be to store only the converted values, without exotic characters.
Thanks a lot for your reply!

BR,
Hrvoje

On Thu, Sep 22, 2022 at 7:53 PM Stephane Passignat <passig...@hotmail.com> wrote:

> Hello,
>
> The way I did it took me some time, and I am almost sure it's applicable to
> all languages.
>
> I normalized the words, replacing letters or groups of letters with a
> similar-sounding one.
>
> In French, e é è ê ai ei sound much the same, and for someone who makes
> spelling mistakes, having to use the right letters is very frustrating. So I
> transformed all of them into e...
>
> Hope it helps
>
> On Sep 22, 2022, at 16:37, "Hrvoje Lončar" <horv...@gmail.com> wrote:
>
> Hi!
>
> I'm using Hibernate Search / Lucene to index my entities in a Spring Boot
> application.
>
> One thing I'm not sure about is how to handle Croatian-specific letters.
> The Croatian language has a few additional letters: "č Č ć Ć đ Đ š Š ž Ž".
> The letters "đ Đ" are commonly replaced with "dj DJ" when no Croatian
> letters are available.
>
> In my custom Hibernate bridge there is a step that replaces all Croatian
> characters with appropriate ASCII replacements, which means "č" becomes
> "c", "š" becomes "s" and so on.
> Later, when the user enters search text, the same process is applied to
> match values from the index.
> There is one more good thing about it: some older users, who used computers
> in the early days when no Croatian letters were available, type words
> without Croatian letters, automatically replacing "č" with "c", and that
> fits my logic for getting good search results.
>
> For example, the title of my entity is: "juha s češnjakom u đumbirom".
> My custom Hibernate String bridge converts it to "juha cesnjakom dumbirom".
> Then the user enters "juha s češnjakom".
> Before issuing a search, the same conversion is applied to the user's query,
> and the text sent to Lucene is "juha cesnjakom".
> This is how I implemented it, and it's working fine.
>
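A minimal sketch of the character folding described in the quoted message, written as plain Java rather than as a Hibernate Search bridge or a Lucene analyzer. The names `CroatianFolding` and `fold` are hypothetical and do not come from the original code. It assumes that "đ" maps to "d", as in the "đumbirom" example, and it does not drop short words such as "s". The letter "đ" is handled explicitly because Unicode decomposition does not split it into "d" plus a combining mark.

```java
import java.text.Normalizer;

// Hypothetical helper, not the bridge from the original mail.
final class CroatianFolding {

    // Folds Croatian letters to ASCII while preserving case.
    static String fold(String text) {
        // U+0111 and U+0110 (d with stroke) have no canonical decomposition,
        // so they are replaced by hand. Assumption: map to d/D, not dj/DJ.
        String replaced = text.replace('\u0111', 'd').replace('\u0110', 'D');
        // NFD splits letters such as c-caron into a base letter plus a combining mark.
        String decomposed = Normalizer.normalize(replaced, Normalizer.Form.NFD);
        // Remove the combining marks, leaving only the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The argument is "juha s češnjakom", written with Unicode escapes.
        System.out.println(fold("juha s \u010De\u0161njakom")); // prints: juha s cesnjakom
    }
}
```

The same `fold` call would be applied both to the indexed text and to the user's query, so that the two sides meet on the ASCII form.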
> The other way would be to index the original text, then find the words with
> Croatian characters, convert them to ASCII and add them to the original.
> The title "juha s češnjakom i đumbirom" would become "juha češnjakom
> đumbirom cesnjakom dumbirom".
> In that case there is no need to convert the user's search terms, because
> both "juha s češnjakom" and "juha s cesnjakom" would return the same result.
>
> My question is:
> Is there any reason to switch to this alternative logic and have the original
> keywords indexed in parallel with those converted to ASCII?
>
> Thanks!
>
> BR,
> Hrvoje

-- 
{{ Horvoje.net <https://horvoje.net/> ~~ VegCook.net <https://vegcook.net/> ~~ TheVegCat.com <https://thevegcat.com:9999/> ~~ Cuspajz.com <https://cuspajz.com/> ~~ VintageZagreb.net <https://vintagezagreb.net/> ~~ Sterilizacija.org <https://sterilizacija.org/> ~~ SmijSe.com <https://smijse.com/> ~~ HTMLutil.net <https://htmlutil.net/> ~~ HTTPinfo.net <https://httpinfo.net/> }}
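For the alternative in the quoted message (original words kept, ASCII variants appended), a rough sketch in plain Java could look like the following. `DualIndexTerms`, `fold` and `expand` are hypothetical names. The sketch assumes whitespace tokenization and that stop words such as "s" and "i" have already been removed, and it does not show how the result would be wired into Hibernate Search.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating "original plus ASCII variants" in one field.
final class DualIndexTerms {

    // Folds Croatian letters to ASCII (assumption: d with stroke maps to d/D).
    static String fold(String text) {
        String replaced = text.replace('\u0111', 'd').replace('\u0110', 'D');
        String decomposed = Normalizer.normalize(replaced, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Keeps every original token and appends the folded form of each token
    // that actually changed, in the original order.
    static String expand(String text) {
        List<String> terms = new ArrayList<>();
        List<String> foldedVariants = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for blank input
            }
            terms.add(token);
            String ascii = fold(token);
            if (!ascii.equals(token)) {
                foldedVariants.add(ascii);
            }
        }
        terms.addAll(foldedVariants);
        return String.join(" ", terms);
    }

    public static void main(String[] args) {
        // The argument is "juha češnjakom đumbirom", written with Unicode escapes.
        System.out.println(expand("juha \u010De\u0161njakom \u0111umbirom"));
    }
}
```

With this layout, a query typed with or without Croatian letters can match without being converted first. Inside a Lucene analysis chain, `ASCIIFoldingFilter` with its `preserveOriginal` option enabled is a ready-made way to emit both forms of a token; whether it fits this particular bridge setup is not something the thread confirms.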