I think it depends on how precise you want the search to be. If you want diacritic-sensitive search, so that users who do type the diacritics are not shown confusable matches, you can index both forms (ASCII-folded and unfolded) and leave the query terms unnormalized. Or you can just fold everything and not worry about it. In French, I know there are confusable words such as "cote", which has at least a few different meanings depending on the accents (cote, côte, côté). Not sure how it is in Croatian.
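
To make the two options concrete, here is a minimal sketch in plain Java (no Lucene dependency). The class and method names are hypothetical, and the whitespace tokenization is a simplifying assumption, not how a real analyzer works. Note that đ has no Unicode decomposition, so it needs an explicit mapping; I map it to "d" here to match the example in the thread, not "dj". In Lucene itself, I believe ASCIIFoldingFilter with its preserveOriginal option covers roughly the same ground at the analyzer level.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class DiacriticFolding {

    // Fold one string to a lower-case form without diacritics.
    static String fold(String text) {
        // d with stroke (U+0111 / U+0110) has no canonical decomposition,
        // so NFD alone would leave it untouched; map it explicitly.
        String s = text.replace('\u0111', 'd').replace('\u0110', 'D');
        // NFD splits e.g. c-with-caron into 'c' plus a combining caron.
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Drop the combining marks that NFD separated out.
        s = s.replaceAll("\\p{M}+", "");
        return s.toLowerCase(Locale.ROOT);
    }

    // Option 1: index only folded tokens (queries must be folded too).
    static List<String> foldedTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                out.add(fold(tok));
            }
        }
        return out;
    }

    // Option 2: index the original token and, where it differs,
    // the folded token as well (queries are left as typed).
    static List<String> originalAndFoldedTokens(String text) {
        Set<String> out = new LinkedHashSet<>(); // keeps order, drops duplicates
        for (String tok : text.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue;
            }
            String lower = tok.toLowerCase(Locale.ROOT);
            out.add(lower);
            out.add(fold(lower));
        }
        return new ArrayList<>(out);
    }
}
```

With option 2, a query for "cesnjakom" matches the folded token and a query for "češnjakom" matches only documents that really contain the accented form, which is what gives you the diacritic-sensitive behaviour. The cost is a somewhat larger index.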

On Fri, Sep 23, 2022 at 5:30 AM Hrvoje Lončar <horv...@gmail.com> wrote:
>
> Hi Stephane!
>
> Actually, I have exactly that kind of conversion, but I didn't mention
> it as my mail was long enough without it :)
> My main concern is whether I should let Lucene index the original
> keywords or not.
> Considering what you wrote, I guess your answer would be to store only
> converted values without exotic characters.
>
> Thanks a lot for your reply!
>
> BR,
> Hrvoje
>
> On Thu, Sep 22, 2022 at 7:53 PM Stephane Passignat
> <passig...@hotmail.com> wrote:
>>
>> Hello,
>>
>> The way I did it took me some time, and I am almost sure it's
>> applicable to all languages.
>>
>> I normalized the words, replacing letters or groups of letters with
>> a similar-sounding one.
>>
>> In French, e é è ê ai ei sound much the same, and for someone who
>> makes spelling mistakes, having to use the right letters is very
>> frustrating. So I transformed all of them into e...
>>
>> Hope it helps
>>
>> On Sep 22, 2022, at 16:37, "Hrvoje Lončar" <horv...@gmail.com> wrote:
>>
>> Hi!
>>
>> I'm using Hibernate Search / Lucene to index my entities in a Spring
>> Boot application.
>>
>> One thing I'm not sure about is how to handle Croatian-specific
>> letters.
>> The Croatian language has a few additional letters:
>> "č Č ć Ć đ Đ š Š ž Ž".
>> The letters "đ Đ" are commonly replaced with "dj DJ" when no Croatian
>> letters are available.
>>
>> In my custom Hibernate bridge there is a step that replaces all
>> Croatian characters with appropriate ASCII replacements, which means
>> "č" becomes "c", "š" becomes "s" and so on.
>> Later, when a user enters search text, the same process is applied so
>> that it matches the values in the index.
>> There is one more good thing about it: some older users, who used
>> computers in the early days when no Croatian letters were available,
>> type words without Croatian letters, automatically replacing "č" with
>> "c", and that fits my logic for getting good search results.
>>
>> For example, the title of my entity is: "juha s češnjakom i đumbirom".
>> My custom Hibernate String bridge converts it to
>> "juha cesnjakom dumbirom".
>> Then the user enters "juha s češnjakom".
>> Before issuing a search, the same conversion is applied to the user's
>> query, and the text sent to Lucene is "juha cesnjakom".
>> This is how I implemented it, and it's working fine.
>>
>> The other way would be to index the original text, then find the
>> words with Croatian characters, convert them to ASCII and add them to
>> the original.
>> The title "juha s češnjakom i đumbirom" would become
>> "juha češnjakom đumbirom cesnjakom dumbirom".
>> In that case there is no need to convert the user's search terms,
>> because both "juha s češnjakom" and "juha s cesnjakom" would return
>> the same result.
>>
>> My question is:
>> Is there any reason to switch to this alternative logic and have the
>> original keywords indexed in parallel with those converted to ASCII?
>>
>> Thanks!
>>
>> BR,
>> Hrvoje
>
> --
> {{ Horvoje.net ~~ VegCook.net ~~ TheVegCat.com ~~ Cuspajz.com ~~
> VintageZagreb.net ~~ Sterilizacija.org ~~ SmijSe.com ~~ HTMLutil.net ~~
> HTTPinfo.net }}
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org