Re: Best practice - preparing search term for Lucene

Stephane Passignat Fri, 23 Sep 2022 10:25:53 -0700

Hi

I wouldn't store the original value. That's "just" an index. But do store the 
value of your DB identifiers, because I think you'll want them at some point. (I 
built the same kind of feature on top of DataNucleus.)


I'm used to having technical IDs in my DB, even more so since I started using 
JDO/JPA some 20 years ago.

With Lucene I would also suggest storing a display-ready view of each entity. 
This gives you ready-to-display info without querying the DB.
As you won't be able to index a whole big database in one go, think about how 
the indexer restarts. Having a numeric ID and a last-update field helped me.
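
One way a numeric ID and a last-update field might support restarting the indexer is sketched below. This is an assumption about how such fields could be used, not the implementation described above: `ResumableIndexer`, `Row` and `nextBatch` are hypothetical names, and an in-memory list stands in for the database table.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these names come from Lucene, JDO or JPA.
final class ResumableIndexer {

    // One database row as the indexer sees it.
    static final class Row {
        final long id;               // numeric technical ID
        final long lastUpdateMillis; // last-update field, epoch milliseconds
        final String display;        // display-ready text stored with the index entry

        Row(long id, long lastUpdateMillis, String display) {
            this.id = id;
            this.lastUpdateMillis = lastUpdateMillis;
            this.display = display;
        }
    }

    // Returns the next rows to index after the saved checkpoint, ordered by
    // (lastUpdateMillis, id) so that a restart continues where the last run stopped.
    static List<Row> nextBatch(List<Row> table, long lastUpdateSeen, long lastIdSeen,
                               int batchSize) {
        return table.stream()
                .filter(r -> r.lastUpdateMillis > lastUpdateSeen
                        || (r.lastUpdateMillis == lastUpdateSeen && r.id > lastIdSeen))
                .sorted(Comparator.comparingLong((Row r) -> r.lastUpdateMillis)
                        .thenComparingLong(r -> r.id))
                .limit(batchSize)
                .collect(Collectors.toList());
    }
}
```

After each batch, the indexer would persist the `lastUpdateMillis` and `id` of the last row it handled and pass them in on the next run. The ID breaks ties between rows that share the same timestamp.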

Have you thought about numbers?



On 23 Sept. 2022, at 09:30, "Hrvoje Lončar" <horv...@gmail.com> wrote:
Hi Stephane!

Actually, I have exactly that kind of conversion, but I didn't mention it as my 
mail was long enough without it :)
My main concern is whether I should let Lucene index the original keywords or 
not.
Considering what you wrote, I guess your answer would be to store only the 
converted values without exotic characters.

Thanks a lot for your reply!

BR,
Hrvoje

On Thu, Sep 22, 2022 at 7:53 PM Stephane Passignat <passig...@hotmail.com> 
wrote:
Hello,

The way I did it took me some time, and I'm almost sure it's applicable to all 
languages.

I normalized the words, replacing letters or groups of letters with a 
similar-sounding one.

In French, e, é, è, ê, ai and ei sound a bit the same, and for someone who 
makes spelling mistakes, having to use the right letters is very frustrating. 
So I transformed all of them into e...
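
A minimal sketch of this kind of normalization, assuming Java. The class name `PhoneticFold` is hypothetical, and the rules cover only the spellings listed above; the real rule set was presumably larger.

```java
import java.util.Locale;

// Hypothetical helper, not the code described above: folds a few French
// spellings that sound roughly alike into a plain "e".
final class PhoneticFold {

    static String fold(String word) {
        String s = word.toLowerCase(Locale.ROOT);
        // Digraphs first, so "ai" and "ei" are replaced as a unit.
        s = s.replaceAll("ai|ei", "e");
        // Then e with acute (U+00E9), grave (U+00E8) and circumflex (U+00EA).
        // Unicode escapes keep this source file ASCII-only.
        s = s.replaceAll("[\u00e9\u00e8\u00ea]", "e");
        return s;
    }

    public static void main(String[] args) {
        // "foret" written with e-circumflex and "forait" fold to the same key.
        System.out.println(fold("for\u00eat"));
        System.out.println(fold("forait"));
    }
}
```

Applying the same `fold` to both the indexed words and the search terms means that "forêt" and "forait" end up under one key, so a misspelled query still finds the entry.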

Hope it helps

On 22 Sept. 2022, at 16:37, "Hrvoje Lončar" <horv...@gmail.com> wrote:

Hi!

I'm using Hibernate Search / Lucene to index my entities in a Spring Boot
application.

One thing I'm not sure about is how to handle Croatian-specific letters.
The Croatian language has a few additional letters: "*č* *Č* *ć* *Ć* *đ* *Đ*
*š* *Š* *ž* *Ž*".
The letters "*đ* *Đ*" are commonly replaced with "*dj* *DJ*" when no Croatian
letters are available.

In my custom Hibernate bridge there is a step that replaces all Croatian
characters with appropriate ASCII replacements, which means "*č*" becomes
"*c*", "*š*" becomes "*s*" and so on.
Later, when a user enters search text, the same process is applied so that it
matches values from the index.
There is one more good thing about it: some older users who used computers in
the early days, when no Croatian letters were available, type words without
Croatian letters, automatically replacing "*č*" with "*c*", and that fits my
logic for getting good search results.
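
The replacement step could look roughly like the sketch below. It is an illustration in plain Java, not the actual bridge: the class name `CroatianFold` is hypothetical, and it maps "đ" to "d" as in the example rather than to "dj".

```java
import java.text.Normalizer;

// Hypothetical helper illustrating the replacement step; not the actual bridge.
final class CroatianFold {

    static String fold(String text) {
        // d with stroke (U+0111, U+0110) has no Unicode decomposition,
        // so it has to be mapped by hand.
        String s = text.replace('\u0111', 'd').replace('\u0110', 'D');
        // NFD splits c, s and z with caron or acute into base letter + combining mark.
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Dropping the combining marks leaves the plain ASCII base letters.
        return s.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only:
        // U+010D is c with caron, U+0161 is s with caron.
        System.out.println(fold("\u010de\u0161njakom"));
        System.out.println(fold("\u0111umbirom"));
    }
}
```

The explicit mapping for "đ" matters because `Normalizer` alone leaves that letter unchanged.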

For example, the title of my entity is: "*juha s češnjakom i đumbirom*".
My custom Hibernate String bridge converts it to "*juha cesnjakom dumbirom*".
Then the user enters "*juha s češnjakom*".
Before issuing a search, the same conversion is applied to the user's query,
and the text sent to Lucene is "*juha cesnjakom*".
This is how I implemented it and it's working fine.
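
The flow above, with one conversion shared by indexing and searching, can be sketched as follows. The names `SharedNormalizer`, `normalize` and `matches` are hypothetical, and two details are assumptions: the stop-word list is inferred from the example, where the short words "s" and "i" disappear, and lowercasing is added for convenience.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: one normalizer used for both the indexed title and the query.
final class SharedNormalizer {

    // Assumed stop words, inferred from the example where "s" and "i" are dropped.
    private static final Set<String> STOP_WORDS = Set.of("s", "i");

    static String normalize(String text) {
        // Lowercasing is an assumption; d with stroke (U+0111) is mapped by hand.
        String s = text.toLowerCase(Locale.ROOT).replace('\u0111', 'd');
        // Decompose, then strip combining marks to get plain ASCII letters.
        s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        return Arrays.stream(s.trim().split("\\s+"))
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t))
                .collect(Collectors.joining(" "));
    }

    // True when every normalized query term occurs among the normalized title terms.
    static boolean matches(String title, String query) {
        Set<String> indexed = new HashSet<>(Arrays.asList(normalize(title).split(" ")));
        return indexed.containsAll(Arrays.asList(normalize(query).split(" ")));
    }
}
```

Because the title and the query go through the same `normalize` call, a query typed with or without Croatian letters produces the same terms.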

The other way would be to index the original text, then find the words with
Croatian characters, convert them to ASCII and add them to the original.
The title "*juha s češnjakom i đumbirom*" would become "*juha češnjakom
đumbirom cesnjakom dumbirom*".
In that case there is no need to convert users' search terms, because both
"*juha s češnjakom*" and "*juha s cesnjakom*" would return the same result.
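
A sketch of this alternative, under the same assumptions as the example title (the short words "s" and "i" are dropped, "đ" maps to "d"). The class `DualFormIndexText` and its methods are hypothetical names for illustration.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: keeps the original words and appends their ASCII forms.
final class DualFormIndexText {

    // Assumed stop words, inferred from the example title.
    private static final Set<String> STOP_WORDS = Set.of("s", "i");

    static String fold(String token) {
        // d with stroke (U+0111, U+0110) needs an explicit mapping.
        String s = token.replace('\u0111', 'd').replace('\u0110', 'D');
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Returns the original tokens followed by the folded form of each token
    // that actually changed when folded.
    static String expand(String text) {
        List<String> originals = new ArrayList<>();
        List<String> folded = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            originals.add(token);
            String ascii = fold(token);
            if (!ascii.equals(token)) {
                folded.add(ascii);
            }
        }
        originals.addAll(folded);
        return String.join(" ", originals);
    }
}
```

For the example title, `expand` yields the five-word text quoted above. The trade-off is a larger index in exchange for skipping the conversion at query time.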

My question is:
Is there any reason to switch to this alternative logic and have original
keywords indexed in parallel with those converted to ASCII?

Thanks!

BR,
Hrvoje


--
{{  Horvoje.net<https://horvoje.net/> ~~  VegCook.net<https://vegcook.net/>   
~~   TheVegCat.com<https://thevegcat.com:9999/> ~~  
Cuspajz.com<https://cuspajz.com/> ~~ 
VintageZagreb.net<https://vintagezagreb.net/> ~~  
Sterilizacija.org<https://sterilizacija.org/>  ~~   
SmijSe.com<https://smijse.com/> ~~  HTMLutil.net<https://htmlutil.net/> ~~ 
HTTPinfo.net<https://httpinfo.net/> }}
