Re: [Koha] How to make the Koha/Zebra search ignore hyphens?

Michael Kuhn Wed, 25 Sep 2019 11:59:59 -0700

Hi David

> I'm glad that I got you a bit further on your journey. It's a shame
> about having to use the CHR indexing. You can find more information
> here at https://software.indexdata.com/zebra/doc/character-map-files.html.
>
> After reading through that, I'm thinking perhaps that CHR indexing
> can't help you.


Thanks for your assessment!

> You could ask Indexdata for more information, but I'm guessing it
> can't be done with CHR. It should be doable with ICU though.

So I tried to switch from the Koha standard CHR to ICU according to
https://wiki.koha-community.org/wiki/ICU_chains_configuration, just
using the original configuration of "words-icu.xml" and
"phrases-icu.xml", then restarting Zebra and reindexing. But I got a
very unexpected result. Now a catalog search


* for "Sintiswing" shows 1 hit

* for "Sinti-Swing" shows 4'222 hits, the hyphen seems to be ignored
completely and everything is found that contains either "Sinti" OR
"Swing" or both

* for "Sinti Swing" shows 18 hits, the hyphen is used as a breaking
character, so any record containing "Sinti-Swing" or "Sinti" AND "Swing"
is found, but not "Sintiswing"

In short: The Koha standard configuration of ICU ("words-icu.xml" and
"phrases-icu.xml") seems defective to me. The results are much worse
than what CHR gives. And of course the desired result isn't there yet
anyway.
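
To make the difference between these hit counts concrete, here is a toy
Python sketch of AND versus OR matching over tokens. The records and the
`search` helper are made up for illustration, and whether Zebra really
falls back to OR matching in this configuration is only an assumption
based on the hit counts above, not something confirmed.

```python
# Hypothetical mini-index: record id -> set of indexed tokens.
RECORDS = {
    "A": {"sinti", "swing"},
    "B": {"sinti"},
    "C": {"swing", "jazz"},
    "D": {"sintiswing"},
}


def search(query_tokens, mode):
    """Return ids of records matching all ("and") or any ("or") query tokens."""
    combine = all if mode == "and" else any
    return sorted(
        rid
        for rid, tokens in RECORDS.items()
        if combine(q in tokens for q in query_tokens)
    )


# AND: only records containing both tokens match.
print(search(["sinti", "swing"], "and"))  # ['A']
# OR: any record containing either token matches, hence far more hits.
print(search(["sinti", "swing"], "or"))   # ['A', 'B', 'C']
# The unhyphenated spelling is a separate token and matches neither query.
print(search(["sintiswing"], "and"))      # ['D']
```

In this toy model, neither variant finds record "D" for the two-token
query, which mirrors the problem that "Sintiswing" is never found.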

Do you maybe have a hint where to find some documentation about how to
change the behaviour of ICU indexing in the desired way?


Best wishes: Michael
--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E [email protected] · W www.adminkuhn.ch





On 25.09.19 at 08:34, [email protected] wrote:

Hi Michael,

I'm glad that I got you a bit further on your journey. It's a shame about 
having to use the CHR indexing. You can find more information here at 
https://software.indexdata.com/zebra/doc/character-map-files.html.

After reading through that, I'm thinking perhaps that CHR indexing can't help 
you.

You could ask Indexdata for more information, but I'm guessing it can't be done 
with CHR. It should be doable with ICU though.

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Australia

Office: 02 9212 0899
Direct: 02 8005 0595

-----Original Message-----
From: Michael Kuhn <[email protected]>
Sent: Wednesday, 25 September 2019 4:47 AM
To: [email protected]; [email protected]
Subject: Re: [Koha] How to make the Koha/Zebra search ignore hyphens?

Hi David

Many thanks for your reply and the hints!

After a standard installation of Koha 18.11 the CHR indexing is used, thus the 
configuration is done in file "word-phrase-utf.chr".

A catalog search
* for "Sintiswing" shows 1 hit
* for "Sinti-Swing" shows 18 hits, the hyphen is used as a breaking
character, so any record containing "Sinti-Swing" or "Sinti" and "Swing"
is found, but not "Sintiswing"

I changed the following line, omitting the hyphen (between comma and dot):

space
{\001-\040}!"#$%&'\()*+,./:;<=>?@\[\\]^_`\{|}~’{\x88-\x89}{\x98-\x9C}¡¿«»

After a Zebra reindexing a catalog search
* for "Sintiswing" shows 1 hit
* for "Sinti-Swing" now shows only 8 hits, the hyphen is no longer used
as a breaking character, so any record containing "Sinti Swing" or
"Sinti-Swing" is found, but not "Sintiswing"

I also tried to add "map (-) @" but this leads to the original results.

In short: My change of configuration didn't lead to the desired result... If searching for 
"Sintiswing" also "Sinti-Swing" should be found, and vice versa. This is not 
the case.

Since I couldn't find any documentation about CHR indexing - does anyone know 
where to find out more about the CHR way of indexing?

Best wishes: Michael
--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E [email protected] · W www.adminkuhn.ch



On 19.09.19 at 03:29, [email protected] wrote:

Hi Michael,

That's really interesting. I assume that you're using ICU indexing?

You could update "phrases-icu.xml" and "words-icu.xml" to strip out hyphens. 
You would need to re-index all your records afterwards though.

I haven't actually tested that particular change, but from a quick look at
both ICU and CHR, it looks like hyphens are used to tokenize. Currently,
when you search "Tee-Ei", you're actually searching for "Tee" and "Ei".

If you're using ICU, you could add a transform rule before the tokenize rule to remove the hyphen. 
This would prevent it from tokenizing and then "Tee-Ei" and "Teeei" should 
retrieve the same records.
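
As a rough illustration of why the order matters, here is a small Python
sketch that mimics the idea of removing hyphens before tokenizing. It is
not Zebra or ICU code; the function `index_tokens` and its
`strip_hyphens` flag are hypothetical names, and the tokenizer is a
simple regular expression rather than real ICU word breaking.

```python
import re


def index_tokens(text, strip_hyphens):
    """Toy indexing chain: optional hyphen removal, tokenize, lowercase."""
    if strip_hyphens:
        # Assumed extra step: drop hyphens *before* tokenizing,
        # so they can no longer act as word boundaries.
        text = text.replace("-", "")
    # Tokenize: every run of word characters becomes one token.
    tokens = re.findall(r"\w+", text)
    # Lowercase so that "TeeEi" and "Teeei" compare equal.
    return [token.lower() for token in tokens]


# Without hyphen removal the hyphen splits the word in two.
print(index_tokens("Tee-Ei", strip_hyphens=False))  # ['tee', 'ei']
# With hyphen removal both spellings yield the same single token.
print(index_tokens("Tee-Ei", strip_hyphens=True))   # ['teeei']
print(index_tokens("Teeei", strip_hyphens=True))    # ['teeei']
```

Since the same normalization would have to apply at index time and at
query time, a full re-index is needed after changing it.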

Beware also that this is a universal change. You might want to check to see if 
there are hyphens that shouldn't be removed. If so, you may need to make a more 
complex rule to try to just capture the desired cases.

If you're using CHR, you can take a look at word-phrase-utf.chr and remove - from the 
"Breaking characters" section. You may or may not also need to map it. I'm less 
familiar with CHR indexing.

Anyway, I hope that helps.

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Australia

Office: 02 9212 0899
Direct: 02 8005 0595

-----Original Message-----

Date: Wed, 18 Sep 2019 22:46:15 +0200
From:   
To: "Koha : access" <[email protected]>
Subject: [Koha] How to make the Koha/Zebra search ignore hyphens?
Message-ID: <[email protected]>
Content-Type: text/plain; charset=utf-8; format=flowed

Hi

We have found that, at least in German, there are words or combinations
of words that can be written in different ways, where both spellings are
correct and mean the same, e.g.

* Ultraschallmessgerät = Ultraschall-Messgerät
* Sintiswing = Sinti-Swing
* Teeei = Tee-Ei
* Haftpflichtversicherungsgesellschaft = Haftpflicht-Versicherungsgesellschaft

This is a general concept in German, so it makes no sense to add a "used
for/see from:" in the authority data. Anyway, such words can exist
everywhere in the bibliographic record, not only in fields linked to
authority fields.

Now the question: is there a way to teach Koha (or Zebra) to look
for the second term also when the first term is searched, and vice
versa? Or shorter: Just to ignore the hyphens? Using the standard
configuration Koha will not find the second term if the first one is
searched, and vice versa.

We would appreciate any hint or tip!

Best wishes: Michael



_______________________________________________
Koha mailing list  http://koha-community.org
[email protected]
https://lists.katipo.co.nz/mailman/listinfo/koha
