Hi Katrin, 

I did create a transliterate rule that converts the input from "Sinti-Swing" 
to "SintiSwing Sinti Swing", which produces the tokens "sintiswing", "sinti", 
and "swing". However, I think the ICU chain file is used for both indexing and 
searching, so while the rule would work for indexing, it wouldn't work for 
searching: the query would be expanded the same way, and you'd get hits for 
irrelevant results.
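
To make the effect of that rule concrete, here is a rough Python sketch of the 
expansion. It is not the actual ICU transliterate rule; the regex and the 
function names (expand_hyphens, tokenize) are made up for illustration, and it 
only handles a single hyphen between two words.

```python
import re

# Hypothetical stand-in for the transliterate rule: matches "A-B".
HYPHENATED = re.compile(r"(\w+)-(\w+)")

def expand_hyphens(text: str) -> str:
    """Rewrite 'A-B' as 'AB A B' (joined form first, then the two parts)."""
    return HYPHENATED.sub(r"\1\2 \1 \2", text)

def tokenize(text: str) -> list[str]:
    """Expand hyphenated words, lowercase, and split on whitespace."""
    return expand_hyphens(text).lower().split()

print(tokenize("Sinti-Swing"))  # ['sintiswing', 'sinti', 'swing']
```

If the same function runs on the query as well, a search for "Sinti-Swing" 
turns into three separate terms, which is where the irrelevant hits would come 
from.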

The only other thing I can think of is modifying biblio-zebra-indexdefs.xsl to 
send the input data twice: once with the hyphen and once without it. I think 
that would be hugely laborious, though. (I just noticed we have a 
chopPunctuation template in biblio-zebra-indexdefs.xsl which I don't think 
actually gets used.)

I did notice something interesting today when I was looking at ICU: 
http://userguide.icu-project.org/boundaryanalysis. Observe the following:

Line break:
|Parlez-|vous |français ?|

Word break:
|Parlez|-|vous| |français| |?|

At the moment, we use line break in ICU. I suppose there isn't a huge 
difference between the two. But I thought it was interesting. I hadn't really 
thought about it before. 

It feels like there should be a way of having "Mont-Royal" indexed as 
"MontRoyal" as well as "Mont" and "Royal". Currently it is indexed only as 
"Mont" and "Royal", so retrieving only the relevant "Mont-Royal" records 
requires an exact phrase search for "Mont Royal", which enforces the proximity 
of the two words. It would be nice for "Mont-Royal" to retrieve records indexed 
as "MontRoyal", while "Royal" would still retrieve records indexed as "Royal".

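As a thought experiment, this is roughly the behaviour I have in mind, written 
as a toy inverted index in Python. It assumes the index-time and query-time 
analysis can differ, which I don't think Zebra's ICU chain allows, and every 
name and record in it is invented for the example.

```python
from collections import defaultdict

def index_tokens(text):
    """Index-time analysis: emit each part of a hyphenated word plus the joined form."""
    tokens = []
    for word in text.lower().split():
        parts = [p for p in word.split("-") if p]
        tokens.extend(parts)
        if len(parts) > 1:
            tokens.append("".join(parts))
    return tokens

def query_tokens(text):
    """Query-time analysis: join hyphenated words, never split them."""
    joined = (word.replace("-", "") for word in text.lower().split())
    return [token for token in joined if token]

def build_index(records):
    """Map each token to the set of record ids whose title contains it."""
    index = defaultdict(set)
    for record_id, title in records.items():
        for token in index_tokens(title):
            index[token].add(record_id)
    return index

def search(index, query):
    """Return the ids of records that contain every query token."""
    postings = [index.get(token, set()) for token in query_tokens(query)]
    return set.intersection(*postings) if postings else set()

# Made-up sample records.
records = {1: "Parc du Mont-Royal", 2: "Royal Mont hotel", 3: "Royal Society"}
index = build_index(records)

print(search(index, "Mont-Royal"))  # {1}
print(search(index, "Royal"))       # {1, 2, 3}
```

With this split, "Mont-Royal" only matches the record that really contains the 
hyphenated form, while "Royal" on its own still finds everything indexed under 
"royal".
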
I wonder what Lucene-based search engines like Solr and Elasticsearch do 
there...

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Australia

Office: 02 9212 0899
Direct: 02 8005 0595

-----Original Message-----

Date: Tue, 24 Sep 2019 21:55:07 +0200
From: Katrin Fischer <[email protected]>
To: [email protected]
Subject: Re: [Koha] How to make the Koha/Zebra search ignore hyphens?
Message-ID: <[email protected]>
Content-Type: text/plain; charset=utf-8; format=flowed

Hi Michael,

we looked into this ages ago and it didn't seem possible to achieve both
- treating hyphen (-) as a space and not a space at the same time. Maybe
we missed something - If there is a solution, I'd be interested in a
how-to! :)

Katrin



_______________________________________________
Koha mailing list  http://koha-community.org
[email protected]
https://lists.katipo.co.nz/mailman/listinfo/koha
