Hi,

Thanks a lot for this detailed study.

sort-string-utf.chr does indeed exist, in etc/zebradb/lang_defs/xx.
It depends on the language chosen at install, because it contains the prefixes to be ignored when sorting, for example:
  map (^The\s)    @
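
As a rough illustration of what such a rule achieves, here is a minimal Python sketch. It is only a stand-in for the effect on sort order, not Zebra's implementation, and the name sort_key is made up for the example.

```python
import re

# Hypothetical stand-in for the rule "map (^The\s) @": drop a leading "The "
# before comparing titles. Zebra's own implementation is not shown here.
LEADING_ARTICLE = re.compile(r"^The\s")

def sort_key(title):
    """Return the title with a leading 'The ' removed, for sorting only."""
    return LEADING_ARTICLE.sub("", title)

titles = ["The Zebra", "Apple", "Theory"]
print(sorted(titles, key=sort_key))  # ['Apple', 'Theory', 'The Zebra']
```

"Theory" is left alone because the pattern requires whitespace after "The".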

CHR is indeed not perfect.
We use ICU; for a French catalog it seems OK.
One advantage is that you can really customize tokenization via the xxxx-icu.xml config files.

Regards,

On 18/11/2015 01:25, David Cook wrote:
Hi all:



Yet another Zebra email from this guy.



I don’t know how many of you are using CHR vs ICU, but CHR is the default for 
installs, so I’m guessing that it’s quite a few.



Well, there are some issues with how we use the equivalent directive. Hopefully 
the UTF-8 won’t be stripped out of this message, although I’m guessing it might…



Here are all instances of the directive in word-phrase-utf.chr:



# Characters to be considered equivalent for sorting purposes
equivalent aáàãåâăąȧǎȁȃ
equivalent ӕä(ae)
equivalent ā(aa)
equivalent iíìîịĩĭįǐȉȋ
equivalent ï(ie)
equivalent ī(ii)
equivalent uúùûũŭųűǔȕȗ
equivalent ü(ue)
equivalent ū(uu)
equivalent eéèêẽĕęėěȅȇ
equivalent ëē(ee)
equivalent oóòõôŏǫȯőǒȍȏ
equivalent Œœöø(oe)
equivalent ō(oo)



Firstly, that comment is wrong. “equivalent” isn’t just for sorting purposes. 
It’s for searching purposes. Index Data have confirmed that the documentation is 
wrong about the sorting thing.



So “ie” and ï (if you can’t see this character, it’s an i with a diaeresis) are 
equivalent. That means a search for “siemon” will get results for both 
“siemon” and “sïmon”.



Now, there is also a “map” directive:



map ï                                     i



This means that “sïmon” is the same as “simon”. Now, “map” affects both 
indexing and searching. If you have “sïmon” in a record, you can see that it is 
actually stored as “simon” in Zebra if you search for it with “format xml” and 
“elements zebra::index”.



So your search for “siemon” will really get results for “siemon” and “simon”.
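
To make the interaction concrete, here is a toy Python model of the two directives. It is not Zebra's code: the function names and the record list are invented, and it assumes equivalents are expanded at search time before “map” is applied, which is how I read the Zebra documentation.

```python
# Toy model of the two directives quoted above. This is NOT Zebra's code;
# the function names and the record list are made up for illustration.
EQUIVALENT = [["ï", "ie"]]   # from: equivalent ï(ie)
MAP = {"ï": "i"}             # from: map ï i

def apply_map(term):
    """Replace each mapped character (used for indexing and searching)."""
    return "".join(MAP.get(ch, ch) for ch in term)

def expand_equivalents(term):
    """Return the term plus variants with equivalent strings swapped in."""
    variants = {term}
    for group in EQUIVALENT:
        for member in group:
            for other in group:
                if other != member:
                    for variant in list(variants):
                        if member in variant:
                            variants.add(variant.replace(member, other))
    return variants

def search_keys(query):
    """Assumed order: expand equivalents first, then apply map to each variant."""
    return {apply_map(v) for v in expand_equivalents(query)}

records = ["siemon", "sïmon", "simon"]
index = {word: apply_map(word) for word in records}  # what gets stored
keys = search_keys("siemon")
hits = [word for word in records if index[word] in keys]
print(sorted(keys))  # ['siemon', 'simon']
print(hits)          # ['siemon', 'sïmon', 'simon']
```

In this model the record “simon” is returned for the query “siemon” even though neither directive relates the two spellings on its own.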



This really isn’t ideal. However, you can see why you’d want equivalences. In 
Scandinavian languages, I think “å” and “aa” are roughly equivalent. They’re 
spelled differently but they’re the same sound. So if you search for “Gaard”, 
you might want hits for “Gård” as well.



But you might not want “career” to be equivalent to “carer”, as they’re two 
different words. Or “choose” to be equivalent to “chose”, “sloop” to “slop”, 
“reef” to “ref”, etc.

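A quick way to look for such collisions in an English word list is sketched below in Python. It assumes that each long vowel in the “equivalent” lines (ā, ē, ī, ō, ū) is also mapped to its base vowel by a “map” line, which I have not shown above, so treat the DOUBLED table as a hypothesis rather than a description of the shipped files.

```python
# Assumption: each macron/diaeresis vowel in the "equivalent" lines is also
# mapped to its base vowel, so a doubled vowel can end up matching a single one.
DOUBLED = {"aa": "a", "ee": "e", "ii": "i", "oo": "o", "uu": "u"}

def collapsed(word):
    """Return the word with each doubled vowel reduced to a single vowel."""
    for double, single in DOUBLED.items():
        word = word.replace(double, single)
    return word

def collisions(words):
    """List (word, other) pairs where a search for word would also match other."""
    known = set(words)
    return [(w, collapsed(w)) for w in words
            if collapsed(w) != w and collapsed(w) in known]

words = ["career", "carer", "choose", "chose", "sloop", "slop", "reef", "ref", "zebra"]
print(collisions(words))
# [('career', 'carer'), ('choose', 'chose'), ('sloop', 'slop'), ('reef', 'ref')]
```
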


--



Unfortunately, I don’t really know what the solution is. For one client, I’ve 
disabled the equivalent directive wherever it creates an equivalence between a 
two-letter combination and a single letter, as they only have records in 
English, and it would just cause them headaches.
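
For anyone wanting to try the same thing, this is a small Python sketch that comments out the affected lines in the text of a .chr file. The function name is made up, and commenting out the whole line is my simplification: it also drops any single-character equivalences on that line (for example ӕ and ä on the “(ae)” line), so check the result by hand.

```python
import re

EQUIVALENT_LINE = re.compile(r"^\s*equivalent\s")
MULTI_CHAR = re.compile(r"\([^()]{2,}\)")  # a parenthesised group such as (ae)

def disable_multichar_equivalents(chr_text):
    """Comment out 'equivalent' lines that contain a multi-letter group."""
    out = []
    for line in chr_text.splitlines():
        if EQUIVALENT_LINE.match(line) and MULTI_CHAR.search(line):
            out.append("# " + line)
        else:
            out.append(line)
    return "\n".join(out)

sample = "equivalent aáàãåâ\nequivalent ā(aa)\nmap ï i"
print(disable_multichar_equivalents(sample))
```

Records need to be reindexed or at least searches retested after such a change, since the directive affects how queries are matched.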



I can see this being useful for multilingual records… although I think many 
people with multilingual records use ICU. I don’t know ICU well enough to know 
how it manages characters that English speakers would think of as accents or 
ligatures. I know you can provide your own normalization with ICU, but I think 
it does a fair amount on its own as well…



I think some of the difficulties are mentioned here: 
http://userguide.icu-project.org/collation/icu-string-search-service. It also 
mentions the Danish å/aa example. I don’t know how ICU would know how to handle 
particular languages… that webpage seems to indicate you can provide a locale 
to deal with it.



Of course, that doesn’t necessarily solve things. If you have multilingual 
records with multilingual users, how do you choose your rules? Sure, you might 
be able to specify a locale at search time (note you can’t do this with Zebra), 
but what rules did you specify at index time?



As anyone who has watched this video 
(https://www.youtube.com/watch?v=0j74jcxSunY) would know, internationalis(z)ing 
code has many challenges…



--



Anyway, the reason for this email is mostly just to make you all aware of this 
issue, and how “equivalent” and “map” work in the Charmap files when using CHR 
indexing.



Oh, also, if you look at “default.idx”, you’ll see that “sort s” references 
“charmap sort-string-utf.chr”, but I don’t think sort-string-utf.chr actually 
exists anywhere…



David Cook

Systems Librarian

Prosentient Systems

72/330 Wattle St, Ultimo, NSW 2007






_______________________________________________
Koha-devel mailing list
[email protected]
http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/


--
Fridolin SOMERS
Biblibre - Pôles support et système
[email protected]