Re: [Koha-devel] Problematic Zebra Charmaps Equivalences

Fridolin SOMERS Tue, 24 Nov 2015 00:08:13 -0800

Hi,

Thanks a lot for this complex study.


sort-string-utf.chr exists indeed, in etc/zebradb/lang_defs/xx.

It depends on the language chosen at install because it contains prefixes that are skipped for sorting, for example:

  map (^The\s)    @
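To illustrate what such a rule is for, here is a rough Python sketch. It assumes the rule's effect is simply to drop a leading "The " from the sort key; the function name sort_key is hypothetical and this is not how Zebra implements it.

```python
import re

# Hypothetical illustration only: mimics the apparent effect of
#   map (^The\s)    @
# by dropping a leading "The " before comparing titles.
LEADING_ARTICLE = re.compile(r"^The\s")

def sort_key(title):
    """Return a sort key with the leading English article removed."""
    return LEADING_ARTICLE.sub("", title).lower()

titles = ["The Zebra", "Koha", "The Apple"]
print(sorted(titles, key=sort_key))
```

A French install would need different prefixes (Le, La, Les, ...), which is why the file lives under a per-language directory.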

The CHR is indeed not perfect.
We use ICU, for a French catalog it seems OK.

A good point is that you can really customize tokenization via the xxxx-icu.xml config files.


Regards,

Le 18/11/2015 01:25, David Cook a écrit :

Hi all:

Yet another Zebra email from this guy.

I don’t know how many of you are using CHR vs ICU, but CHR is the default for
installs, so I’m guessing that it’s quite a few.

Well, there are some issues with how we use the equivalent directive. Hopefully
the UTF-8 won’t be stripped out of this message, although I’m guessing it might…

Here are all instances of the directive in word-phrase-utf.chr:

# Characters to be considered equivalent for sorting purposes

equivalent aáàãåâăąȧǎȁȃ

equivalent ӕä(ae)

equivalent ā(aa)

equivalent iíìîịĩĭįǐȉȋ

equivalent ï(ie)

equivalent ī(ii)

equivalent uúùûũŭųűǔȕȗ

equivalent ü(ue)

equivalent ū(uu)

equivalent eéèêẽĕęėěȅȇ

equivalent ëē(ee)

equivalent oóòõôŏǫȯőǒȍȏ

equivalent Œœöø(oe)

equivalent ō(oo)

Firstly, that comment is wrong. “equivalent” isn’t just for sorting purposes.
It’s for searching purposes. Indexdata have confirmed that the documentation is
wrong about the sorting thing.

So “ie” and ï (if you can’t see this character, it’s the UTF-8 representation of
&iuml;) are equivalent. That means searches for “siemon” will get results for
“siemon” and “sïmon”.

Now, there is also a “map” directive:

map ï i

This means that “sïmon” is the same as “simon”. Now, “map” affects both
indexing and searching. If you have “sïmon” in a record, you can see that it is
actually stored as “simon” in Zebra, if you do a search for it and use “format
xml” and “elements zebra::index”.

So your search for “siemon” will really get results for “siemon” and “simon”.
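The interaction can be modelled with a small Python sketch. This is a simplified simulation under my own assumptions, not Zebra's actual code: the names MAP, EQUIVALENT, normalize and expand_query are hypothetical, and only the single "ï" rule from above is included.

```python
# Simplified model of the two directives discussed above.
# Hypothetical names; this is not Zebra's implementation.
MAP = {"ï": "i"}              # from: map ï i
EQUIVALENT = [("ï", "ie")]    # from: equivalent ï(ie)

def normalize(term):
    """Apply the map rules, as happens at both index and search time."""
    return "".join(MAP.get(ch, ch) for ch in term.lower())

def expand_query(term):
    """Return the set of index forms that a query term can match."""
    variants = {term.lower()}
    for char, seq in EQUIVALENT:
        for v in list(variants):
            variants.add(v.replace(seq, char))  # "ie" -> "ï"
            variants.add(v.replace(char, seq))  # "ï" -> "ie"
    return {normalize(v) for v in variants}

# A record containing "sïmon" is indexed under its mapped form.
indexed = normalize("sïmon")
print(indexed)
print(sorted(expand_query("siemon")))
```

In this model the query "siemon" expands to "siemon" and "sïmon", and the map rule then folds "sïmon" to "simon", which is how a plain "simon" record ends up in the result set.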

This really isn’t ideal. However, you can see why you’d want equivalences. In
Scandinavian languages, I think “å” and “aa” are roughly equivalent. They’re
spelled differently but they’re the same sound. So if you search for “Gaard”,
you might want hits for “Gård” as well.

But you might not want “career” to be equivalent to “carer” as they’re two different
words. Or “choose” to be equivalent to “chose”, “sloop” to “slop”, “reef” to
“ref”, etc.

Unfortunately, I don’t really know what the solution is. For one client, I’ve
disabled every equivalent directive that equates a two-letter combination
with a single letter, as they only have records in English, and it would just
cause them headaches.
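For anyone wanting to try the same thing, a change like that could be scripted roughly as follows. This is a sketch only: the function name disable_multichar_equivalents is made up, and it assumes the multi-letter equivalences always appear in parentheses on an "equivalent" line, as in the listing above. It comments the lines out with "#" rather than deleting them.

```python
import re

# Matches "equivalent" lines that contain a parenthesised sequence of two or
# more characters, e.g. "equivalent ü(ue)". Plain single-character lines such
# as "equivalent aáà" are left alone.
MULTI = re.compile(r"^\s*equivalent\s+\S*\([^)]{2,}\)")

def disable_multichar_equivalents(text):
    """Comment out multi-letter equivalences in charmap file contents."""
    out = []
    for line in text.splitlines():
        out.append("# " + line if MULTI.match(line) else line)
    return "\n".join(out)

sample = "equivalent aáà\nequivalent ü(ue)\nmap ï i"
print(disable_multichar_equivalents(sample))
```

Run it against a copy of the .chr file first and check the diff before replacing anything.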

I can see this being useful for multilingual records… although I think many
people with multilingual records use ICU. I don’t know ICU well enough to know
how it manages characters that English speakers would think of as accents or
ligatures. I know you can provide your own normalization with ICU, but I think
it does a fair amount on its own as well…

I think some of the difficulties are mentioned here:
http://userguide.icu-project.org/collation/icu-string-search-service. It also
mentions the Danish å/aa example. I don’t know how ICU would know how to handle
particular languages… that webpage seems to indicate you can provide a locale
to deal with it.

Of course, that doesn’t necessarily solve things. If you have multilingual
records with multilingual users, how do you choose your rules? Sure, you might
be able to specify a locale at search time (note you can’t do this with Zebra),
but what rules did you specify at index time?

As anyone who has watched this video
(https://www.youtube.com/watch?v=0j74jcxSunY) would know, internationalis(z)ing
code has many challenges…

Anyway, the reason for this email is mostly just to make you all aware of this
issue, and how “equivalent” and “map” work in the Charmap files when using CHR
indexing.

Oh, also, if you look at “default.idx”, you’ll see that “sort s” references
“charmap sort-string-utf.chr”, but I don’t think sort-string-utf.chr actually
exists anywhere…

David Cook

Systems Librarian

Prosentient Systems

72/330 Wattle St, Ultimo, NSW 2007

_______________________________________________
Koha-devel mailing list
[email protected]
http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/


--
Fridolin SOMERS
Biblibre - Pôles support et système
[email protected]