Ben, it looks to me like these links point to different problems.

1) The Russian ligature problem:[1] I've seen this before, and it may 
have to do with how the original data in the MARC records was coded. In 
the MARC character set, where a diacritic precedes the character it 
modifies, it was apparently unclear how to code the "double ligature" 
correctly, so these characters were often input with errors. I'm not 
sure normalizing to NFC will fix this. Have you tested this case?
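
A quick check with Python's unicodedata suggests NFC won't touch the 
ligature halves, because there is no precomposed form for them to 
compose into. (I'm assuming the records use U+FE20/U+FE21, the usual 
MARC-8 mapping for the ALA-LC double ligature -- I haven't looked at 
what the records in issue 150 actually contain.)

    # Sketch: does NFC change a romanization built from ligature halves?
    import unicodedata

    romanized = "t\ufe20s\ufe21"   # "t͡s" from combining ligature halves
    nfc = unicodedata.normalize("NFC", romanized)
    print(nfc == romanized)        # True -- NFC has nothing to compose here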

2) SOLR creating word breaks at (certain?) diacritics:[2] This seems to 
be a SOLR indexing problem, and I would hope there is a way to tell 
SOLR how to treat specific diacritics.
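
I can't speak to the analyzer chain OL's SOLR actually uses, but the 
symptom is what you get from any tokenizer that doesn't treat combining 
marks as word characters when the text is stored decomposed. A rough 
Python analogy (illustration only, not SOLR itself):

    # Illustration: a naive \w+ tokenizer splits decomposed text at the
    # combining mark; the NFC form tokenizes as one word.
    import re
    import unicodedata

    decomposed = "Mari\u0301a"   # "María" with a combining acute accent
    nfc = unicodedata.normalize("NFC", decomposed)
    print(re.findall(r"\w+", decomposed))   # ['Mari', 'a']
    print(re.findall(r"\w+", nfc))          # ['María']

If that is what is happening, NFC would help for characters that have 
precomposed forms, but not for marks like the ligature halves above.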

3) SOLR not allowing retrieval on accented and un-accented versions of 
the same word:[3] Again, I would hope that this is a SOLR setting of 
some sort. I don't know that it relates to NFC normalization -- were you 
able to determine that?
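
My understanding is that this is accent folding at index and query time 
rather than normalization of the stored data -- SOLR has folding 
filters for this (ASCII or ICU folding), though I haven't checked OL's 
schema. The underlying idea, sketched in Python:

    # Sketch of accent folding: decompose, then drop combining marks,
    # so "Böll" and "Boll" reduce to the same search key. This is
    # separate from storing the records as NFC.
    import unicodedata

    def fold(text):
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(c for c in decomposed
                       if not unicodedata.combining(c))

    print(fold("Böll") == fold("Boll"))   # True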

4) Not knowing much about how Unicode normalization works: a) can it 
be done on existing data, and b) does it mean that all of those records 
will need to be re-indexed (or is this just a display issue)?
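
For 4a, my understanding is that NFC can be applied to existing strings 
at any time and is idempotent, so re-running it over current data is 
safe in that sense. For 4b, I'd guess it depends on whether the indexed 
form differs from the normalized one. A rough way to gauge how many 
records would actually change (the record structure here is just a 
placeholder, not the real schema):

    # Sketch: count records with at least one string NFC would change.
    import unicodedata

    def needs_nfc(record):
        return any(isinstance(v, str) and
                   unicodedata.normalize("NFC", v) != v
                   for v in record.values())

    records = [{"title": "Jose\u0301"}, {"title": "José"}]
    print(sum(needs_nfc(r) for r in records))   # 1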

kc


[1] https://github.com/internetarchive/openlibrary/issues/150
[2] https://bugs.launchpad.net/openlibrary/+bug/389217
[3] http://www.mail-archive.com/[email protected]/msg00343.html


On 2/25/13 3:51 PM, Ben Companjen wrote:
> Hi,
>
> Some 8 months ago I submitted an issue [1] that included some standard
> library Python code to normalise unicode strings to NFC.
> This weekend I revisited the issue and tried to get a really simple
> test running on an Open Library dump file.
>
> Now, I'm/VacuumBot is far from ready to go and before I go on and
> eventually change perhaps half a million records, I'd like advice (or
> help) :)
> There have been multiple discussions on this topic over the years -
> the last one ended with Tom Morris writing "The problem is well
> characterized after years of discussion and the fix is simple. It just
> needs implementing." [2]
> I'm not so sure the fix is as simple as running
> unicodedata.normalize() over all records... Or is it?
>
> First of all: has anyone been doing anything similar?
> Then, the unicodedata library works with version 5.2 of the Unicode
> database. Does anyone know of problems that that may bring along
> (outdated normalisations maybe)?
> I don't think it will solve all combinations, like the ones in a
> related issue [3] and certainly not the badly imported ones. Wait a
> minute - was the 'simple fix' reimporting the badly imported records?
>
> Ben
>
> [1] https://github.com/internetarchive/openlibrary/issues/149
> [2] http://www.mail-archive.com/[email protected]/msg00677.html
> [3] https://github.com/internetarchive/openlibrary/issues/150

-- 
Karen Coyle
[email protected] http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to 
[email protected]
