Hi,

About 8 months ago I submitted an issue [1] that included some standard-library Python code to normalise Unicode strings to NFC. This weekend I revisited the issue and tried to get a really simple test running on an Open Library dump file.
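To make that concrete, here is roughly the kind of test I mean - a minimal Python 3 sketch that counts strings in a dump that are not already in NFC. It assumes the usual tab-separated dump format with the record JSON in the last column; the file name is just a placeholder:

    import json
    import unicodedata

    def is_nfc(s):
        # A string is in NFC exactly when normalising it is a no-op.
        return unicodedata.normalize('NFC', s) == s

    def non_nfc_strings(value):
        # Walk a decoded JSON value and yield every string that is not NFC.
        if isinstance(value, str):
            if not is_nfc(value):
                yield value
        elif isinstance(value, dict):
            for v in value.values():
                yield from non_nfc_strings(v)
        elif isinstance(value, list):
            for v in value:
                yield from non_nfc_strings(v)

    count = 0
    with open('ol_dump_editions.txt', encoding='utf-8') as dump:
        for line in dump:
            # Dump lines are tab-separated, with the record JSON last.
            record = json.loads(line.rsplit('\t', 1)[-1])
            count += sum(1 for _ in non_nfc_strings(record))
    print(count, 'non-NFC strings found')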
Now, I'm/VacuumBot is far from ready to go, and before I go on and eventually change perhaps half a million records, I'd like advice (or help) :)

There have been multiple discussions on this topic over the years - the last one ended with Tom Morris writing "The problem is well characterized after years of discussion and the fix is simple. It just needs implementing." [2] I'm not so sure the fix is as simple as running unicodedata.normalize() over all records... Or is it?

First of all: has anyone been doing anything similar? Then, the unicodedata library works with version 5.2 of the Unicode database. Does anyone know of problems that might bring along (outdated normalisations, maybe)?
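For what it's worth, the database version is exposed at runtime, and the module also ships a frozen copy of the Unicode 3.2.0 tables, so a rough sketch for spot-checking whether a given string normalises differently across database versions could look like this:

    import unicodedata

    # The Unicode database version this interpreter was built against.
    print(unicodedata.unidata_version)

    def nfc_changed_since_3_2(s):
        # unicodedata.ucd_3_2_0 is a frozen copy of the Unicode 3.2.0
        # tables, so comparing its NFC output with the current one at
        # least flags strings whose normalisation has been unstable.
        return (unicodedata.ucd_3_2_0.normalize('NFC', s)
                != unicodedata.normalize('NFC', s))

That doesn't prove 5.2 is current enough, but it would at least flag strings whose NFC form has changed between database versions.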
I don't think it will solve all combinations, like the ones in a related issue [3], and certainly not the badly imported ones. Wait a minute - was the 'simple fix' reimporting the badly imported records?

Ben

[1] https://github.com/internetarchive/openlibrary/issues/149
[2] http://www.mail-archive.com/[email protected]/msg00677.html
[3] https://github.com/internetarchive/openlibrary/issues/150