Hi,

About 8 months ago I submitted an issue [1] that included some standard-library Python code to normalise Unicode strings to NFC. This weekend I revisited the issue and tried to get a really simple test running on an Open Library dump file.
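To make that concrete, here is roughly the kind of test I mean - a minimal Python 3 sketch that counts strings in a dump that are not already in NFC. It assumes the usual tab-separated dump format with the record JSON in the last column; the file name is just a placeholder:

    import json
    import unicodedata

    def is_nfc(s):
        # A string is in NFC exactly when normalising it is a no-op.
        return unicodedata.normalize('NFC', s) == s

    def non_nfc_strings(value):
        # Walk a decoded JSON value and yield every string that is not NFC.
        if isinstance(value, str):
            if not is_nfc(value):
                yield value
        elif isinstance(value, dict):
            for v in value.values():
                yield from non_nfc_strings(v)
        elif isinstance(value, list):
            for v in value:
                yield from non_nfc_strings(v)

    count = 0
    with open('ol_dump_editions.txt', encoding='utf-8') as dump:
        for line in dump:
            # Dump lines are tab-separated, with the record JSON last.
            record = json.loads(line.rsplit('\t', 1)[-1])
            count += sum(1 for _ in non_nfc_strings(record))
    print(count, 'non-NFC strings found')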
Now, I'm/VacuumBot is far from ready to go, and before I go on and eventually change perhaps half a million records, I'd like advice (or help) :)

There have been multiple discussions on this topic over the years - the last one ended with Tom Morris writing "The problem is well characterized after years of discussion and the fix is simple. It just needs implementing." [2] I'm not so sure the fix is as simple as running unicodedata.normalize() over all records... Or is it?

First of all: has anyone been doing anything similar? Then, the unicodedata library works with version 5.2 of the Unicode database. Does anyone know of problems that might bring along (outdated normalisations, maybe)?
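For what it's worth, the database version is exposed at runtime, and the module also ships a frozen copy of the Unicode 3.2.0 tables, so a rough sketch for spot-checking whether a given string normalises differently across database versions could look like this:

    import unicodedata

    # The Unicode database version this interpreter was built against.
    print(unicodedata.unidata_version)

    def nfc_changed_since_3_2(s):
        # unicodedata.ucd_3_2_0 is a frozen copy of the Unicode 3.2.0
        # tables, so comparing its NFC output with the current one at
        # least flags strings whose normalisation has been unstable.
        return (unicodedata.ucd_3_2_0.normalize('NFC', s)
                != unicodedata.normalize('NFC', s))

That doesn't prove 5.2 is current enough, but it would at least flag strings whose NFC form has changed between database versions.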
I don't think it will solve all combinations, like the ones in a related issue [3], and certainly not the badly imported ones. Wait a minute - was the 'simple fix' reimporting the badly imported records?

Ben

[1] https://github.com/internetarchive/openlibrary/issues/149
[2] http://www.mail-archive.com/[email protected]/msg00677.html
[3] https://github.com/internetarchive/openlibrary/issues/150