On 19 May 2012 03:49, Anand Chitipothu <[email protected]> wrote:
> [snip]
>> Although both are valid, I think not normalizing is an invitation for
>> confusion. If it's different from the source, perhaps the import bot
>> was normalizing at one point, but was using Normalization Form D
>> (NFD). I think Normalization Form C (NFC) is more natural for most
>> people (and processing systems) and have recommended Freebase adopt
>> it. I'd recommend OpenLibrary do the same.
>> http://unicode.org/reports/tr15/#Norm_Forms
>> [snip]
>>
>> MARC8 uses combining diacritics,
>> http://www.loc.gov/marc/specifications/speccharmarc8.html#combine
>> so it's not too surprising that a direct translation would yield
>> the same in Unicode, but I'd suggest that it's better to combine them
>> into their NFC form.
>
> OL converts data to NFC normalized form in many places.
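Just to make the quoted point concrete, here is a minimal sketch of the
NFD/NFC difference using nothing but Python's standard unicodedata module
(Python 2 syntax); the name is only an example:

import unicodedata

# NFD: a plain "e" followed by U+0301 COMBINING ACUTE ACCENT, which is what
# a character-by-character MARC8 translation tends to produce.
nfd_name = u"Jose\u0301"

# NFC folds the base letter and the combining mark into the single
# precomposed code point U+00E9.
nfc_name = unicodedata.normalize("NFC", nfd_name)

print repr(nfd_name), len(nfd_name)   # u'Jose\u0301' 5
print repr(nfc_name), len(nfc_name)   # u'Jos\xe9' 4

# Normalizing back to NFD recovers the decomposed form, so nothing is lost
# in either direction.
assert unicodedata.normalize("NFD", nfc_name) == nfd_name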
I just realised that via the author search, or the normal search box, you
can search for strings containing wildcards. An author search for *́* (an
asterisk, a combining acute accent U+0301, and another asterisk) yields
202,887 hits, so there are quite a lot of records where normalisation did
not happen.

The "good" thing is that most, if not all, of these characters can be
replaced by their NFC counterparts, because the NFD sequences can be
located. The bad thing (more of a glitch, if I'm correct) is that one has
to scrape author IDs from the search result pages, because the API offers
no wildcard search. I noticed AMillarBot was replacing/correcting missing
umlauts, so perhaps some of the code is already there.

Ben
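P.S. In case it saves someone a few minutes, a rough sketch of what the
clean-up pass might look like once the author IDs have been scraped. I'm
assuming here that http://openlibrary.org/authors/<ID>.json returns the
record with a "name" field; the key below is just a placeholder, and the
actual writes would of course have to go through a bot account:

import json
import unicodedata
import urllib2

author_keys = ["OL23919A"]  # placeholder: fill in with the scraped IDs

for key in author_keys:
    # (for a couple of hundred thousand records you would obviously want
    # to batch and throttle this)
    url = "http://openlibrary.org/authors/%s.json" % key
    record = json.load(urllib2.urlopen(url))
    name = record.get("name", u"")

    fixed = unicodedata.normalize("NFC", name)
    if fixed != name:
        # This record would need an edit; the save step is left out here
        # on purpose, since it belongs in a proper bot run.
        print key, repr(name), "->", repr(fixed)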
