On 18-May-2012, at 10:54 PM, Tom Morris wrote:
> On Fri, May 18, 2012 at 6:51 AM, Ben Companjen <[email protected]> wrote:
>
>> I have noticed a couple of times that accents in names seem to be
>> disconnected from the letters. It may depend on the font and / or
>> rendering whether you see it, but when I look at
>> <http://openlibrary.org/authors/OL5264776A/Barcynska_He%CC%81le%CC%80ne_Countess.>,
>> the accents seem to float over the letters, a little to the right. (I
>> see the escaped URI does take the letter and the accent apart...)
>>
>> Compare: Hélène (copied from OL) and Hélène (typed myself)
>>
>> Are these imported from a 'bad' source? This example was imported from
>> Talis, but the specific record
>> <http://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:1446177421:664>
>> shows the correct symbols. Or does/did ImportBot create these separate
>> accents?
>
> Both those strings render the same for me, so the rendering issue
> sounds like a problem with whatever rendering system you use.
>
> Having said that, there are multiple ways to create accented
> characters in Unicode. There are single code points which have the
> base letter and accent pre-composed and there are separate accent code
> points that can be combined with the base letter from a different code
> point to create the character.
>
> Although both are valid, I think not normalizing is an invitation for
> confusion. If it's different from the source, perhaps the import bot
> was normalizing at one point, but was using Normalization Form D
> (NFD). I think Normalization Form C (NFC) is more natural for most
> people (and processing systems) and have recommended Freebase adopt
> it. I'd recommend OpenLibrary do the same.
> http://unicode.org/reports/tr15/#Norm_Forms
>
> Actually, looking at the raw record more closely,
> http://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:1446177421:664
> its character encoding is MARC8 (space character in position 9 of leader)
> http://www.loc.gov/marc/bibliographic/bdleader.html
> so something is converting it to Unicode for the web rendering and
> it's obviously doing it differently than the importer did.
>
> MARC8 uses combining diacritics,
> http://www.loc.gov/marc/specifications/speccharmarc8.html#combine
> so it's not too surprising that a direct translation would yield
> the same in Unicode, but I'd suggest that it's better to combine them
> into their NFC form.
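
The two representations discussed above can be illustrated with a minimal Python sketch using the standard unicodedata module. The variable and function names here are illustrative only and are not taken from the Open Library code; the sketch just shows that the precomposed and combining spellings of the same name compare unequal until both are normalized to the same form.

```python
import unicodedata

# The same name spelled two ways (names below are hypothetical, for illustration).
# Precomposed: a single code point per accented letter.
composed = "H\u00e9l\u00e8ne"       # U+00E9 (e-acute), U+00E8 (e-grave)
# Decomposed: base letter followed by a combining accent, which is what
# the escaped URL (He%CC%81le%CC%80ne) contains.
decomposed = "He\u0301le\u0300ne"   # e + U+0301, e + U+0300


def to_nfc(text: str) -> str:
    """Return text in Normalization Form C (accents composed where possible)."""
    return unicodedata.normalize("NFC", text)


def to_nfd(text: str) -> str:
    """Return text in Normalization Form D (accents split into combining marks)."""
    return unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    # The raw strings differ code point by code point.
    print(composed == decomposed)
    print(len(composed), len(decomposed))
    # After normalizing both to one form, they compare equal.
    print(to_nfc(composed) == to_nfc(decomposed))
```

Normalizing on input (for example at import time) and again before comparison or lookup keeps both spellings from being stored as distinct values.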
OL converts data to NFC normalized form in many places.

Anand
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to [email protected]
