On Fri, May 18, 2012 at 9:49 PM, Anand Chitipothu <[email protected]> wrote:
> On 18-May-2012, at 10:54 PM, Tom Morris wrote:
> > On Fri, May 18, 2012 at 6:51 AM, Ben Companjen <[email protected]>
> > wrote:
> >
> >> I have noticed a couple of times that accents in names seem to be
> >> disconnected from the letters. It may depend on the font and / or
> >> rendering whether you see it, but when I look at
> >>
> >> <http://openlibrary.org/authors/OL5264776A/Barcynska_He%CC%81le%CC%80ne_Countess.>,
> >> the accents seem to float over the letters, a little to the right. (I
> >> see the escaped URI does take the letter and the accent apart...)
> >>
> >> Compare: Hélène (copied from OL) and Hélène (typed myself)
> >>
> >> Are these imported from a 'bad' source? This example was imported from
> >> Talis, but the specific record
> >>
> >> <http://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:1446177421:664>
> >> shows the correct symbols. Or does/did ImportBot create these separate
> >> accents?
> >
> > Both those strings render the same for me, so the rendering issue
> > sounds like a problem with whatever rendering system you use.
> >
> > Having said that, there are multiple ways to create accented
> > characters in Unicode. There are single code points which have the
> > base letter and accent pre-composed and there are separate accent code
> > points that can be combined with the base letter from a different code
> > point to create the character.
> >
> > Although both are valid, I think not normalizing is an invitation for
> > confusion. If it's different from the source, perhaps the import bot
> > was normalizing at one point, but was using Normalization Form D
> > (NFD). I think Normalization Form C (NFC) is more natural for most
> > people (and processing systems) and have recommended Freebase adopt
> > it. I'd recommend OpenLibrary do the same.
> > http://unicode.org/reports/tr15/#Norm_Forms
> >
> > Actually, looking at the raw record more closely,
> >
> > http://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:1446177421:664
> > its character encoding is MARC8 (space character in position 9 of
> > leader)
> > http://www.loc.gov/marc/bibliographic/bdleader.html
> > so something is converting it to Unicode for the web rendering and
> > it's obviously doing it differently than the importer did.
> >
> > MARC8 uses combining diacritics,
> > http://www.loc.gov/marc/specifications/speccharmarc8.html#combine
> > so it's not too surprising that a direct translation would yield
> > the same in Unicode, but I'd suggest that it's better to combine them
> > into their NFC form.
>
> OL converts data to NFC normalized form in many places.
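
For reference, the difference between the two normalization forms discussed above can be reproduced with Python's standard unicodedata module. This is only a sketch: the to_nfc helper is a hypothetical name, not something taken from the OL importer, and the decomposed string is my assumption of what a direct MARC8-to-Unicode translation of this name would produce.

```python
import unicodedata

# Assumed result of a direct MARC8 -> Unicode translation: each base
# letter is followed by a separate combining accent (this is NFD).
decomposed = "He\u0301le\u0300ne"


def to_nfc(text):
    """Hypothetical import-time helper: return text in NFC."""
    return unicodedata.normalize("NFC", text)


# NFC folds each base letter + combining accent into one code point.
composed = to_nfc(decomposed)

print(len(decomposed), len(composed))           # 8 6
print(composed == "H\u00e9l\u00e8ne")           # True
print([hex(ord(c)) for c in decomposed])        # code points before
print([hex(ord(c)) for c in composed])          # code points after
```

Normalizing is lossless in both directions for this name, so applying something like to_nfc to already-normalized data should be harmless.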

I guess the real question is: does OL convert to NFC on *all* import and
input operations? That is what is needed, in my opinion. If it does now
but didn't previously, we probably need to spawn a cleanup task to fix up
any bad names from earlier imports.

The alternate names for the entry that Ben highlighted show issues
beyond just NFC vs. NFD:

  Barcynska, He le ne Countess.
  Barcynska, Hel̇e`ne Countess.
  Barcynska, Hélene Countess.

The first form has converted the combining accents to spaces. The
second form has converted the combining accents into a) a combining dot
above (U+0307) and b) a spacing (not combining) grave accent (U+0060).

These problems may have been introduced upstream and, if so, won't be
fixable because too much information has been lost. If they were
introduced on import to OpenLibrary, though, they could be fixed by
re-examining and reconverting the source record.

Is there any way of finding the source for the alternate names (or what
other author records Ben merged) to figure out where the problem was
introduced?

Tom
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to [email protected]
