On Fri, May 18, 2012 at 6:51 AM, Ben Companjen <[email protected]> wrote:

> I have noticed a couple of times that accents in names seem to be
> disconnected from the letters. It may depend on the font and / or
> rendering whether you see it, but when I look at
> <http://openlibrary.org/authors/OL5264776A/Barcynska_He%CC%81le%CC%80ne_Countess.>,
> the accents seem to float over the letters, a little to the right. (I
> see the escaped URI does take the letter and the accent apart...)
>
> Compare: Hélène (copied from OL) and Hélène (typed myself)
>
> Are these imported from a 'bad' source? This example was imported from
> Talis, but the specific record
> <http://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:1446177421:664>
> shows the correct symbols. Or does/did ImportBot create these separate
> accents?

Both those strings render the same for me, so the rendering issue
sounds like a problem with whatever rendering system you use.

Having said that, there are multiple ways to create accented
characters in Unicode.  There are single code points which have the
base letter and accent pre-composed and there are separate accent code
points that can be combined with the base letter from a different code
point to create the character.
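
A minimal Python sketch of the two representations (illustrative only; the variable names are made up for this example). The UTF-8 bytes of the combining accent are the same `%CC%81` that shows up in the escaped URI quoted above:

```python
# Two ways to spell "é" in Unicode.
precomposed = "\u00e9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # base letter "e" + COMBINING ACUTE ACCENT

# They look alike when rendered well, but they are different strings.
print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False

# UTF-8 bytes of the decomposed form: "e" followed by 0xCC 0x81.
decomposed_bytes = decomposed.encode("utf-8")
print(decomposed_bytes)                    # b'e\xcc\x81'
```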

Although both are valid, I think not normalizing is an invitation for
confusion.  If it's different from the source, perhaps the import bot
was normalizing at one point, but was using Normalization Form D
(NFD).  I think Normalization Form C (NFC) is more natural for most
people (and processing systems) and have recommended Freebase adopt
it.  I'd recommend OpenLibrary do the same.
http://unicode.org/reports/tr15/#Norm_Forms
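
As a rough sketch of what normalizing to NFC could look like, here is a Python example using the standard library's `unicodedata` module. The helper name `to_nfc` is hypothetical and is not taken from the OpenLibrary or Freebase code:

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Return text in Normalization Form C (composed where possible)."""
    return unicodedata.normalize("NFC", text)

# "Hélène" spelled with separate combining accents, as NFD would produce it.
name_nfd = "He\u0301le\u0300ne"
name_nfc = to_nfc(name_nfd)

print(len(name_nfd), len(name_nfc))  # 8 6
# Converting back to NFD gives the decomposed spelling again.
round_trip = unicodedata.normalize("NFD", name_nfc)
```

Normalizing both sides before comparing or indexing strings avoids treating the two spellings as different names.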

Actually, looking at the raw record more closely,
http://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:1446177421:664
its character encoding is MARC8 (space character in position 9 of leader)
http://www.loc.gov/marc/bibliographic/bdleader.html
so something is converting it to Unicode for the web rendering and
it's obviously doing it differently than the importer did.
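
The leader check can be sketched in Python as follows. This assumes the leader is already available as a 24-character string; the function name and both sample leaders are made up for illustration and are not copied from the record above:

```python
def marc_character_coding(leader: str) -> str:
    """Report the character coding scheme named in leader position 9."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is 24 characters long")
    code = leader[9]  # positions are counted from 0
    if code == " ":
        return "MARC-8"
    if code == "a":
        return "UCS/Unicode"
    return "unknown"

# Hypothetical leaders that differ only in position 9.
marc8_leader = "00664cam  2200205 a 4500"
unicode_leader = "00664cam a2200205 a 4500"

print(marc_character_coding(marc8_leader))    # MARC-8
print(marc_character_coding(unicode_leader))  # UCS/Unicode
```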

MARC8 uses combining diacritics,
http://www.loc.gov/marc/specifications/speccharmarc8.html#combine
so it's not too surprising that a direct translation would yield
the same in Unicode, but I'd suggest that it's better to combine them
into their NFC form.
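
A toy Python sketch of that idea, under loud assumptions: it handles only ASCII plus two MARC-8 combining marks (grave as 0xE1 and acute as 0xE2, which is my reading of the MARC-8 tables), it assumes the mark precedes its base letter in MARC-8, and it ignores character-set escape sequences entirely. The function name is hypothetical and this is not how ImportBot or any real converter is known to work:

```python
import unicodedata

# Assumed subset of MARC-8 combining diacritics -> Unicode combining marks.
MARC8_COMBINING = {0xE1: "\u0300", 0xE2: "\u0301"}  # grave, acute

def toy_marc8_to_nfc(data: bytes) -> str:
    """Decode a tiny MARC-8 subset, then compose the result with NFC."""
    out = []
    pending = []  # marks seen before their base letter
    for byte in data:
        if byte in MARC8_COMBINING:
            pending.append(MARC8_COMBINING[byte])
        elif byte < 0x80:
            out.append(chr(byte))
            out.extend(pending)  # Unicode wants marks after the base letter
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} is not handled by this sketch")
    out.extend(pending)  # keep any stray marks rather than dropping them
    return unicodedata.normalize("NFC", "".join(out))

# Assumed MARC-8 bytes for "Hélène": each accent byte precedes its "e".
raw = b"H\xe2el\xe1ene"
name = toy_marc8_to_nfc(raw)
```

Skipping the final `normalize` call would leave the decomposed spelling, which matches the separated accents seen in the escaped URI.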

Tom
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to 
[email protected]
