Re: [ol-tech] Accents in author names

Anand Chitipothu Fri, 18 May 2012 18:49:43 -0700

On 18-May-2012, at 10:54 PM, Tom Morris wrote:

> On Fri, May 18, 2012 at 6:51 AM, Ben Companjen <[email protected]> wrote:
> 
>> I have noticed a couple of times that accents in names seem to be
>> disconnected from the letters. It may depend on the font and / or
>> rendering whether you see it, but when I look at
>> <http://openlibrary.org/authors/OL5264776A/Barcynska_He%CC%81le%CC%80ne_Countess.>,
>> the accents seem to float over the letters, a little to the right. (I
>> see the escaped URI does take the letter and the accent apart...)
>> 
>> Compare: Hélène (copied from OL) and Hélène (typed myself)
>> 
>> Are these imported from a 'bad' source? This example was imported from
>> Talis, but the specific record
>> <http://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:1446177421:664>
>> shows the correct symbols. Or does/did ImportBot create these separate
>> accents?
> 
> Both those strings render the same for me, so the rendering issue
> sounds like a problem with whatever rendering system you use.
> 
> Having said that, there are multiple ways to create accented
> characters in Unicode.  There are single code points which have the
> base letter and accent pre-composed and there are separate accent code
> points that can be combined with the base letter from a different code
> point to create the character.
> 
> Although both are valid, I think not normalizing is an invitation for
> confusion.  If it's different from the source, perhaps the import bot
> was normalizing at one point, but was using Normalization Form D
> (NFD).  I think Normalization Form C (NFC) is more natural for most
> people (and processing systems) and have recommended Freebase adopt
> it.  I'd recommend OpenLibrary do the same.
> http://unicode.org/reports/tr15/#Norm_Forms
> 
> Actually, looking at the raw record more closely,
> http://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:1446177421:664
> its character encoding is MARC8 (space character in position 9 of leader)
> http://www.loc.gov/marc/bibliographic/bdleader.html
> so something is converting it to Unicode for the web rendering and
> it's obviously doing it differently than the importer did.
> 
> MARC8 uses combining diacritics,
> http://www.loc.gov/marc/specifications/speccharmarc8.html#combine
> so it's not too surprising that a direct translation would yield
> the same in Unicode, but I'd suggest that it's better to combine them
> into their NFC form.


OL converts data to NFC normalized form in many places. 
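
For illustration, here is a minimal sketch of what NFC normalization does to
the name from the example URL. It uses Python's standard unicodedata module;
the helper names to_nfc and to_nfd are made up for this example, and this is
not necessarily how or where OL applies the normalization.

```python
import unicodedata

# "Hélène" with combining accents: base letter followed by U+0301 (acute)
# or U+0300 (grave), matching the escaped URI form He%CC%81le%CC%80ne.
decomposed = "He\u0301le\u0300ne"

# The same name with precomposed letters U+00E9 and U+00E8.
precomposed = "H\u00e9l\u00e8ne"


def to_nfc(text):
    """Compose base letters and combining marks where possible (NFC)."""
    return unicodedata.normalize("NFC", text)


def to_nfd(text):
    """Split precomposed letters into base letter plus combining marks (NFD)."""
    return unicodedata.normalize("NFD", text)


# The two spellings look alike but compare unequal until normalized.
print(decomposed == precomposed)           # False
print(to_nfc(decomposed) == precomposed)   # True
print(len(decomposed), len(to_nfc(decomposed)))  # 8 6
```

Normalizing both stored values and incoming queries to the same form means
the two spellings compare equal regardless of how the source encoded them.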

Anand
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to 
[email protected]
