Yep, I'd bet money it's a Unicode normalization form issue.

Browsers sometimes have trouble displaying Unicode characters in certain 
normalization forms. I forget which form causes the trouble, but you 
almost certainly want to normalize incoming Unicode using one of the 
standard Unicode normalization algorithms (as provided, for instance, by 
the ICU library).
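
To make that concrete, here's a minimal sketch. It uses Python's 
standard unicodedata module rather than ICU (my substitution, just 
because it needs no install), and the variable names are only for 
illustration. The name is the one from Ben's example below.

```python
import unicodedata

# "Hélène" with precomposed letters (the NFC form).
composed = "H\u00e9l\u00e8ne"
# The same name as base letters followed by combining accents (the NFD
# form), matching the He%CC%81le%CC%80ne in the Open Library URL.
decomposed = "He\u0301le\u0300ne"

# They look alike but are different code point sequences.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 6 8

# Normalizing to a single form makes them compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)    # True
print(nfd == decomposed)  # True
```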

This is going to affect not only display, but also any kind of search 
you provide. You don't want a false negative because the query form 
didn't match the stored form, so you've got to normalize on indexing and 
normalize queries too.
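
A rough sketch of the search problem, assuming a toy dict as the 
"index". The helper norm_key and the shortened name string are made up 
for illustration. A real search engine would do this in its analysis 
chain, but the idea is the same.

```python
import unicodedata

def norm_key(text):
    # Hypothetical helper: one canonical form (NFC here) applied to both
    # stored values and incoming queries.
    return unicodedata.normalize("NFC", text)

stored_name = "Barcynska, He\u0301le\u0300ne"  # decomposed, as imported
query = "Barcynska, H\u00e9l\u00e8ne"          # precomposed, as typed

# Index built without normalization: the lookup misses.
raw_index = {stored_name: "OL5264776A"}
print(raw_index.get(query))        # None

# Index built with normalization, queried with normalization: it hits.
index = {norm_key(stored_name): "OL5264776A"}
print(index.get(norm_key(query)))  # OL5264776A
```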

You may need to use a different normalization form for display than 
for indexing.
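
For example (a sketch only, the choice of forms here is my assumption, 
not something Open Library does): NFC for display, and NFKC plus case 
folding as a looser key for indexing. The function names and the sample 
title are made up.

```python
import unicodedata

def for_display(text):
    # Canonical composition: keeps the text as written, just with
    # precomposed characters where they exist.
    return unicodedata.normalize("NFC", text)

def for_index(text):
    # Compatibility composition plus case folding: a looser match key.
    return unicodedata.normalize("NFKC", text).casefold()

# Sample title with an "fi" ligature (U+FB01) and combining accents.
title = "The \ufb01rst He\u0301le\u0300ne"

# Display form keeps the ligature but composes the accents.
print(for_display(title) == "The \ufb01rst H\u00e9l\u00e8ne")       # True
# Index form also expands the ligature and lowercases.
print(for_index(title) == "the first h\u00e9l\u00e8ne")             # True
print(for_index("THE FIRST H\u00c9L\u00c8NE") == for_index(title))  # True
```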

You've definitely got to understand and deal with normalization when 
working with Unicode; there's no way around it.  This Unicode standard 
doc is a bit technical, but pretty good: http://unicode.org/reports/tr15/

On 5/18/2012 5:17 PM, Ben Companjen wrote:
> Hi Tom,
>
> On 18 May 2012 19:24, Tom Morris<[email protected]>  wrote:
>> On Fri, May 18, 2012 at 6:51 AM, Ben Companjen<[email protected]>  
>> wrote:
>>
>>> I have noticed a couple of times that accents in names seem to be
>>> disconnected from the letters. It may depend on the font and / or
>>> rendering whether you see it, but when I look at
>>> <http://openlibrary.org/authors/OL5264776A/Barcynska_He%CC%81le%CC%80ne_Countess.>,
>>> the accents seem to float over the letters, a little to the right. (I
>>> see the escaped URI does take the letter and the accent apart...)
>>>
>>> Compare: Hélène (copied from OL) and Hélène (typed myself)
>>>
>>> Are these imported from a 'bad' source? This example was imported from
>>> Talis, but the specific record
>>> <http://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:1446177421:664>
>>> shows the correct symbols. Or does/did ImportBot create these separate
>>> accents?
>>
>> Both those strings render the same for me, so the rendering issue
>> sounds like a problem with whatever rendering system you use.
>
> That's interesting. I was looking at the strings in Firefox 12 on
> Windows 7. Whereas these show minor differences in distance to the
> letter and shape, some other times the accents are almost completely
> over the character to the right... Attached is how I see this example
> on the Open Library website and in the source.
> I may file a report with Firefox on this, if that is the source of the
> problems. Notepad++ and LibreOffice render the combined characters
> correctly.
>
>>
>> Having said that, there are multiple ways to create accented
>> characters in Unicode.  There are single code points which have the
>> base letter and accent pre-composed and there are separate accent code
>> points that can be combined with the base letter from a different code
>> point to create the character.
>>
>> Although both are valid, I think not normalizing is an invitation for
>> confusion.  If it's different from the source, perhaps the import bot
>> was normalizing at one point, but was using Normalization Form D
>> (NFD).  I think Normalization Form C (NFC) is more natural for most
>> people (and processing systems) and have recommended Freebase adopt
>> it.  I'd recommend OpenLibrary do the same.
>> http://unicode.org/reports/tr15/#Norm_Forms
>>
>> Actually, looking at the raw record more closely,
>> http://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:1446177421:664
>> its character encoding is MARC8 (space character in position 9 of leader)
>> http://www.loc.gov/marc/bibliographic/bdleader.html
>> so something is converting it to Unicode for the web rendering and
>> it's obviously doing it differently than the importer did.
>>
>> MARC8 uses combining diacritics,
>> http://www.loc.gov/marc/specifications/speccharmarc8.html#combine
>> so it's not too surprising that a direct translation would yield
>> the same in Unicode, but I'd suggest that it's better to combine them
>> into their NFC form.
>
> I hope the staff agree :)
>
> Thanks for clearing that up.
>
> Ben
>>
>> Tom
>> _______________________________________________
>> Ol-tech mailing list
>> [email protected]
>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
>> To unsubscribe from this mailing list, send email to 
>> [email protected]
