While looking through the old bugs at Launchpad, I came across
https://bugs.launchpad.net/openlibrary/+bug/598204 ("Normalize
unicode") which is exactly the 'issue' we were discussing here.

And there is a really simple solution in the same bug report, provided
by Edward Betts (well, by Python's standard library, actually):

from unicodedata import normalize

def norm(s):
    return normalize('NFC', s)
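
To make the effect concrete, here is a minimal sketch of what that function does to a decomposed string. It assumes Python 3 string semantics; the sample name and the needs_norm helper are made up for illustration and are not taken from the bug report or from OL's code.

```python
import unicodedata

def norm(s):
    """Return s in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize('NFC', s)

def needs_norm(s):
    """Hypothetical helper: True if s would change under NFC normalization."""
    return norm(s) != s

# Illustrative input: 'e' followed by U+0301 COMBINING ACUTE ACCENT (NFD style).
decomposed = 'Jose\u0301'
# After NFC the two code points are combined into the single U+00E9.
composed = norm(decomposed)

print(len(decomposed), len(composed))
print(needs_norm(decomposed), needs_norm(composed))
```

A check like needs_norm could be used to find values that still contain combining characters before rewriting them, but that is only one possible approach.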

I just created an issue on GitHub, as the LP bugs are not tracked anymore:
https://github.com/internetarchive/openlibrary/issues/149

It doesn't ask to reimport records to recreate missing letters yet;
that can perhaps be made a separate issue.

Ben

On 4 June 2012 01:11, Ben Companjen <[email protected]> wrote:
> On 19 May 2012 03:49, Anand Chitipothu <[email protected]> wrote:
>>
> [snip]
>>> Although both are valid, I think not normalizing is an invitation for
>>> confusion.  If it's different from the source, perhaps the import bot
>>> was normalizing at one point, but was using Normalization Form D
>>> (NFD).  I think Normalization Form C (NFC) is more natural for most
>>> people (and processing systems) and have recommended Freebase adopt
>>> it.  I'd recommend OpenLibrary do the same.
>>> http://unicode.org/reports/tr15/#Norm_Forms
>>>
> [snip]
>>>
>>> MARC8 uses combining diacritics,
>>> http://www.loc.gov/marc/specifications/speccharmarc8.html#combine
>>> so it's not too surprising that a direct translation would yield
>>> the same in Unicode, but I'd suggest that it's better to combine them
>>> into their NFC form.
>>
>> OL converts data to NFC normalized form in many places.
>
> I just realised that via the author search or normal search box, you
> can search for strings with wildcards. An author search for *́*
> (asterisk <acute> asterisk) yields 202,887 hits. So there are quite a
> lot of times normalisation did not happen.
>
> The "good" thing is: most, if not all, characters can be replaced by
> their NFC counterparts, because the NFD characters can be located.
> The bad thing (well, more of a glitch), if I'm correct, is that one
> has to scrape author IDs from these search pages, because there is no
> wildcard search in the API. I noticed AMillarBot was
> replacing/correcting missing Umlauts, so perhaps some of the code is
> already there.
>
> Ben
>>
>> Anand
>> _______________________________________________
>> Ol-tech mailing list
>> [email protected]
>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
>> To unsubscribe from this mailing list, send email to 
>> [email protected]