On 19 May 2012 03:49, Anand Chitipothu <[email protected]> wrote:
>
[snip]
>> Although both are valid, I think not normalizing is an invitation for
>> confusion.  If it's different from the source, perhaps the import bot
>> was normalizing at one point, but was using Normalization Form D
>> (NFD).  I think Normalization Form C (NFC) is more natural for most
>> people (and processing systems) and have recommended Freebase adopt
>> it.  I'd recommend OpenLibrary do the same.
>> http://unicode.org/reports/tr15/#Norm_Forms
>>
[snip]
>>
>> MARC8 uses combining diacritics,
>> http://www.loc.gov/marc/specifications/speccharmarc8.html#combine
>> so it's not too surprising that a direct translation would yield
>> the same in Unicode, but I'd suggest that it's better to combine them
>> into their NFC form.
>
> OL converts data to NFC normalized form in many places.
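
For anyone following along, the difference being discussed looks like
this in Python 3 (the name is only an example; unicodedata is in the
standard library):

import unicodedata

nfd = "Jose\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT,
                     # which is what a direct MARC8 translation gives
nfc = unicodedata.normalize("NFC", nfd)   # precomposed U+00E9: "José"

print(nfd, nfc)             # the two look identical on screen
print(nfd == nfc)           # False: different code point sequences
print(len(nfd), len(nfc))   # 5 4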

I just realised that, via the author search or the normal search box,
you can search for strings with wildcards. An author search for *́*
(asterisk <combining acute> asterisk) yields 202,887 hits, so there
are quite a lot of places where normalisation did not happen.
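
In other words, the wildcard is matching strings in which U+0301 is
stored as a character of its own. Assuming you already have a name in
hand, spotting the ones that need fixing is just a comparison against
the NFC form:

import unicodedata

def is_nfc(s):
    # True if the string is already NFC; False if it still contains
    # decomposed sequences such as 'e' + COMBINING ACUTE ACCENT.
    return unicodedata.normalize("NFC", s) == s

print(is_nfc("Jose\u0301"))   # False: needs fixing
print(is_nfc("Jos\u00e9"))    # True: already precomposed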

The "good" thing is that most, if not all, of these characters can be
replaced by their NFC counterparts, because the NFD sequences can be
located. The bad thing (more of a glitch, really, if I'm correct) is
that one has to scrape author IDs from these search pages, because the
API offers no wildcard search. I noticed AMillarBot was
replacing/correcting missing umlauts, so perhaps some of the code is
already there.
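
The fix itself is a one-liner per string; the fiddly bits are only
getting hold of the author IDs (the scraping mentioned above) and
writing the records back. A rough sketch (the scraping, loading and
saving functions below are placeholders, not actual OL client calls):

import unicodedata

def fix_author_name(name):
    # Replace NFD sequences with their NFC counterparts.
    # Safe to run repeatedly, since NFC normalisation is idempotent.
    return unicodedata.normalize("NFC", name)

# Hypothetical outer loop: get_author_ids_from_search() would be the
# scraping step, load_author()/save_author() the read and write-back.
# for author_id in get_author_ids_from_search("*\u0301*"):
#     record = load_author(author_id)
#     record["name"] = fix_author_name(record["name"])
#     save_author(record)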

Ben
>
> Anand