On May 30, 2012, at 3:36 AM, Ben Companjen wrote:

> On 29 May 2012 20:05, Lee Passey <[email protected]> wrote:
>> On 5/29/2012 8:18 AM, Jonathan Rochkind wrote:
>>> On 5/23/2012 10:54 PM, Tom Morris wrote:
>>>> I'd argue that there are (at least) two problems: 1) that byte
>>>> codes/codepoints aren't being normalized effectively and 2) rendering
>>>> systems are rendering improperly.
>>>> 
>>>> In my opinion, any reasonable rendering system should render the byte
>>>> codes for base character plus combining accent *exactly* the same as
>>>> the precombined form.
>>> 
>>> Okay, so which is more feasible to fix, OL's normalization, or every
>>> system out there that OL data might end up in, some of which don't
>>> render unicode properly in some (non-)normalized forms?
>>> 
>>> You can blame the world if you want, but you probably aren't going to be
>>> able to fix it.
>> 
>> Yes, but Mr. Morris points out two problems: 1. Open"Library" has
>> screwed up; 2. Browsers have screwed up.
>> 
>> It's true that we can do nothing about 2. Can't we at least do something
>> about 1?
> 
> This is almost a bug report: 
> https://support.mozilla.org/en-US/questions/928308
> Perhaps we can give the issue more importance if people who have the
> same issue click "I have this problem, too!"?
> 
> I agree that OL could use some extra normalization. Isn't there a bot
> for it yet? I assume a bot could look for character sequences
> involving separate accents and replace the sequence by the normalized
> character. It won't be able to guess what the strange whitespaces
> should have been, unfortunately.

I'm new to this discussion, but I would say "discount" what all the other 
software out in the wild for rendering text does, and stick to recognized 
methods for dealing with diacritical marks (and let's start by calling them 
what they are).
If necessary, provide two API access points so that people using the system 
can choose (dumps should use 1):
1. Precomposed characters (Unicode NFC)
2. Base character + combining diacritical (Unicode NFD)
But store internally in OL with 1.
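
For what it's worth, Python's standard unicodedata module already does the 
heavy lifting for both forms, so a normalizing bot or a dual API would mostly 
be a thin wrapper around it. A minimal sketch; the function names are mine:

    import unicodedata

    def to_storage_form(text):
        # Form 1: precomposed (NFC) -- e.g. 'e' + U+0308 becomes U+00EB
        return unicodedata.normalize('NFC', text)

    def to_decomposed_form(text):
        # Form 2: base character + combining diacritical (NFD)
        return unicodedata.normalize('NFD', text)

    # The two byte sequences compare equal once normalized:
    decomposed = 'Bronte\u0308'    # 'e' followed by combining diaeresis
    precomposed = 'Bront\u00eb'    # single precomposed 'e with diaeresis'
    assert to_storage_form(decomposed) == precomposed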

As regards bots, the OL bots are already partly responsible for data 
corruption. I spent a good few hours cleaning "nigga" and "fuck you" passages 
out of books, traceable back to bots that had picked up corrupted text on the 
internet and then merged it into OL.
The sad thing was that this had then propagated back down to all the other 
book indexes using the OL data.
Perhaps we could add a "staging" flag to books that a bot thinks are wrong, 
then have a web front end where a REGISTERED user can make the relevant merges.
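Roughly like this; the record fields and the 'staged' flag are made up, and 
the non-NFC check just stands in for whatever "looks wrong" test a bot 
actually runs:

    import unicodedata

    def needs_staging(record):
        # Stand-in heuristic: flag records whose text fields are not
        # already NFC-normalized, i.e. the bot found something to change.
        for field in ('title', 'subtitle', 'by_statement'):
            value = record.get(field)
            if value and unicodedata.normalize('NFC', value) != value:
                return True
        return False

    def run_bot(record):
        # Instead of merging automatically, park the record for review
        # by a registered user in the web front end.
        if needs_staging(record):
            record['staged'] = True   # hypothetical flag
        return record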

HC


