On Sat, December 11, 2010 5:24 pm, Jeulin-L Michael wrote:

[snip]
> I am now wondering how are you guy managing unicode characters from the json
> file ?
>
> For instance unicode characters in "Kha\u0304lid Muh\u0323ammad
> \u02bbAli\u0304 al-H\u0323a\u0304jj" doesn't make sens at all.

While JSON text can carry Unicode characters directly (normally as UTF-8), the
OL developers have chosen to encode non-ASCII characters using the "\u" escape
sequence, which is also allowed in JSON. "\u" followed by a four-digit
hexadecimal number represents a single Unicode character at the specified code
point (characters outside the Basic Multilingual Plane need a pair of such
escapes). So the acute 'e' that Mr. Millar was complaining about just a few
minutes ago should be encoded as "\u00e9" (decimal 233 is E9 in hexadecimal).
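
To illustrate the escape mechanism, here is a minimal Python sketch. It assumes
nothing about OL's own tooling; it only shows that a standard JSON parser turns
a "\u" escape back into the character, and that a standard serializer can
produce the escaped form again. The sample string is made up for illustration.

```python
import json

# A JSON string literal containing the escape for e-acute (U+00E9).
escaped = '"caf\\u00e9"'

# Any conforming JSON parser decodes the escape into one code point.
decoded = json.loads(escaped)
print(len(decoded))  # 4: the six-character escape became one character

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# which yields the same kind of "\u" output seen in the dump.
round_trip = json.dumps(decoded)
print(round_trip == escaped)
```

So if the escapes show up literally in your output, the likely cause is that
the file is being read as plain text rather than parsed as JSON.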

In your case the encoding can be a little confusing, because OL has used the
"Combining Diacritical Marks" block (range U+0300 to U+036F) [1]. These Unicode
"characters" are designed not to be used as standalone characters, but rather
as a means of modifying the /preceding/ character. "\u0304" is meaningless on
its own, but "a\u0304" means "the character 'a' combined with a macron over
it." Most accented Latin letters used in European languages can be represented
both as an ASCII letter followed by a combining diacritical mark and as a
single "precomposed" character. Because the two forms are different code point
sequences for the same visible text, it is important to perform Unicode
normalization [2] before comparing Unicode strings.
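
As a rough sketch of what that normalization looks like in practice, the
Python below decodes the name from your example and composes it with the
standard unicodedata module. The helper name same_text is my own invention for
illustration, and NFC is only one of the available normalization forms; pick
whichever suits your data.

```python
import json
import unicodedata

# The string from the question, as it appears in the JSON file.
raw = r'"Kha\u0304lid Muh\u0323ammad \u02bbAli\u0304 al-H\u0323a\u0304jj"'
name = json.loads(raw)

# NFC composes base letter + combining mark into a precomposed character
# wherever one exists, e.g. "a" + U+0304 becomes U+0101.
nfc = unicodedata.normalize("NFC", name)

def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(name == nfc)           # False: different code point sequences
print(same_text(name, nfc))  # True: same text once normalized
print(len(name) - len(nfc))  # 5: five combining marks were composed away
```

Note that "\u02bb" is a letter in its own right rather than a combining mark,
so normalization leaves it alone.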

This is, I believe, evidence of the continuing tension between "things as they
are" and "things as they appear to be" which plagues our attempts to digitize
texts. I can't comment on whether or not combining diacritical marks are the
best way to do Latin transliterations of Arabic names (personally, I would
have used "\u0101" instead of "a\u0304") but at least now you know what OL has
done, and can adjust for it if you wish.

[1] http://www.unicode.org/charts/PDF/U0300.pdf
[2] http://en.wikipedia.org/wiki/Unicode_normalization
