On Sat, December 11, 2010 5:24 pm, Jeulin-L Michael wrote:
[snip]
> I am now wondering how are you guy managing unicode characters from the json
> file ?
>
> For instance unicode characters in "Kha\u0304lid Muh\u0323ammad
> \u02bbAli\u0304 al-H\u0323a\u0304jj" doesn't make sens at all.

While JSON text can carry Unicode characters directly (typically as UTF-8), the OL developers have chosen to encode non-ASCII characters using the "\u" escape sequence, which JSON also allows. "\u" followed by a four-digit hexadecimal number represents a single Unicode character at the specified code point (characters outside the Basic Multilingual Plane need a pair of such escapes). Thus, the acute 'e' that Mr. Millar was complaining about just a few minutes ago should be encoded as "\u00e9".

In your case the encoding can be a little confusing, because OL has used the "Combining Diacritical Marks" block (range U+0300 to U+036F) [1]. These Unicode "characters" are designed not to be used as standalone characters, but rather as a means of modifying the /preceding/ character. "\u0304" is meaningless on its own, but "a\u0304" means "the character 'a' with a macron over it." Likewise, "h\u0323" is an 'h' with a dot below it. (The "\u02bb" in your example is different: it is a standalone modifier letter, the turned comma, not a combining mark.)

Most accented characters used in Latin-script European languages can be represented both as a plain ASCII letter followed by a combining diacritical mark and as a single "precomposed" character. Because the same visible text can therefore be stored as different code point sequences, it is important to perform Unicode normalization [2] before comparing Unicode strings. This is, I believe, evidence of the continuing tension between "things as they are" and "things as they appear to be" which plagues our attempts to digitize texts.

I can't comment on whether or not combining diacritical marks are the best way to do Latin transliterations of Arabic names (personally, I would have used "\u0101" instead of "a\u0304"), but at least now you know what OL has done, and can adjust for it if you wish.

[1] http://www.unicode.org/charts/PDF/U0300.pdf
[2] http://en.wikipedia.org/wiki/Unicode_normalization
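
To make the above concrete, here is a minimal sketch in Python (my choice of language; I don't know what your own tooling is). It decodes the escaped name from your message with the standard json module, then uses unicodedata.normalize to show that "a\u0304" and "\u0101" only compare equal after normalization. The helper name same_text is hypothetical, and NFC is just one of the normalization forms you could pick.

```python
import json
import unicodedata

# The JSON text as it appears in the file: pure ASCII with \u escapes.
raw = r'"Kha\u0304lid Muh\u0323ammad \u02bbAli\u0304 al-H\u0323a\u0304jj"'

# A JSON parser turns each \uXXXX escape into the character it names.
name = json.loads(raw)

# 'a' followed by U+0304 (combining macron): two code points.
decomposed = "a\u0304"
# U+0101 (a with macron): the precomposed form, one code point.
precomposed = "\u0101"


def same_text(left: str, right: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


# NFC folds each base letter + combining mark into its precomposed character
# wherever Unicode defines one.
nfc_name = unicodedata.normalize("NFC", name)

if __name__ == "__main__":
    print(decomposed == precomposed)           # raw comparison of code points
    print(same_text(decomposed, precomposed))  # comparison after normalization
    print(len(name), len(nfc_name))            # code point counts before/after
```

If you would rather work with decomposed text throughout, passing "NFD" instead of "NFC" goes the other way; what matters is that both sides of a comparison use the same form.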
