On 11/27/2013 08:53 AM, Jakob Ovrum wrote:
On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
I agree with the assertion that people SHOULD know how unicode works
if they want to work with it, but the way our docs are now is
off-putting enough that most probably won't learn anything. If they
know, they know; if they don't, the wall of jargon is intimidating
and hard to grasp (it needs more examples up front of the things
you'd actually use std.uni for). Even though I'm decently familiar
with Unicode, I was having trouble following all that (e.g., isn't
"noe\u0308l" a grapheme cluster according to std.uni?). On the flip
side, std.utf has a serious dearth of examples and the relationship
between the two isn't clear.
I thought it was nice that std.uni had a proper terminology section,
complete with links to Unicode documents to kick-start beginners to
Unicode. It mentions its relationship with std.utf right at the top.
Maybe the first paragraph is just too thin, and it's hard to see the
big picture. Maybe it should include a small leading paragraph
detailing the three levels of Unicode granularity that D/Phobos
chooses; arrays of code units -> ranges of code points -> std.uni for
graphemes and algorithms.
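To make those three levels concrete (a quick sketch, assuming current
Phobos behaviour), here is the "noe\u0308l" string from above at each
level; note that it contains a two-code-point grapheme cluster
("e\u0308") rather than being a single cluster:

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0308l"; // 'e' followed by U+0308 COMBINING DIAERESIS

    // Level 1: array of code units (UTF-8 for string).
    writeln(s.length);                // 6 code units

    // Level 2: range of code points (narrow strings auto-decode to dchar).
    writeln(s.walkLength);            // 5 code points

    // Level 3: std.uni graphemes -- "e\u0308" counts as one cluster.
    writeln(s.byGrapheme.walkLength); // 4 graphemes: n, o, e\u0308, l
}
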
Yes, please. While operations on single codepoints and characters
seem pretty robust (i.e. you can do lots of things with and to them),
it feels like it just falls apart when you try to work with strings.
It honestly surprised me how many things in std.uni don't seem to
work on ranges.
-Wyatt
Most string code is Unicode-correct as long as it works on code
points and all inputs are in the same normalization form; explicit
grapheme awareness is rarely a necessity. By that I mean the most
common string operations, such as searching, taking a substring,
etc., will work without any special grapheme decoding (beyond
normalization). The hiccups appear when code points are shuffled
around or their order is changed. Apart from these rare
string-manipulation cases, grapheme awareness is necessary for
rendering code.
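For instance (a sketch, assuming std.uni.normalize and NFC as in
current Phobos), a plain code-point search only lines up once both
sides are in the same normalization form:

import std.algorithm.searching : canFind;
import std.stdio : writeln;
import std.uni : NFC, normalize;

void main()
{
    string decomposed = "noe\u0308l"; // 'e' + COMBINING DIAERESIS (decomposed)
    string needle     = "\u00EB";     // precomposed 'e with diaeresis'

    // Code-point search misses: U+00EB never occurs in the decomposed string.
    writeln(decomposed.canFind(needle));                // false

    // Bring both sides to the same normalization form and it works.
    writeln(normalize!NFC(decomposed).canFind(needle)); // true
}
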
I would put things a bit more emphatically. The code point is
analogous to assembler, the character (grapheme) is analogous to a
high-level language, and the binary representation (the code units)
is analogous to machine code. The goal is to make characters easy to
use while keeping the cost low. To me this means that the high-level
language (i.e., D) should make it easy to deal with characters,
possible to deal with code points, and still let you deal with
binary representations if you really want to. (Also note that the
isomorphism between assembler and machine code is matched by an
isomorphism between code points and the binary representation.) To
do this cheaply, D needs to know which normalization form each
string is in. This is likely to cost one byte per string, unless
there's some slack in the current representation.
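Purely as a hypothetical sketch of that bookkeeping (NormString,
Norm and asNFC are invented names, nothing in Phobos), the extra
byte could look something like this:

import std.uni : NFC, normalize;

// One byte is enough to remember which normalization form a string is in.
enum Norm : ubyte { unknown, nfc, nfd, nfkc, nfkd }

struct NormString
{
    string payload;
    Norm form = Norm.unknown; // the one byte of bookkeeping

    // Return an NFC view, paying for normalization only when needed.
    string asNFC()
    {
        if (form == Norm.nfc)
            return payload;            // already normalized: free
        return normalize!NFC(payload); // normalize once, on demand
    }
}

unittest
{
    auto a = NormString("noe\u0308l");          // form unknown
    auto b = NormString("no\u00EBl", Norm.nfc); // known to be NFC already
    assert(a.asNFC == b.asNFC);
}
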
But is this worthwhile? It is the direction things will eventually
go, but that doesn't mean we need to push them there today. Still,
if D applied a default normalization during I/O operations, the
cost of that normalization would probably be lost in the impedance
matching between RAM and storage. (Again, however, any default
requires the ability to be overridden.) Also, of course, none of
this is of any significance for pure-ASCII text.
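As a hypothetical illustration of normalizing at the I/O boundary
(readNormalized is an invented helper, not a proposed Phobos API),
with the override and an ASCII short-circuit:

import std.algorithm.searching : all;
import std.ascii : isASCII;
import std.file : readText;
import std.uni : NFC, normalize;

// Read a text file, normalizing to NFC by default; the extra pass is
// small next to the I/O itself, and pure-ASCII input is returned as-is.
string readNormalized(string path, bool doNormalize = true)
{
    string text = readText(path);
    if (!doNormalize || text.all!isASCII)
        return text;
    return normalize!NFC(text);
}
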
--
Charles Hixson