On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
I agree with the assertion that people SHOULD know how Unicode
works if they want to work with it, but the way our docs are
now is off-putting enough that most probably won't learn
anything. If they know, they know; if they don't, the wall of
jargon is intimidating and hard to grasp (it would help to have
more examples up front of the things you'd actually use std.uni
for).
Even though I'm decently familiar with Unicode, I was having
trouble following all that (e.g. Isn't "noe\u0308l" a grapheme
cluster according to std.uni?). On the flip side, std.utf has
a serious dearth of examples and the relationship between the
two isn't clear.
I thought it was nice that std.uni had a proper terminology
section, complete with links to Unicode documents to kick-start
beginners to Unicode. It mentions its relationship with std.utf
right at the top.
Maybe the first paragraph is just too thin, and it's hard to see
the big picture. Maybe it should include a small leading
paragraph detailing the three levels of Unicode granularity that
D/Phobos chooses: arrays of code units -> ranges of code points
-> std.uni for graphemes and algorithms.
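Something along these lines, as a rough and untested sketch,
would make the three levels concrete (using the
combining-diaeresis spelling of "noël"):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // "noël" spelled as n, o, e, U+0308 (combining diaeresis), l
    string s = "noe\u0308l";

    writeln(s.length);                // 6 -- UTF-8 code units (array of char)
    writeln(s.walkLength);            // 5 -- code points (dchar), via decoding
    writeln(s.byGrapheme.walkLength); // 4 -- graphemes, via std.uni
}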
Yes, please. While operations on single code points and
characters seem pretty robust (i.e. you can do lots of things
with and to them), it feels like it just falls apart when you
try to work with strings. It honestly surprised me how many
things in std.uni don't seem to work on ranges.
-Wyatt
Most string code is Unicode-correct as long as it works on code
points and all inputs are in the same normalization form;
explicit grapheme awareness is rarely a necessity. By that I
mean that the most common string operations, such as searching
or taking a substring, will work without any special grapheme
decoding (beyond normalization).
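For example, roughly (normalize defaults to NFC):

import std.algorithm : canFind;
import std.stdio : writeln;
import std.uni : normalize;

void main()
{
    string precomposed = "no\u00EBl";  // ë as a single code point
    string decomposed  = "noe\u0308l"; // e followed by a combining diaeresis

    // A code-point-level search misses the match while the forms differ...
    writeln(decomposed.canFind(precomposed)); // false

    // ...but works once both sides are normalized -- no grapheme decoding.
    writeln(decomposed.normalize.canFind(precomposed.normalize)); // true
}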
The hiccups appear when code points are shuffled around or
reordered. Apart from such comparatively rare string-manipulation
cases, grapheme awareness is only really necessary in rendering
code.
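The classic example is reversing a string: reversing by code
point detaches a combining mark from its base, while reversing
by grapheme keeps the cluster intact. A quick sketch:

import std.array : array;
import std.conv : text;
import std.range : retro;
import std.stdio : writeln;
import std.uni : byCodePoint, byGrapheme;

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining diaeresis

    // Reversing by code point moves the diaeresis from 'e' onto 'l'.
    writeln(s.retro.text);                              // "l\u0308eon"

    // Reversing by grapheme keeps the mark attached to its base.
    writeln(s.byGrapheme.array.retro.byCodePoint.text); // "le\u0308on"
}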