On 1/16/11 3:20 PM, Michel Fortin wrote:
On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
<[email protected]> said:
On 1/15/11 10:45 PM, Michel Fortin wrote:
No doubt it's easier to implement it that way. The problem is that in
most cases it won't be used. How many people really know what a
grapheme is?
How many people really should care?
I think the only people who should *not* care are those who have
validated that the input does not contain any combining code point. If
you know the input *can't* contain combining code points, then it's safe
to ignore them.
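That validation step — checking that input contains no combining code points — is cheap to sketch. Here it is in Python, used only for illustration because its standard `unicodedata` module exposes Unicode general categories; `has_combining` is a name invented for this sketch, not any Phobos API:

```python
import unicodedata

def has_combining(s: str) -> bool:
    """True if s contains any combining code point
    (general category Mn, Mc, or Me)."""
    return any(unicodedata.category(ch).startswith("M") for ch in s)

print(has_combining("cafe"))        # False: plain ASCII
print(has_combining("cafe\u0301"))  # True: e + U+0301 COMBINING ACUTE ACCENT
```

Once such a check has passed, code-point-level processing of that string is indeed safe.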
I agree. Now let me ask again: how many people really should care?
If we don't make correct Unicode handling the default, someday someone
is going to ask a developer to fix a problem where his system doesn't
handle some text correctly. Later that day, he'll come to the
realization that almost none of his D code and none of the D libraries
he uses handle Unicode correctly, and he'll say: can't fix this. His peer
working on a similar Objective-C program will have a good laugh.
Sure, correct Unicode handling is slower and more complicated to
implement, but at least you know you'll get the right results.
I love the increased precision, but again I'm not sure how many people
ever manipulate text with combining characters. Meanwhile they'll
complain that D is slower than other languages.
Of those, how many will forget to use byGrapheme at one time
or another? And so in most programs string manipulation will misbehave
in the presence of combining characters or unnormalized strings.
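The misbehavior in question is easy to reproduce. A Python sketch (chosen only because it is easy to run; the same failure mode applies to any code-point-level manipulation):

```python
s = "caf" + "e\u0301"  # 'café' with the accent as a separate combining mark
print(len(s))          # 5 code points, though the reader sees 4 characters
print(s[::-1])         # naive reversal strands the accent at the front
print(s[:4])           # naive slicing cuts the accent off the 'e'
```

Counting, reversing, and slicing all silently operate on the wrong unit.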
But most strings don't contain combining characters, nor are they
unnormalized.
I think we should expect combining marks to be used more and more as our
OS text system and fonts start supporting them better. Them being rare
might be true today, but what do you know about tomorrow?
I don't think languages will acquire more diacritics soon. I do hope, of
course, that D applications gain more usage in the Arabic, Hebrew etc.
world.
A few years ago, many Unicode symbols didn't even show up correctly on
Windows. Today, we have Unicode domain names and people start putting
funny symbols in them (for instance: <http://◉.ws>). I haven't seen it
yet, but we'll surely see combining characters in domain names soon
enough (if only as a way to make fun of programs that can't handle
Unicode correctly). Well, let me be the first to make fun of such
programs: <http://☺̭̏.michelf.com/>.
Would you bet the language on that?
Also, not all combining characters are marks meant for use in foreign
languages. Some are used for mathematics, for instance. Or you
could use U+20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay
indicating some kind of prohibition.
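That U+20E0 example checks out; a quick look-up via Python's `unicodedata` (illustrative only) shows it is an enclosing mark rather than a letter-oriented diacritic:

```python
import unicodedata

mark = "\u20E0"
print(unicodedata.name(mark))      # COMBINING ENCLOSING CIRCLE BACKSLASH
print(unicodedata.category(mark))  # 'Me': enclosing combining mark
forbidden = "P" + mark             # two code points, one "no P" symbol
print(len(forbidden))              # 2
```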
If you want to help D programmers write correct code when it comes to
Unicode manipulation, you need to help them iterate on real characters
(graphemes), and you need the algorithms to apply to real characters
(graphemes), not the approximation of a Unicode character that is a code
point.
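What "iterating on real characters" means can be approximated in a few lines. This Python sketch clusters a base code point with the combining marks that follow it; it is a rough approximation of the real thing, which is defined by Unicode's UAX #29 and also covers Hangul jamo, ZWJ emoji sequences, and more:

```python
import unicodedata

def graphemes(s: str):
    """Yield approximate grapheme clusters: a base code point plus
    any combining marks (category M*) that follow it."""
    cluster = ""
    for ch in s:
        if cluster and not unicodedata.category(ch).startswith("M"):
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster

print(list("cafe\u0301"))             # 5 code points
print(list(graphemes("cafe\u0301")))  # 4 graphemes: ['c', 'a', 'f', 'é']
```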
I don't think the situation is as clean cut, as grave, and as urgent
as you say.
I agree it's probably not as clean cut as I say (I'm trying to keep
complicated things simple here), but it's something important to decide
early, because the cost of changing it increases as more code is written.
Agreed.
Quoting the first part of the same post (out of order):
Disagree with that as one might, a simple fact that needs to be taken
into account is that as of right now all of Phobos uses UTF arrays for
string representation and dchar as the element type.
Besides, for one I do dispute the idea that a grapheme element is
better than a dchar element for iterating over a string. The grapheme
has the attractiveness of being theoretically clean but at the same
time is woefully inefficient and helps languages that few D users need
to work with. At least that's my perception, and we need some serious
numbers instead of convincing rhetoric to make a big decision.
You'll no doubt get more performance from a grapheme-aware specialized
algorithm working directly on code points than by iterating on graphemes
returned as string slices. But both will give *correct* results.
Implementing a specialized algorithm of this kind becomes an
optimization, and it's likely you'll want an optimized version for most
string algorithms.
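One concrete shape such a specialized algorithm can take is substring search that scans code points directly but rejects a match whose final character is extended by a following combining mark. A hedged Python sketch (`grapheme_find` is a name invented here, not an API from the thread):

```python
import unicodedata

def grapheme_find(haystack: str, needle: str) -> int:
    """Code-point-level search that refuses matches whose last
    character is extended by a following combining mark."""
    start = 0
    while (i := haystack.find(needle, start)) != -1:
        j = i + len(needle)
        if j == len(haystack) or not unicodedata.category(haystack[j]).startswith("M"):
            return i
        start = i + 1
    return -1

s = "expose\u0301"             # ends in 'é' written as e + combining acute
print(s.find("se"))            # 4: code-point search matches inside the 'é'
print(grapheme_find(s, "se"))  # -1: no such grapheme sequence exists
```

It never allocates grapheme slices, yet gives the same answer a grapheme-by-grapheme search would.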
I'd like to have some numbers too about performance, but I have none at
this time.
I spent a fair amount of time comparing ASCII vs. Unicode code speed.
The fact of the matter is that the overhead is measurable and often
high. Also it occurs at a very core level. For starters, the grapheme
itself is larger and has one extra indirection. I am confident the
marginal overhead for graphemes would be considerable.
It's all a matter of picking one's trade-offs. Clearly ASCII is out as
no serious amount of non-English text can be trafficked without
diacritics. So switching to UTF makes a lot of sense, and that's what
D did.
When I introduced std.range and std.algorithm, they'd handle char[]
and wchar[] no differently than any other array. A lot of algorithms
simply did the wrong thing by default, so I attempted to fix that
situation by defining byDchar(). So instead of passing some string str
to an algorithm, one would pass byDchar(str).
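For readers outside D: a char[] holds UTF-8 code units, and byDchar decodes them into code points on the fly. The same unit-versus-point distinction can be shown in Python, where str is already a sequence of code points and encode() exposes the underlying units:

```python
s = "h\u00e9llo"          # 'héllo'
units = s.encode("utf-8")  # the UTF-8 code units a D char[] would hold
print(len(units))          # 6: 'é' occupies two code units
print(len(s))              # 5: what decoding by code point yields
print(units[1:3].hex())    # 'c3a9': the two-unit encoding of 'é'
```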
A couple of weeks went by in testing that state of affairs, and before
long I figured that I needed to insert byDchar() virtually _everywhere_.
There were a couple of algorithms (e.g. Boyer-Moore) that happened to
work with arrays for subtle reasons (needless to say, they won't work
with graphemes at all). But by and large the situation was that the
simple and intuitive code was wrong and that the correct code
necessitated inserting byDchar().
So my next decision, which understandably some of the people who
didn't go through the experiment may find unintuitive, was to make
byDchar() the default. This cleaned up a lot of crap in std itself and
saved a lot of crap in the yet-unwritten client code.
But were your algorithms *correct* in the first place? I'd argue that by
making byDchar the default you've not saved yourself from any crap
because dchar isn't the right layer of abstraction.
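The sense in which a dchar (a code point) falls short of the user-perceived character shows up immediately with normalization. A Python sketch using the stdlib `unicodedata`, for illustration only:

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "e\u0301")  # composes to single U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")   # decomposes to e + U+0301
print(len(nfc), len(nfd))  # 1 2: same character, different code point counts
print(nfc == nfd)          # False at the code-point level
print(unicodedata.normalize("NFC", nfd) == nfc)  # True once normalized
```

Code-point-level equality and length both disagree with what the user sees, unless the code normalizes first.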
It was correct for all but a couple of languages. Again: most of today's
languages don't ever need combining characters.
I think it's reasonable to understand why I'm happy with the current
state of affairs. It is better than anything we've had before and
better than everything else I've tried.
It is indeed easy to understand why you're happy with the current state
of affairs: you never had to deal with multi-code-point characters and
can't imagine yourself having to deal with them on a semi-frequent
basis.
Do you, and can you?
Other people won't be so happy with this state of affairs, but
they'll probably notice only after most of their code has been written
unaware of the problem.
They can't be unaware and write said code.
Now, thanks to the effort people have spent in this group (thank
you!), I have an understanding of the grapheme issue. I guarantee that
grapheme-level iteration will incur a high cost: efficiency, and
changes in std. The languages that need combining characters to
produce meaningful text are few and far between, so it makes sense to
confine support for them to libraries that are not the default, unless
we find ways to not disrupt everyone else.
We all are more aware of the problem now, that's a good thing. :-)
All I wish is that it not be blown out of proportion. It fares rather
low on my list of library issues that D has right now.
Andrei