On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu <[email protected]> said:

> On 1/15/11 10:45 PM, Michel Fortin wrote:
>> No doubt it's easier to implement it that way. The problem is that in
>> most cases it won't be used. How many people really know what a
>> grapheme is?

> How many people really should care?

I think the only people who should *not* care are those who have validated that the input does not contain any combining code point. If you know the input *can't* contain combining code points, then it's safe to ignore them.

If we don't make correct Unicode handling the default, someday someone is going to ask a developer to fix a problem where his system doesn't handle some text correctly. Later that day, he'll come to the realization that almost none of his D code and none of the D libraries he uses handle Unicode correctly, and he'll say: can't fix this. His peer working on a similar Objective-C program will have a good laugh.

Sure, correct Unicode handling is slower and more complicated to implement, but at least you know you'll get the right results.
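To make the hazard concrete, here is a small illustration (in Python, used here only because it exposes strings as code points, much like D's dchar view): the same visible character can be one code point or two, and code-point-level operations disagree about which it is.

```python
import unicodedata

# "é" as a single precomposed code point vs. "e" + U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"      # é in NFC form
decomposed = "e\u0301"      # e + combining acute accent (NFD form)

# The two render identically and compare equal after normalization...
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

# ...but naive code-point-level operations disagree about them:
assert len(precomposed) == 1
assert len(decomposed) == 2   # slicing decomposed[:1] silently drops the accent
```

This is exactly the situation where code that was never tested against combining code points gives wrong lengths, wrong slices, and wrong comparisons.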


>> Of those, how many will forget to use byGrapheme at one time
>> or another? And so in most programs string manipulation will misbehave
>> in the presence of combining characters or unnormalized strings.

> But most strings don't contain combining characters or unnormalized strings.

I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?

A few years ago, many Unicode symbols didn't even show up correctly on Windows. Today, we have Unicode domain names and people start putting funny symbols in them (for instance: <http://◉.ws>). I haven't seen it yet, but we'll surely see combining characters in domain names soon enough (if only as a way to make fun of programs that can't handle Unicode correctly). Well, let me be the first to make fun of such programs: <http://☺̭̏.michelf.com/>.

Also, not all combining characters are marks meant to be used by some foreign languages. Some are used for mathematics for instance. Or you could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay indicating some kind of prohibition.
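As a sketch of that prohibition overlay (Python used purely for illustration; the character properties come from the Unicode character database):

```python
import unicodedata

# 'P' followed by U+20E0 COMBINING ENCLOSING CIRCLE BACKSLASH,
# rendering as a "no P allowed" style symbol
prohibited = "P\u20e0"

assert len(prohibited) == 2                     # two code points on the wire...
assert unicodedata.category("\u20e0") == "Me"   # ...but the second is an enclosing mark
assert unicodedata.name("\u20e0") == "COMBINING ENCLOSING CIRCLE BACKSLASH"
# Any code-point-level slice such as prohibited[:1] silently drops the overlay.
```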


>> If you want to help D programmers write correct code when it comes to
>> Unicode manipulation, you need to help them iterate on real characters
>> (graphemes), and you need the algorithms to apply to real characters
>> (graphemes), not the approximation of a Unicode character that is a code
>> point.

> I don't think the situation is as clean cut, as grave, and as urgent as you say.

I agree it's probably not as clean cut as I say (I'm trying to keep complicated things simple here), but it's something important to decide early because the cost of changing it increases as more code is written.


Quoting the first part of the same post (out of order):

> Disagreement as that might be, a simple fact that needs to be taken into account is that as of right now all of Phobos uses UTF arrays for string representation and dchar as element type.
> 
> Besides, for one I do dispute the idea that a grapheme element is better than a dchar element for iterating over a string. The grapheme has the attractiveness of being theoretically clean but at the same time is woefully inefficient and helps languages that few D users need to work with. At least that's my perception, and we need some serious numbers instead of convincing rhetoric to make a big decision.

You'll no doubt get more performance from a grapheme-aware specialized algorithm working directly on code points than by iterating on graphemes returned as string slices. But both will give *correct* results.

Implementing a specialized algorithm of this kind becomes an optimization, and it's likely you'll want an optimized version for most string algorithms.

I'd like to have some numbers too about performance, but I have none at this time.
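To show what grapheme iteration actually involves, here is a deliberately simplified splitter (Python for illustration; real segmentation follows Unicode's UAX #29 and handles Hangul jamo, ZWJ sequences, and more, which is precisely where the performance cost comes from):

```python
import unicodedata

def graphemes(s):
    """Very rough grapheme splitter: attach combining marks (categories
    Mn, Mc, Me) to the preceding code point. Real segmentation per
    UAX #29 handles many more cases and is correspondingly slower."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch      # combining mark joins the previous cluster
        else:
            clusters.append(ch)     # a new cluster starts here
    return clusters

# "étendu" with a decomposed é: the accent stays glued to its base letter.
assert graphemes("e\u0301tendu") == ["e\u0301", "t", "e", "n", "d", "u"]
```

Even this toy version does a table lookup per code point, which hints at why a specialized algorithm working directly on code points, with grapheme boundaries checked only where it matters, can be much faster than iterating over grapheme slices.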


> It's all a matter of picking one's trade-offs. Clearly ASCII is out as no serious amount of non-English text can be trafficked without diacritics. So switching to UTF makes a lot of sense, and that's what D did.
> 
> When I introduced std.range and std.algorithm, they'd handle char[] and wchar[] no differently than any other array. A lot of algorithms simply did the wrong thing by default, so I attempted to fix that situation by defining byDchar(). So instead of passing some string str to an algorithm, one would pass byDchar(str).
> 
> A couple of weeks went by in testing that state of affairs, and before long I figured that I need to insert byDchar() virtually _everywhere_. There were a couple of algorithms (e.g. Boyer-Moore) that happened to work with arrays for subtle reasons (needless to say, they won't work with graphemes at all). But by and large the situation was that the simple and intuitive code was wrong and that the correct code necessitated inserting byDchar().
> 
> So my next decision, which understandably some of the people who didn't go through the experiment may find unintuitive, was to make byDchar() the default. This cleaned up a lot of crap in std itself and saved a lot of crap in the yet-unwritten client code.

But were your algorithms *correct* in the first place? I'd argue that by making byDchar the default you've not saved yourself from any crap because dchar isn't the right layer of abstraction.
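A one-line demonstration of why code points (dchar) are the wrong layer for some algorithms, sketched in Python since any code-point-level API behaves the same way: reversing by code point detaches a combining accent from its base letter.

```python
s = "noe\u0301l"               # "noél": n, o, e, COMBINING ACUTE ACCENT, l

# Reversing code points moves the accent onto the wrong base letter:
assert s[::-1] == "l\u0301eon"

# A grapheme-aware reverse would keep "e\u0301" together and yield
# "le\u0301on"; no purely code-point-level operation gives that for free.
assert s[::-1] != "le\u0301on"
```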


> I think it's reasonable to understand why I'm happy with the current state of affairs. It is better than anything we've had before and better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point characters and can't imagine yourself having to deal with them on a semi-frequent basis. Other people won't be so happy with this state of affairs, but they'll probably notice only after most of their code has been written unaware of the problem.


> Now, thanks to the effort people have spent in this group (thank you!), I have an understanding of the grapheme issue. I guarantee that grapheme-level iteration will have a high cost incurred to it: efficiency and changes in std. The languages that need composing characters for producing meaningful text are few and far between, so it makes sense to confine support for them to libraries that are not the default, unless we find ways to not disrupt everyone else.

We all are more aware of the problem now, that's a good thing. :-)


--
Michel Fortin
[email protected]
http://michelf.com/
