On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu <[email protected]> said:

> On 1/15/11 10:45 PM, Michel Fortin wrote:
>> No doubt it's easier to implement it that way. The problem is that in
>> most cases it won't be used. How many people really know what a
>> grapheme is?

> How many people really should care?

I think the only people who should *not* care are those who have validated that the input does not contain any combining code point. If you know the input *can't* contain combining code points, then it's safe to ignore them.

If we don't make correct Unicode handling the default, someday someone is going to ask a developer to fix a problem where his system doesn't handle some text correctly. Later that day, he'll come to the realization that almost none of his D code and none of the D libraries he uses handle Unicode correctly, and he'll say: can't fix this. His peer working on a similar Objective-C program will have a good laugh.

Sure, correct Unicode handling is slower and more complicated to implement, but at least you know you'll get the right results.
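To make the hazard concrete, here is a small illustration (in Python, used here only because it exposes strings as code points, much like D's dchar view): the same visible character can be one code point or two, and code-point-level operations disagree about which it is.

```python
import unicodedata

# "é" as a single precomposed code point vs. "e" + U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"      # é in NFC form
decomposed = "e\u0301"      # e + combining acute accent (NFD form)

# The two render identically and compare equal after normalization...
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

# ...but naive code-point-level operations disagree about them:
assert len(precomposed) == 1
assert len(decomposed) == 2   # slicing decomposed[:1] silently drops the accent
```

This is exactly the situation where code that was never tested against combining code points gives wrong lengths, wrong slices, and wrong comparisons.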


>> Of those, how many will forget to use byGrapheme at one time
>> or another? And so in most programs string manipulation will misbehave
>> in the presence of combining characters or unnormalized strings.

> But most strings don't contain combining characters or unnormalized strings.

I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?

A few years ago, many Unicode symbols didn't even show up correctly on Windows. Today, we have Unicode domain names and people start putting funny symbols in them (for instance: <http://◉.ws>). I haven't seen it yet, but we'll surely see combining characters in domain names soon enough (if only as a way to make fun of programs that can't handle Unicode correctly). Well, let me be the first to make fun of such programs: <http://☺̭̏.michelf.com/>.

Also, not all combining characters are marks meant to be used by some foreign languages. Some are used for mathematics for instance. Or you could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay indicating some kind of prohibition.
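As a sketch of that prohibition overlay (Python used purely for illustration; the character properties come from the Unicode character database):

```python
import unicodedata

# 'P' followed by U+20E0 COMBINING ENCLOSING CIRCLE BACKSLASH,
# rendering as a "no P allowed" style symbol
prohibited = "P\u20e0"

assert len(prohibited) == 2                     # two code points on the wire...
assert unicodedata.category("\u20e0") == "Me"   # ...but the second is an enclosing mark
assert unicodedata.name("\u20e0") == "COMBINING ENCLOSING CIRCLE BACKSLASH"
# Any code-point-level slice such as prohibited[:1] silently drops the overlay.
```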


>> If you want to help D programmers write correct code when it comes to
>> Unicode manipulation, you need to help them iterate on real characters
>> (graphemes), and you need the algorithms to apply to real characters
>> (graphemes), not the approximation of a Unicode character that is a code
>> point.

> I don't think the situation is as clean cut, as grave, and as urgent as you say.

I agree it's probably not as clean cut as I say (I'm trying to keep complicated things simple here), but it's something important to decide early because the cost of changing it increases as more code is written.


Quoting the first part of the same post (out of order):

> Disagreement as that might be, a simple fact that needs to be taken into account is that as of right now all of Phobos uses UTF arrays for string representation and dchar as element type.
> 
> Besides, for one I do dispute the idea that a grapheme element is better than a dchar element for iterating over a string. The grapheme has the attractiveness of being theoretically clean but at the same time is woefully inefficient and helps languages that few D users need to work with. At least that's my perception, and we need some serious numbers instead of convincing rhetoric to make a big decision.

You'll no doubt get more performance from a grapheme-aware specialized algorithm working directly on code points than by iterating on graphemes returned as string slices. But both will give *correct* results.

Implementing a specialized algorithm of this kind becomes an optimization, and it's likely you'll want an optimized version for most string algorithms.

I'd like to have some numbers too about performance, but I have none at this time.
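To show what grapheme iteration actually involves, here is a deliberately simplified splitter (Python for illustration; real segmentation follows Unicode's UAX #29 and handles Hangul jamo, ZWJ sequences, and more, which is precisely where the performance cost comes from):

```python
import unicodedata

def graphemes(s):
    """Very rough grapheme splitter: attach combining marks (categories
    Mn, Mc, Me) to the preceding code point. Real segmentation per
    UAX #29 handles many more cases and is correspondingly slower."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch      # combining mark joins the previous cluster
        else:
            clusters.append(ch)     # a new cluster starts here
    return clusters

# "étendu" with a decomposed é: the accent stays glued to its base letter.
assert graphemes("e\u0301tendu") == ["e\u0301", "t", "e", "n", "d", "u"]
```

Even this toy version does a table lookup per code point, which hints at why a specialized algorithm working directly on code points, with grapheme boundaries checked only where it matters, can be much faster than iterating over grapheme slices.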


> It's all a matter of picking one's trade-offs. Clearly ASCII is out as no serious amount of non-English text can be trafficked without diacritics. So switching to UTF makes a lot of sense, and that's what D did.
> 
> When I introduced std.range and std.algorithm, they'd handle char[] and wchar[] no differently than any other array. A lot of algorithms simply did the wrong thing by default, so I attempted to fix that situation by defining byDchar(). So instead of passing some string str to an algorithm, one would pass byDchar(str).
> 
> A couple of weeks went by in testing that state of affairs, and before long I figured that I need to insert byDchar() virtually _everywhere_. There were a couple of algorithms (e.g. Boyer-Moore) that happened to work with arrays for subtle reasons (needless to say, they won't work with graphemes at all). But by and large the situation was that the simple and intuitive code was wrong and that the correct code necessitated inserting byDchar().
> 
> So my next decision, which understandably some of the people who didn't go through the experiment may find unintuitive, was to make byDchar() the default. This cleaned up a lot of crap in std itself and saved a lot of crap in the yet-unwritten client code.

But were your algorithms *correct* in the first place? I'd argue that by making byDchar the default you've not saved yourself from any crap because dchar isn't the right layer of abstraction.
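A one-line demonstration of why code points (dchar) are the wrong layer for some algorithms, sketched in Python since any code-point-level API behaves the same way: reversing by code point detaches a combining accent from its base letter.

```python
s = "noe\u0301l"               # "noél": n, o, e, COMBINING ACUTE ACCENT, l

# Reversing code points moves the accent onto the wrong base letter:
assert s[::-1] == "l\u0301eon"

# A grapheme-aware reverse would keep "e\u0301" together and yield
# "le\u0301on"; no purely code-point-level operation gives that for free.
assert s[::-1] != "le\u0301on"
```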


> I think it's reasonable to understand why I'm happy with the current state of affairs. It is better than anything we've had before and better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point characters and can't imagine yourself having to deal with them on a semi-frequent basis. Other people won't be so happy with this state of affairs, but they'll probably notice only after most of their code has been written unaware of the problem.


> Now, thanks to the effort people have spent in this group (thank you!), I have an understanding of the grapheme issue. I guarantee that grapheme-level iteration will have a high cost incurred to it: efficiency and changes in std. The languages that need composing characters for producing meaningful text are few and far between, so it makes sense to confine support for them to libraries that are not the default, unless we find ways to not disrupt everyone else.

We all are more aware of the problem now, that's a good thing. :-)


--
Michel Fortin
[email protected]
http://michelf.com/
