Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Michel Fortin Mon, 17 Jan 2011 08:35:28 -0800

On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu <[email protected]> said:

> On 1/16/11 3:20 PM, Michel Fortin wrote:
>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>> <[email protected]> said:
>>
>>> On 1/15/11 10:45 PM, Michel Fortin wrote:
>>>> No doubt it's easier to implement it that way. The problem is that in
>>>> most cases it won't be used. How many people really know what is a
>>>> grapheme?
>>>
>>> How many people really should care?
>>
>> I think the only people who should *not* care are those who have
>> validated that the input does not contain any combining code point. If
>> you know the input *can't* contain combining code points, then it's safe
>> to ignore them.
>
> I agree. Now let me ask again: how many people really should care?

As I said: all those people who are not validating the inputs to make
sure they don't contain combining code points. As far as I know, no one
is doing that, so that means everybody should use algorithms capable of
handling multi-code-point graphemes. If someone indeed is doing this
validation, he'll probably also be smart enough to make his algorithms
work with dchars.

That said, no one should really have to care but those who implement
the string manipulation functions. The idea behind making the grapheme
the element type is to make it easier to write grapheme-aware string
manipulation functions, even if you don't know about graphemes. But the
reality is probably more mixed than that.


- - -

I gave some thought to all this, and came to an interesting
realization that made me refine the proposal. The new proposal is
perhaps as disruptive as the first, but in a different way.


But first, let's state a few facts to reframe the current discussion:

Fact 1: most people don't know Unicode very well

Fact 2: most people are confused by code units, code points, graphemes,
and what is a 'character'

Fact 3: most people won't bother with all this, they'll just use the
basic language facilities and assume everything works correctly if it
works correctly for them


Now, let's define two goals:

Goal 1: make most people's string operations work correctly
Goal 2: make most people's string operations work fast

To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
sure we agree on this, but let's continue.

From the above 3 facts, we can deduce that a user won't want to bother
using byDchar, byGrapheme, or byWhatever when using algorithms. You
were annoyed by having to write byDchar everywhere, so you changed the
element type to always be dchar, and now you don't have to write byDchar
anymore. That's understandable and perfectly reasonable.

The problem is of course that it doesn't give you correct results. Most
of the time what you really want is to use graphemes; dchar just happens
to be a good approximation that works most of the time.

Iterating by grapheme is somewhat problematic, and it degrades
performance. Same for comparing graphemes for normalized equivalence.
That's all true. I'm not too sure what we can do about that. It can be
optimized, but it's very understandable that some people won't be
satisfied by the performance and will want to avoid graphemes.
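To make the dchar versus grapheme gap concrete, here is a minimal sketch.
It assumes a present-day Phobos in which std.uni provides byGrapheme and
normalize; that is an assumption relative to the library discussed in this
thread, and the helper names are hypothetical, chosen only for
illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize;

// Hypothetical helpers, one per level of abstraction.
size_t codeUnitCount(string s)  { return s.length; }               // UTF-8 code units
size_t codePointCount(string s) { return s.walkLength; }           // decodes to dchar
size_t graphemeCount(string s)  { return s.byGrapheme.walkLength; }

// Equality after canonical composition (NFC is normalize's default form).
bool sameAfterNormalization(string a, string b)
{
    return normalize(a) == normalize(b);
}

unittest
{
    string decomposed  = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT
    string precomposed = "\u00E9";  // U+00E9, the same character to a reader

    assert(codeUnitCount(decomposed) == 3);
    assert(codePointCount(decomposed) == 2);
    assert(graphemeCount(decomposed) == 1);

    // Comparing by dchar says the two differ; normalized comparison does not.
    assert(decomposed != precomposed);
    assert(sameAfterNormalization(decomposed, precomposed));
}
```

The sketch can be checked with `dmd -unittest -main -run`. Note that the
normalized comparison allocates and walks both strings, which is the kind of
cost the paragraph above is concerned about.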

Speaking of optimization, I do understand that iterating by grapheme
using the range interface won't give you the best performance. It's
certainly convenient as it enables the reuse of existing algorithms
with graphemes, but more specialized algorithms and interfaces might be
more suited.

One observation I made with having dchar as the default element type is
that not all algorithms really need to deal with dchar. If I'm
searching for code point 'a' in a UTF-8 string, decoding code units
into code points is a waste of time. Why? Because the only way to
represent code point 'a' is by having code unit 'a'. And guess what?
Almost the same optimization can apply to graphemes: if you're
searching for 'a' in a grapheme-aware manner in a UTF-8 string, all you
have to do is search for the UTF-8 code unit 'a', then check if the 'a'
code unit is followed by a combining mark code point to confirm it is
really an 'a', not a composed grapheme. Iterating the string by code
unit is enough for these cases, and it'd increase performance by a lot.
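The code-unit search described above could be sketched as follows. This is
only an illustration under stated assumptions: findBareAscii is a
hypothetical name, the needle is restricted to ASCII, and std.uni.isMark
(general categories Mn, Mc, Me) stands in for "combining mark", which is
an approximation of the full grapheme cluster rules.

```d
import std.uni : isMark;
import std.utf : decode;

/// Hypothetical helper: index of the first `needle` in `haystack` that is
/// not followed by a combining mark, or -1 if there is none.
ptrdiff_t findBareAscii(string haystack, char needle)
{
    assert(needle < 0x80, "this sketch handles ASCII needles only");

    // Declaring the element as `char` makes foreach walk code units,
    // so nothing is decoded while scanning.
    foreach (i, char unit; haystack)
    {
        if (unit != needle)
            continue;

        size_t next = i + 1;
        if (next == haystack.length)
            return i; // nothing follows, so nothing can combine with it

        // Decode only the single code point that follows the candidate.
        dchar following = decode(haystack, next);
        if (!isMark(following))
            return i;
    }
    return -1;
}

unittest
{
    assert(findBareAscii("banana", 'a') == 1);
    // The first 'a' carries U+0301, so the match is the last code unit.
    assert(findBareAscii("a\u0301bca", 'a') == 5);
    assert(findBareAscii("a\u0301", 'a') == -1);
}
```

The scan is safe on UTF-8 because a byte below 0x80 never appears inside a
multi-byte sequence; decoding happens only right after a candidate match.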

So making dchar the default type is no doubt convenient because it
abstracts things enough so that generic algorithms can work with
strings, but it has a performance penalty that you don't always need. I
made an example using UTF-8, it applies even more to UTF-16. And it
applies to grapheme-aware manipulations too.

This penalty with generic algorithms comes from the fact that they take
a predicate of the form "a == 'a'" or "a == b", which is ill-suited for
strings because you always need to fully decode the string (by dchar or
by graphemes) for the purpose of calling the predicate. Given that
comparing characters for anything other than equality or membership in
a set is very rarely something you do, generic algorithms miss a big
optimization opportunity here.


- - -

So here's what I think we should do:

Todo 1: disallow generic algorithms on naked strings: string-specific
Unicode-aware algorithms should be used instead; they can share the
same name if their usage is similar

Todo 2: to use a generic algorithm with a string, you must dress the
string using one of toDchar, toGrapheme, toCodeUnits; this way your
intentions are clear

Todo 3: string-specific algorithms can be implemented as simple wrappers
for generic algorithms with the string dressed correctly for the task,
or they can implement more sophisticated algorithms to increase
performance
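A rough sketch of what Todo 2 and Todo 3 could look like. The names
toCodeUnits, toDchar and toGrapheme come from the proposal; their bodies
are assumptions that simply forward to ranges found in present-day Phobos
(std.utf.byCodeUnit, std.utf.byDchar, std.uni.byGrapheme), and
characterCount is a hypothetical string-specific algorithm.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Dressing functions named after the proposal; the bodies are assumptions
// that forward to existing Phobos ranges.
auto toCodeUnits(string s) { return s.byCodeUnit; }
auto toDchar(string s)     { return s.byDchar; }
auto toGrapheme(string s)  { return s.byGrapheme; }

// Hypothetical string-specific algorithm: a thin wrapper around the
// generic walkLength, with the string dressed as graphemes.
size_t characterCount(string s)
{
    return s.toGrapheme.walkLength;
}

unittest
{
    string s = "noe\u0308l"; // "noel" with a combining diaeresis on the 'e'

    // The same generic algorithm gives three answers depending on the dress.
    assert(s.toCodeUnits.walkLength == 6);
    assert(s.toDchar.walkLength == 5);
    assert(characterCount(s) == 4);
}
```

The caller of characterCount never has to know about graphemes, while code
that wants a different trade-off states it explicitly by choosing a dress.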


There are two major benefits to this approach:

Benefit 1: if indeed you really don't want the performance penalty that
comes with checking for composed graphemes, you can bypass it at some
specific places in your code using byDchar, or you can disable it
altogether by modifying the string-specific algorithms and recompiling
Phobos.

Benefit 2: we don't have to rush to implement graphemes in the
Unicode-aware algorithms. Just make sure the interface for
string-specific algorithms *can* accept graphemes, and we can roll out
support for them at a later time once we have a decent implementation.

Also, all this leaves open the question of what to do when someone uses
the string as a range. In my opinion, it should either iterate on code
units (because the string is actually an array, and because that's what
foreach does) or simply disallow iteration (asking that you dress the
string first using toCodeUnits, toDchar, or toGrapheme).


Do you like that more?


--
Michel Fortin
[email protected]
http://michelf.com/
