On 12/31/11 10:47 AM, Michel Fortin wrote:
On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu
<[email protected]> said:

On 12/31/11 8:17 CST, Michel Fortin wrote:
At one time Java and other frameworks started to use UTF-16 code units as
if they were characters, and that turned out badly for them. Now we know
that not even code points should be considered characters, thanks to
characters spanning multiple code points. You might call it perfect, but
for that you have made two assumptions:

1. treating code points as characters is good enough, and
2. the performance penalty of decoding everything is tolerable

I'm not sure how you concluded I drew such assumptions.

1: Because treating UTF-8 strings as a range of code points encourages
people to think so. 2: From things you posted on the newsgroup
previously. Sorry, I don't have the references, but it'd take too long to
dig them up.

That's sort of difficult to refute. Anyhow, I think it's great that algorithms can use types to go down to the representation if needed, and stay up at the bidirectional range level otherwise.

Ranges of code points might be perfect for you, but it's a tradeoff that
won't work in every situation.

Ranges can be defined over logical glyphs that span multiple code
points.

I'm talking about the default interpretation, where string ranges are
ranges of code points, making that tradeoff the default.

And also, I think we can agree that a logical glyph range would be
terribly inefficient in practice, although it could be a nice teaching
tool.

Well people who want that could use byGlyph() or something. If you want glyphs, you gotta pay the price.

The whole concept of generic algorithms working on strings efficiently
doesn't work.

Apparently std.algorithm does.

First, it doesn't really work.

Oh yes it does.

It seems to work fine, but it doesn't
handle (yet) characters spanning multiple code points.

That's the job of std.range, not std.algorithm.

To handle this
case, you could use a logical glyph range, but that'd be quite
inefficient. Or you can improve the algorithm working on code points so
that it checks for combining characters on the edges, but then is it
still a generic algorithm?

Second, it doesn't work efficiently. Sure you can specialize the
algorithm so it does not decode all code units when it's not necessary,
but then does it still classify as a generic algorithm?

My point is that *generic* algorithms cannot work *efficiently* with
Unicode, not that they can't work at all. And even then, for the
inefficient generic algorithm to work correctly with all input, the user
needs to choose the correct Unicode representation for the problem at
hand, which requires some general knowledge of Unicode.

Which is why I'd just discourage generic algorithms for strings.

I think you are in a position that is defensible, but not generous and therefore undesirable. The military equivalent would be defending a fortified landfill drained by a sewer. You don't _want_ to be there. Taken to its ultimate conclusion, your argument is that we give up on genericity for strings and go home.

Strings are a variable-length encoding on top of an array. That is a relatively easy abstraction to model. Currently we don't have a dedicated model for that - we offer the encoded data as a bidirectional range and also the underlying array. Algorithms that work with bidirectional ranges work out of the box. Those that can use the representation gainfully can opportunistically specialize on isSomeString!R.
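For instance, here is a sketch of the kind of opportunistic specialization I mean (countChar is just an illustrative name, nothing from Phobos). The generic path walks the range with front/popFront, which for strings decodes code points; the string path scans the underlying code units directly, which is safe for an ASCII needle because an ASCII byte never occurs inside a multi-byte UTF-8 sequence.

import std.range;
import std.traits : isSomeString;

size_t countChar(R)(R r, dchar needle)
{
    size_t n;
    static if (isSomeString!R)
    {
        // Specialized path: iterate the array's code units, no decoding.
        foreach (unit; r)
            if (unit == needle) ++n;
    }
    else
    {
        // Generic path: the range API, element by element.
        for (; !r.empty; r.popFront())
            if (r.front == needle) ++n;
    }
    return n;
}

unittest
{
    assert(countChar("naïve café", 'a') == 2);
}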

You contend that that doesn't "work", and I think you're wrong. But to the extent you have a case, an abstraction could be defined for variable-length encodings, and algorithms could be defined to work with that abstraction. I thought several times about that, but couldn't gather enough motivation for the simple reason that the current approach _works_.

I'm not against making strings more opaque to encourage people to use
the Unicode algorithms from the standard library instead of rolling
their own.

I'd say we're discussing making the two kinds of manipulation (encoded
sequence of logical characters vs. array of code units) more distinct
from each other. That's a Good Thing(tm).
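The distinction is already observable with what's in Phobos today: the array view counts code units, while the range view decodes code points.

import std.range;

void main()
{
    string s = "héllo";
    assert(s.length == 6);      // array of code units: 'é' takes two
    assert(s.walkLength == 5);  // bidirectional range of decoded code points
}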

It's a good abstraction for showing the theory of Unicode. But it's not
the way to go if you want efficiency. For efficiency you need, for each
element in the string, to use the lowest abstraction required to handle
that element, so your algorithm needs to know about the various
abstraction layers.

Correct.

This is the kind of "range" I'd use to create algorithms dealing with
Unicode properly:

struct UnicodeRange(U)
{
    U frontUnit() @property;               // current code unit
    dchar frontPoint() @property;          // current code point, decoded
    immutable(U)[] frontGlyph() @property; // current glyph, as a slice of code units

    void popFrontUnit();   // advance by one code unit
    void popFrontPoint();  // advance by one code point
    void popFrontGlyph();  // advance by one glyph

    ...
}

We already have most of that. For a string s, s[0] is frontUnit, s.front is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is popFrontPoint. We only need to define the glyph routines.
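Spelled out with what exists today (only the glyph level is missing):

import std.range;   // supplies the decoding front/popFront for strings

void main()
{
    string s = "été";                 // each 'é' is two UTF-8 code units

    immutable char  unit  = s[0];     // frontUnit: the first code unit
    immutable dchar point = s.front;  // frontPoint: the decoded code point
    assert(unit == 0xC3 && point == 'é');

    auto t = s[1 .. $];  // popFrontUnit: drops exactly one code unit
    s.popFront();        // popFrontPoint: drops the whole encoded 'é'
    assert(s == "té" && t.length == s.length + 1);
}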

But I think you'd be stopping short. You want generic variable-length encoding, not the above.

Not really a range per your definition of ranges, but basically it lets
you intermix working with units, code points, and glyphs. Add a way to
slice at the unit level and a way to know the length at the unit level
and it's all I need to make an efficient parser, or any algorithm really.

Except for the glyph implementation, we're already there. You are talking about existing capabilities!

The problem with .raw is that it creates a separate range for the units.

That's the best part about it.

This means you can't look at the frontUnit and then decide to pop the
unit and then look at the next, decide you need to decode using
frontPoint, then call popPoint and return to looking at the front unit.

Of course you can.

while (condition) {
    if (s.raw.front == someFrontUnitThatICareAbout) {
        s.raw.popFront();   // advance by one code unit, no decoding
        auto c = s.front;   // decode the next code point
        s.popFront();       // advance past that code point
    }
}

Now that I wrote it I'm even more enthralled with the coolness of the scheme. You essentially have access to two separate ranges on top of the same fabric.


Andrei
