On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu <[email protected]> said:

On 12/31/11 8:17 CST, Michel Fortin wrote:
At one time Java and other frameworks started to use UTF-16 code units as
if they were characters, and that turned out wrong for them. Now we know that not
even code points should be considered characters, thanks to characters
spanning multiple code points. You might call it perfect, but for
that you have made two assumptions:

1. treating code points as characters is good enough, and
2. the performance penalty of decoding everything is tolerable

I'm not sure how you concluded I drew such assumptions.

1: Because treating UTF-8 strings as a range of code points encourages people to think so. 2: From things you posted on the newsgroup previously. Sorry, I don't have the references; it'd take too long to dig them up.

Ranges of code points might be perfect for you, but it's a tradeoff that
won't work in every situation.

Ranges can be defined to span logical glyphs that span multiple code points.

I'm talking about the default interpretation, where string ranges are ranges of code points, making that tradeoff the default.

And also, I think we can agree that a logical glyph range would be terribly inefficient in practice, although it could be a nice teaching tool.

The whole concept of generic algorithms working on strings efficiently
doesn't work.

Apparently std.algorithm does.

First, it doesn't really work. It seems to work fine, but it doesn't yet handle characters spanning multiple code points. To handle this case, you could use a logical glyph range, but that'd be quite inefficient. Or you could improve the algorithm working on code points so that it checks for combining characters at the edges, but then is it still a generic algorithm?
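
To make the combining-character problem concrete, here's a small sketch of mine (not from the original post) showing a code-point-level search matching an 'e' that no reader would perceive as one:

import std.algorithm : canFind;
import std.stdio;

void main()
{
    // "é" in decomposed (NFD) form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    // Two code points, but a single user-perceived character (glyph).
    string s = "caf\u0065\u0301";   // displays as "café"

    // Iterating by code point, the search happily finds a bare 'e',
    // even though the reader only sees an 'é'.
    writeln(s.canFind('e'));        // true
}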

Second, it doesn't work efficiently. Sure, you can specialize the algorithm so it does not decode all code units when that's not necessary, but then does it still qualify as a generic algorithm?
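
For instance, a search for an ASCII character can be specialized to scan raw code units without decoding anything, because UTF-8 guarantees that bytes below 0x80 never occur inside a multi-byte sequence. A hedged sketch (findAsciiUnit is a made-up name, not actual std.algorithm code):

import std.stdio;
import std.string : representation;

// Find the first occurrence of an ASCII character in a UTF-8 string by
// scanning raw code units; no decoding happens at all.
size_t findAsciiUnit(string s, char c)
{
    assert(c < 0x80, "only valid for ASCII needles");
    foreach (i, u; s.representation)   // immutable(ubyte)[] view of the string
        if (u == c)
            return i;                  // index in code units, not code points
    return s.length;
}

void main()
{
    writeln(findAsciiUnit("héllo wörld", 'w'));  // 7, because 'é' occupies two code units
}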

My point is that *generic* algorithms cannot work *efficiently* with Unicode, not that they can't work at all. And even then, for the inefficient generic algorithm to work correctly with all input, the user needs to choose the correct Unicode representation for the problem at hand, which requires some general knowledge of Unicode.

Which is why I'd just discourage generic algorithms for strings.


I'm not against making strings more opaque to encourage people to use
the Unicode algorithms from the standard library instead of rolling
their own.

I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical characters vs. array of code units) more distinguished from each other. That's a Good Thing(tm).

It's a good abstraction for showing the theory of Unicode. But it's not the way to go if you want efficiency. For efficiency you need to use, for each element in the string, the lowest abstraction level required to handle that element, so your algorithm needs to know about the various abstraction layers.

This is the kind of "range" I'd use to create algorithms dealing with Unicode properly:

struct UnicodeRange(U)
{
        // Look at the front of the range at three abstraction levels:
        U frontUnit() @property;                // current code unit
        dchar frontPoint() @property;           // current code point, decoded
        immutable(U)[] frontGlyph() @property;  // current glyph, as a slice of code units

        // Advance by one element at the chosen abstraction level:
        void popFrontUnit();
        void popFrontPoint();
        void popFrontGlyph();

        ...
}

Not really a range per your definition of ranges, but basically it lets you intermix working with units, code points, and glyphs. Add a way to slice at the unit level and a way to know the length at the unit level, and it's all I need to make an efficient parser, or any algorithm really.
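
To show the kind of loop I have in mind, here is a hedged sketch built on the hypothetical UnicodeRange above (countWords is my own name, and it assumes the range also exposes an empty property, elided by the "..." in the interface). Delimiters are matched at the code-unit level; decoding happens only when a non-ASCII lead unit shows up:

import std.uni : isWhite;

// Count whitespace-separated words, staying at the code-unit level for ASCII
// and decoding a code point only when necessary.
size_t countWords(UnicodeRange!char r)
{
    size_t words;
    bool inWord;
    while (!r.empty)                     // assumed: empty is among the elided members
    {
        bool white;
        if (r.frontUnit < 0x80)          // ASCII: no decoding needed
        {
            white = r.frontUnit == ' ' || r.frontUnit == '\t' || r.frontUnit == '\n';
            r.popFrontUnit();
        }
        else                             // non-ASCII: decode just this code point
        {
            white = isWhite(r.frontPoint);
            r.popFrontPoint();
        }
        if (!white && !inWord)
            ++words;
        inWord = !white;
    }
    return words;
}

Note how the loop looks at frontUnit first and only then decides whether to decode the current code point.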

The problem with .raw is that it creates a separate range for the units. This means you can't look at frontUnit, decide to pop the unit, look at the next one, decide you need to decode using frontPoint, call popFrontPoint, and then return to looking at the front unit.

Also, I'm not sure the "glyph" part of that range is required most of the time, because most of the time you don't need to decode glyphs to be glyph-aware. But it'd be nice if you wanted to count them, and having it there alongside the rest makes users aware of them.

--
Michel Fortin
[email protected]
http://michelf.com/
