On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu <[email protected]> said:

On 12/31/11 8:17 CST, Michel Fortin wrote:
At one time Java and other frameworks started to use UTF-16 code units as
if they were characters, and that turned out wrong for them. Now we know that not
even code points should be considered characters, thanks to characters
spanning multiple code points. You might call it perfect, but for
that you have made two assumptions:

1. treating code points as characters is good enough, and
2. the performance penalty of decoding everything is tolerable

I'm not sure how you concluded I drew such assumptions.

1: Because treating UTF-8 strings as a range of code points encourages people to think so. 2: From things you posted on the newsgroup previously. Sorry, I don't have the references; it'd take too long to dig them up.

Ranges of code points might be perfect for you, but it's a tradeoff that
won't work in every situation.

Ranges can be defined to span logical glyphs that span multiple code points.

I'm talking about the default interpretation, where string ranges are ranges of code points, making that tradeoff the default.

And also, I think we can agree that a logical glyph range would be terribly inefficient in practice, although it could be a nice teaching tool.

The whole concept of generic algorithms working on strings efficiently
doesn't work.

Apparently std.algorithm does.

First, it doesn't really work. It seems to work fine, but it doesn't yet handle characters spanning multiple code points. To handle this case, you could use a logical glyph range, but that'd be quite inefficient. Or you could improve the algorithm working on code points so that it checks for combining characters at the edges, but then is it still a generic algorithm?
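
To make the combining-character problem concrete, here's a small sketch of mine (not from the original post) showing a code-point-level search matching an 'e' that no reader would perceive as one:

import std.algorithm : canFind;
import std.stdio;

void main()
{
    // "é" in decomposed (NFD) form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    // Two code points, but a single user-perceived character (glyph).
    string s = "caf\u0065\u0301";   // displays as "café"

    // Iterating by code point, the search happily finds a bare 'e',
    // even though the reader only sees an 'é'.
    writeln(s.canFind('e'));        // true
}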

Second, it doesn't work efficiently. Sure, you can specialize the algorithm so it does not decode all code units when that's not necessary, but then does it still qualify as a generic algorithm?
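
For instance, a search for an ASCII character can be specialized to scan raw code units without decoding anything, because UTF-8 guarantees that bytes below 0x80 never occur inside a multi-byte sequence. A hedged sketch (findAsciiUnit is a made-up name, not actual std.algorithm code):

import std.stdio;
import std.string : representation;

// Find the first occurrence of an ASCII character in a UTF-8 string by
// scanning raw code units; no decoding happens at all.
size_t findAsciiUnit(string s, char c)
{
    assert(c < 0x80, "only valid for ASCII needles");
    foreach (i, u; s.representation)   // immutable(ubyte)[] view of the string
        if (u == c)
            return i;                  // index in code units, not code points
    return s.length;
}

void main()
{
    writeln(findAsciiUnit("héllo wörld", 'w'));  // 7, because 'é' occupies two code units
}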

My point is that *generic* algorithms cannot work *efficiently* with Unicode, not that they can't work at all. And even then, for the inefficient generic algorithm to work correctly with all input, the user needs to choose the correct Unicode representation for the problem at hand, which requires some general knowledge of Unicode.

Which is why I'd just discourage generic algorithms for strings.


I'm not against making strings more opaque to encourage people to use
the Unicode algorithms from the standard library instead of rolling
their own.

I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical characters vs. array of code units) more distinguished from each other. That's a Good Thing(tm).

It's a good abstraction for showing the theory of Unicode. But it's not the way to go if you want efficiency. For efficiency you need to use, for each element in the string, the lowest abstraction level required to handle that element, so your algorithm needs to know about the various abstraction layers.

This is the kind of "range" I'd use to create algorithms dealing with Unicode properly:

struct UnicodeRange(U)
{
        // Look at the front of the range at three abstraction levels:
        U frontUnit() @property;                // current code unit
        dchar frontPoint() @property;           // current code point, decoded
        immutable(U)[] frontGlyph() @property;  // current glyph, as a slice of code units

        // Advance by one element at the chosen abstraction level:
        void popFrontUnit();
        void popFrontPoint();
        void popFrontGlyph();

        ...
}

Not really a range per your definition of ranges, but basically it lets you intermix working with units, code points, and glyphs. Add a way to slice at the unit level and a way to know the length at the unit level, and it's all I need to make an efficient parser, or any algorithm really.
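
To show the kind of loop I have in mind, here is a hedged sketch built on the hypothetical UnicodeRange above (countWords is my own name, and it assumes the range also exposes an empty property, elided by the "..." in the interface). Delimiters are matched at the code-unit level; decoding happens only when a non-ASCII lead unit shows up:

import std.uni : isWhite;

// Count whitespace-separated words, staying at the code-unit level for ASCII
// and decoding a code point only when necessary.
size_t countWords(UnicodeRange!char r)
{
    size_t words;
    bool inWord;
    while (!r.empty)                     // assumed: empty is among the elided members
    {
        bool white;
        if (r.frontUnit < 0x80)          // ASCII: no decoding needed
        {
            white = r.frontUnit == ' ' || r.frontUnit == '\t' || r.frontUnit == '\n';
            r.popFrontUnit();
        }
        else                             // non-ASCII: decode just this code point
        {
            white = isWhite(r.frontPoint);
            r.popFrontPoint();
        }
        if (!white && !inWord)
            ++words;
        inWord = !white;
    }
    return words;
}

Note how the loop looks at frontUnit first and only then decides whether to decode the current code point.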

The problem with .raw is that it creates a separate range for the units. This means you can't look at frontUnit, decide to pop the unit, look at the next one, decide you need to decode using frontPoint, call popFrontPoint, and then return to looking at the front unit.

Also, I'm not sure the "glyph" part of that range is required most of the time, because most of the time you don't need to decode glyphs to be glyph-aware. But it'd be nice if you wanted to count them, and having it there alongside the rest makes users aware of them.

--
Michel Fortin
[email protected]
http://michelf.com/
