On 12/31/11 8:17 CST, Michel Fortin wrote:
On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu
<[email protected]> said:

On 12/31/11 2:04 AM, Walter Bright wrote:

We're chasing phantoms here, and I worry a lot about over-engineering
trivia.

I disagree. I understand that it seems trivial to you, but that doesn't
make your opinion any less wrong, not to mention provincial in its
insistence that it's applicable beyond a small team of experts. Again: I
know no other - I literally mean not one - person who writes string
code like you do (and myself, after learning it from you); the current
system is adequate; the proposed system is perfect - save for breaking
backwards compatibility, which makes the discussion moot. But its being
moot does not oblige me to concede the point. I am right.

Perfect?

Sorry, I exaggerated. I meant "a net improvement while keeping simplicity".

At one time Java and other frameworks started treating UTF-16 code
units as if they were characters, and that turned out badly for them.
Now we know that not even code points should be considered characters,
because a character can span multiple code points. You might call it
perfect, but to do so you have to make two assumptions:

1. treating code points as characters is good enough, and
2. the performance penalty of decoding everything is tolerable

I'm not sure how you concluded I made those assumptions.

Ranges of code points might be perfect for you, but they are a tradeoff
that won't work in every situation.

Ranges can be defined to span logical glyphs that comprise multiple code points.
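A rough sketch of what such a range could look like - deliberately naive
(a base code point plus any trailing combining marks), nothing like full
UAX #29 segmentation, and not an existing Phobos range:

    import std.uni : isMark;   // combining marks (general category M)
    import std.utf : decode, stride;

    // Naive illustration only: treat a base code point plus any
    // immediately following combining marks as one logical glyph.
    struct ByGlyph
    {
        string s;

        @property bool empty() { return s.length == 0; }

        @property string front()
        {
            size_t i = stride(s, 0);       // code units of the base code point
            while (i < s.length)
            {
                size_t j = i;
                if (!isMark(decode(s, j))) // stop at the next base character
                    break;
                i = j;                     // include the combining mark
            }
            return s[0 .. i];
        }

        void popFront() { s = s[front.length .. $]; }
    }

    unittest
    {
        import std.algorithm : equal;
        // "e" followed by a combining acute accent is one logical glyph.
        assert(equal(ByGlyph("e\u0301tat"), ["e\u0301", "t", "a", "t"]));
    }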

The whole concept of generic algorithms working on strings efficiently
doesn't work.

Apparently std.algorithm does.
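For example (assuming a reasonably recent Phobos):

    import std.algorithm : find, startsWith;
    import std.range : walkLength;

    void main()
    {
        string s = "Schrödinger";
        assert(s.length == 12);             // code units (UTF-8 bytes)
        assert(walkLength(s) == 11);        // code points, as the range sees it
        assert(find(s, 'ö') == "ödinger");  // comparison is on code points
        assert(startsWith(s, "Schr"));
    }

The algorithms see a range of dchar; the UTF-8 representation stays
untouched underneath.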

Applying generic algorithms to strings by treating them as
a range of code points is both wasteful (because it forces you to decode
everything) and incomplete (because of multi-code-point characters), and
it should be avoided.

An algorithm that gains by accessing the encoding can do so - and indeed some do. Spanning multi-code-point characters is a matter of defining the range appropriately; it doesn't break the abstraction.
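To illustrate the first point with a sketch (not Phobos source):
substring search can compare code units directly, because UTF-8 is
self-synchronizing - a whole-needle match can only start on a code point
boundary, so no decoding is needed for correctness.

    // Sketch: substring search that never decodes. A valid UTF-8 needle
    // never starts with a continuation byte, so a code-unit match can
    // only begin at a code point boundary of the haystack.
    string findSub(string haystack, string needle)
    {
        if (needle.length > haystack.length)
            return haystack[$ .. $];
        foreach (i; 0 .. haystack.length - needle.length + 1)
            if (haystack[i .. i + needle.length] == needle)
                return haystack[i .. $];
        return haystack[$ .. $];             // not found: empty tail
    }

    unittest
    {
        assert(findSub("le cœur a ses raisons", "cœur") == "cœur a ses raisons");
        assert(findSub("abc", "z").length == 0);
    }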

Algorithms working on Unicode strings should be
designed with Unicode in mind. And the best way to design efficient
Unicode algorithms is to access the array of code units directly, read
each character at the level of abstraction required, and know what
you're doing.

As I said, that's happening already.
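The shape of that kind of code is roughly this (a sketch, not actual
library source):

    // Decode only when it is actually necessary. An ASCII needle can be
    // counted over code units, because ASCII bytes never occur inside a
    // multi-byte UTF-8 sequence.
    size_t countChar(string s, dchar needle)
    {
        size_t n = 0;
        if (needle < 0x80)
        {
            foreach (char c; s)       // walk code units, no decoding
                if (c == needle) ++n;
        }
        else
        {
            foreach (dchar c; s)      // decode, since we must
                if (c == needle) ++n;
        }
        return n;
    }

    unittest
    {
        assert(countChar("résumé", 'r') == 1);
        assert(countChar("résumé", 'é') == 2);
    }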

I'm not against making strings more opaque to encourage people to use
the Unicode algorithms from the standard library instead of rolling
their own.

I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical characters vs. array of code units) more clearly distinguished from each other. That's a Good Thing(tm).
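Concretely, the same string supports both views today; the discussion is
about making the choice between them a conscious one:

    import std.stdio : writeln;

    void main()
    {
        string s = "état";

        // The encoded sequence of logical characters: foreach with dchar
        // decodes on the fly -- four code points.
        foreach (dchar c; s)
            writeln(c);

        // The array of code units: foreach with char walks the
        // representation -- five bytes, since 'é' takes two code units.
        foreach (char c; s)
            writeln(cast(ubyte) c);

        assert(s.length == 5);        // .length counts code units
    }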

But I doubt the current approach of using .raw alone will
prevent many from doing dumb things.

I agree. But I think it would be a sensible improvement over the status quo, in which you get to do a ton of dumb things with much more ease.
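Here is roughly how the proposed .raw might read in use. Since .raw is
only a proposal at this point, a free function stands in for it below,
and the names are illustrative only:

    // Stand-in for the proposed .raw property: expose the code units
    // without any decoding.
    immutable(ubyte)[] raw(string s) { return cast(immutable(ubyte)[]) s; }

    // Strip a leading UTF-8 BOM by inspecting code units, not code points.
    string skipBom(string s)
    {
        auto r = s.raw;
        if (r.length >= 3 && r[0] == 0xEF && r[1] == 0xBB && r[2] == 0xBF)
            return s[3 .. $];
        return s;
    }

    unittest
    {
        assert(skipBom("\uFEFFhello") == "hello");
        assert(skipBom("hello") == "hello");
    }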

On the other hand, I'm sure it'll make it more complicated to write
Unicode algorithms, because accessing and especially slicing the raw
content of char[] will become tiresome. I'm not convinced it's a net win.

Many Unicode algorithms don't need slicing. Those that do carefully mix manipulation of code points with manipulation of the representation. It is a net win that the two kinds of manipulation are explicitly distinguished.
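A small sketch of what I mean by mixing the two:

    import std.uni : isWhite;
    import std.utf : decode;

    // Decoding drives the decision (code points), while the result is
    // produced by slicing at code-unit indices -- the two manipulations
    // are mixed, but each is used deliberately.
    string firstWord(string s)
    {
        size_t i = 0;
        while (i < s.length)
        {
            size_t next = i;
            if (isWhite(decode(s, next)))   // code-point-level test
                break;
            i = next;                       // code-unit-level position
        }
        return s[0 .. i];                   // slice by code units
    }

    unittest
    {
        assert(firstWord("déjà vu") == "déjà");
        assert(firstWord("hello") == "hello");
    }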

As for Walter being the only one who codes by looking at the code units
directly, that's not true. All my parser code looks at code units
directly and only decodes to code points where necessary (just look at
the XML parsing code I posted a while ago to get an idea of how it can
apply to ranges). And I don't think it's because I've seen Walter's code
before; I think it's because I know how Unicode works and I want to
make my parser efficient. I did the same for a parser in C++ a while
ago. I can hardly imagine I'm the only one (besides Walter and you). I
think this is how efficient algorithms dealing with Unicode should be
written.

Congratulations.


Andrei
