29-May-2014 02:10, Jonathan M Davis via Digitalmars-d-announce wrote:
On Tue, 27 May 2014 06:42:41 -1000
Andrei Alexandrescu via Digitalmars-d-announce
<[email protected]> wrote:
>
> http://www.reddit.com/r/programming/comments/26m8hy/scott_meyers_dconf_2014_keynote_the_last_thing_d/
>
> https://news.ycombinator.com/newest (search that page, if not found
> click "More" and search again)
>
> https://www.facebook.com/dlang.org/posts/855022447844771
>
> https://twitter.com/D_Programming/status/471330026168651777
Fortunately, for the most part, I think that we've avoided the types of
inconsistencies that Scott describes for C++, but we do definitely have some
of our own. The ones that come to mind at the moment are:
I won't comment on the other points, but the Unicode one caught my eye.
6. The situation with ranges and string is kind of ugly, with them being
treated as ranges of code points. I don't know what the correct solution to
this is, since treating them as ranges of code units promotes efficiency but
makes code more error-prone, whereas treating them as ranges of graphemes
would just cost too much.
This is a gross oversimplification of the matter. There is no "more
correct" or "less correct". Each algorithm requires its own level of
consideration. If there is a simple truism about Unicode, it is:
Never operate on a single character; operate on slices of text instead.
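A minimal sketch of why per-character operations go wrong, assuming a
Phobos that ships std.uni.byGrapheme and std.uni.byCodePoint. The helper
names reverseByCodePoint and reverseByGrapheme are made up for this
illustration:

```d
import std.array : array;
import std.conv : text;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;

/// Reverses one code point at a time, which is what iterating a
/// string as a range of dchar gives you.
string reverseByCodePoint(string s)
{
    return s.retro.text;
}

/// Reverses one grapheme cluster at a time, so combining marks stay
/// attached to their base character.
string reverseByGrapheme(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.text;
}

void main()
{
    // "noël" spelled as 'e' followed by U+0308 COMBINING DIAERESIS
    string s = "noe\u0308l";

    // The diaeresis ends up on the 'l' instead of the 'e'.
    assert(reverseByCodePoint(s) == "l\u0308eon");

    // The cluster "e + U+0308" is moved as one unit.
    assert(reverseByGrapheme(s) == "le\u0308on");
}
```

Both functions take and return whole strings; only the unit they shuffle
internally differs, and that is the part each algorithm has to choose.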
To sum up the situation:
The Unicode standard defines *all* of its algorithms in terms of code
points, and some use grapheme clusters. It never says anything about code
units beyond the mapping of code units --> code points. So whether or not
you should actually decode is up to the implementation.
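To make the three levels concrete, here is a small sketch that counts the
same text at each of them. The function names are hypothetical; the only
library pieces assumed are std.range.walkLength and std.uni.byGrapheme:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// UTF-8 code units: the raw array length, no decoding involved.
size_t codeUnitCount(string s)
{
    return s.length;
}

/// Code points: walking a string as a range decodes it to dchar.
size_t codePointCount(string s)
{
    return s.walkLength;
}

/// Grapheme clusters: user-perceived characters.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // "noël" spelled as 'e' followed by U+0308 COMBINING DIAERESIS
    string s = "noe\u0308l";

    assert(codeUnitCount(s) == 6);  // U+0308 takes two UTF-8 code units
    assert(codePointCount(s) == 5); // n, o, e, U+0308, l
    assert(graphemeCount(s) == 4);  // n, o, ë, l
}
```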
Ranges of code points is _mostly_ correct but still incorrect and _more_
efficient than graphemes but still quite a bit less efficient than code
units. So, it's kind of like it's got the best and worst of both worlds.
The current situation causes inconsistencies with everything else (forcing
us to use isNarrowString all over the place) and definitely requires
frequent explaining, but it does prevent some classes of problems.
So, I don't know. I used to be in favor of the current situation, but at
this point, if we could change it, I think that I'd argue in favor of just
treating them as ranges of code units and then have wrappers for ranges of
code points or graphemes.
Agreed. The simple dream of automatically decoding UTF and staying
"Unicode correct" is a failure.
It seems like the current situation promotes either using ubyte[] (if you
care about efficiency) or the new grapheme facilities in std.uni if you
care about correctness, whereas just using strings as ranges of dchar is
probably a bad idea unless you just don't want to deal with any of the
Unicode stuff, don't care all that much about efficiency, and are willing
to have bugs in the areas where operating at the code point level is
incorrect.
The worst thing about the current situation is that any generic code that
works on UTF ranges has to jump through an unbelievable number of hoops to
undo the "string has no length" madness.
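A rough sketch of what those hoops look like from generic code. It assumes
a Phobos with std.utf.byCodeUnit, which may not be in the release you are
using, and storedLength is a made-up helper, not a library function:

```d
import std.range : ElementType, hasLength, isRandomAccessRange, walkLength;
import std.traits : isNarrowString;
import std.utf : byCodeUnit;

// What the range traits report for a plain string:
static assert(isNarrowString!string);
static assert(!hasLength!string);               // even though s.length compiles
static assert(!isRandomAccessRange!string);     // even though s[i] compiles
static assert(is(ElementType!string == dchar)); // elements are decoded

// A code unit wrapper gets its length back.
static assert(hasLength!(typeof("".byCodeUnit)));

/// Hypothetical helper: number of stored units in `r`. The first branch
/// is the special case that generic code keeps having to repeat.
size_t storedLength(R)(R r)
{
    static if (isNarrowString!R)
        return r.length;     // code units, O(1); hasLength says no
    else static if (hasLength!R)
        return r.length;
    else
        return r.walkLength; // O(n) fallback for everything else
}

void main()
{
    string s = "noe\u0308l";
    assert(storedLength(s) == 6);            // narrow-string branch
    assert(storedLength([1, 2, 3]) == 3);    // ordinary array
    assert(storedLength(s.byCodeUnit) == 6); // wrapper needs no special case
}
```

Without the isNarrowString branch the function would still compile for
string, but it would take the walkLength path and return the number of
code points (5 here) in O(n).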
I think what we should do is define a StringRange or some such, which
would at least make the current special case of string more generic.
--
Dmitry Olshansky