On Wednesday, 29 April 2015 at 15:13:15 UTC, Jonathan M Davis wrote:
On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
This sounds like a good starting point for a transition plan.
One important thing, though, would be to do some benchmarking
with and without autodecoding, to see if it really boosts
performance in a way that would justify the transition.
Well, personally, I think that it's worth it even if the
performance is identical (and it's guaranteed to be better
without autodecoding - it's just a question of how much
better - since there's simply less work to do). Simply
operating at the code point level like we
do now is the worst of all worlds in terms of flexibility and
correctness. As long as the Unicode is normalized, operating at
the code unit level is the most efficient, and decoding is
often unnecessary for correctness, and if you need to decode,
then you really need to go up to the grapheme level in order to
be operating on the full character, meaning that operating on
code points really has the same problems as operating on code
units as far as correctness goes. So, it's less performant
without actually being correct. It just gives the illusion of
correctness.
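The three levels in play here can be demonstrated outside of D as well. This is a minimal Python sketch (the string literals are my own illustration, not from the thread) showing that a single user-perceived character can be two code points and three UTF-8 code units, and that code-point-level slicing still corrupts it:

```python
import unicodedata

# "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT:
# one grapheme, two code points, three UTF-8 code units.
s = "e\u0301"

print(len(s))                  # 2 code points
print(len(s.encode("utf-8")))  # 3 UTF-8 code units

# Slicing at the code point level can still split a grapheme,
# which is exactly the correctness problem described above:
print(s[:1])                   # "e" - the accent is lost

# Normalizing to NFC collapses it to the single code point U+00E9,
# which is why "as long as the Unicode is normalized" matters:
print(len(unicodedata.normalize("NFC", s)))  # 1
```

In D terms, the middle count (code points) is what autodecoding iterates over, and only something like byGrapheme gets the full character.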
By treating strings as ranges of code units, you don't take a
performance hit when you don't need to, and it forces you to
actually consider something like byDchar or byGrapheme if you
want to operate on full, Unicode characters. It's similar to
how operating on UTF-16 code units as if they were characters
(as Java and C# generally do) frequently gives the incorrect
impression that you're handling Unicode correctly, because you
have to work harder at coming up with characters that can't fit
in a single code unit, whereas with UTF-8, anything but ASCII
is screwed if you treat code units as code points. Treating
code points as if they were full characters like we're doing
now in Phobos with ranges just makes it that much harder to
notice that you're not handling Unicode correctly.
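The UTF-16 trap can be made concrete by counting code units. A small Python sketch (using Python's codecs to stand in for Java/C# strings, which are sequences of UTF-16 code units):

```python
# U+1F600 lies outside the BMP, so in UTF-16 (the representation
# behind Java/C# strings) it takes a surrogate pair - two code
# units for a single code point.
c = "\U0001F600"
print(len(c.encode("utf-16-le")) // 2)  # 2 UTF-16 code units
print(len(c.encode("utf-8")))           # 4 UTF-8 code units

# With UTF-8 you don't have to hunt for exotic characters:
# anything past ASCII is already multiple code units.
print(len("ä".encode("utf-8")))         # 2 code units
```

Because characters that need surrogate pairs are much rarer than non-ASCII text in general, UTF-16 code that treats code units as characters can look correct far longer than the equivalent UTF-8 code - which is the illusion being described.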
Also, treating strings as ranges of code units makes it so that
they're not so special and actually are treated like every
other type of array, which eliminates a lot of the special
casing that we're forced to do right now, and it eliminates all
of the confusion that folks keep running into when string
doesn't work with many functions, because it's not a
random-access range or doesn't have length, or because the
resulting range isn't the same type (copy would be a prime
example of a function that doesn't work with char[] when it
should). By leaving in autodecoding, we're basically leaving in
technical debt in D permanently. We'll forever have to be
explaining it to folks and forever have to be working around it
in order to achieve either performance or correctness.
What we have now isn't performant, correct, or flexible, and
we'll be forever paying for that if we don't get rid of
autodecoding.
I don't criticize Andrei in the least for coming up with it,
since if you don't take graphemes into account (and he didn't
know about them at the time), it seems like a great idea and
allows us to be correct by default and performant if we put
some effort into it, but after having seen how it's worked out,
how much code has to be special-cased, how much confusion there
is over it, and how it's not actually correct anyway, I think
that it's quite clear that autodecoding was a mistake. And at
this point, it's mainly a question of how we can get rid of it
without being too disruptive and whether we can convince Andrei
that it makes sense to make the change, since he seems to still
think that autodecoding is fine in spite of the fact that it's
neither performant nor correct.
It may be that the decision will be that it's too disruptive to
remove autodecoding, but I think that that's really a question
of whether we can find a way to do it that doesn't break tons
of code rather than whether it's worth the performance or
correctness gain.
- Jonathan M Davis
Ok, I see. Well, if we don't want to repeat C++'s mistakes, we
should fix it before it's too late. Since I'm dealing a lot with
strings (non-ASCII) and depend on Unicode (and correctness!), I
would be more than happy to test any changes to Phobos with my
programs to see if it screws up anything.