On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
Would it be much work to have example code or even an
experimental module that gets rid of auto-decoding, so we
could see what would be affected in general and how actual
code we have would be affected by it?
The topic keeps coming up again and again, and while I'm in
favor of anything that enhances performance, I'm afraid of
having to refactor large chunks of my code. This fear may be
unfounded, but I would need some examples to visualize
the problem.
Honestly, most code won't care. If we just switched out all of
the auto-decoding right now, pretty much anything using only
ASCII would just work, and most anything that's trying to
manipulate ASCII characters in a Unicode string will just work,
whereas code that's specifically manipulating Unicode
characters might have problems (e.g. comparing front with a
dchar will no longer have the same result, since front would
just be the first code unit rather than necessarily the first
code point). Since most Phobos range-based functions which
operate on strings are special-cased on strings already, many
of them would continue to just work (e.g. find returns the same
range type as what's passed to it even if it's given a string,
so it might just work with the change, or it might need to be
tweaked slightly), and those that wouldn't would generally either
need to call encode on an argument to make it match the string
type in the cases string types mix (e.g. "foo".find("fo"d)
would need to call encode on "fo"d to make it a string for
comparison), or the caller would need to use std.utf.byDchar or
std.uni.byGrapheme to operate on code points or graphemes
rather than code units.
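
To make the front example concrete, here is a minimal sketch against current Phobos. It assumes that std.utf.byCodeUnit is a fair stand-in for what front would return on a string without auto-decoding; firstCodePoint and firstCodeUnit are hypothetical helper names, not Phobos functions.

```d
import std.range.primitives : front;
import std.utf : byCodeUnit, byDchar;

// First element as today's auto-decoding front sees it: a code point.
dchar firstCodePoint(string s) { return s.front; }

// First element as a non-decoding front would presumably see it:
// a single UTF-8 code unit (assumption: byCodeUnit models that behavior).
char firstCodeUnit(string s) { return s.byCodeUnit.front; }

void main()
{
    // U+00E9 (e with acute accent) is encoded as 0xC3 0xA9 in UTF-8.
    string s = "\u00E9t\u00E9";

    assert(firstCodePoint(s) == '\u00E9'); // decoded code point
    assert(firstCodeUnit(s) == 0xC3);      // first byte of the encoding
    assert(firstCodeUnit(s) != '\u00E9');  // comparing with a dchar now fails

    // Explicit decoding restores the old result.
    assert(s.byDchar.front == '\u00E9');

    // For ASCII text both views agree.
    assert(firstCodePoint("abc") == firstCodeUnit("abc"));
}
```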
The two biggest places in Phobos that would potentially have
problems are functions that special-cased strings but still
used front and those which have to return a new range type.
e.g. filter would be a good example, because it's forced to
return a new range type. Right now, it would filter on dchars,
but with the change, it would filter on the code unit type
(most typically char). If you're filtering on ASCII characters,
it wouldn't matter aside from the fact that the resulting range
would have an element type of char rather than dchar, but if
you're filtering on Unicode characters, it wouldn't work
anymore. For situations like that, you'd be forced to use
std.utf.byDchar or std.uni.byGrapheme. However, since most
string code tends to operate on substrings rather than
characters, I don't know how common it even is to use a
function like filter on a string (as opposed to a range of
strings). Such code might actually be fairly rare.
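
The filter case can be sketched as follows, again using byCodeUnit on current Phobos as an assumed approximation of a string without auto-decoding. The sample string and the characters being filtered are only illustrative.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.range.primitives : ElementType;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // "naive" with U+00EF (i with diaeresis), which is 0xC3 0xAF in UTF-8.
    string s = "na\u00EFve";

    // Today: filter sees dchars, so a non-ASCII character can be removed.
    auto decoded = s.filter!(c => c != '\u00EF');
    static assert(is(ElementType!(typeof(decoded)) == dchar));
    assert(decoded.equal("nave"));

    // On code units, filtering an ASCII character still works;
    // only the element type changes from dchar to char.
    auto noV = s.byCodeUnit.filter!(c => c != 'v');
    static assert(is(Unqual!(ElementType!(typeof(noV))) == char));
    assert(noV.equal("na\u00EFe".byCodeUnit));

    // On code units, the same non-ASCII predicate silently matches nothing,
    // because no single code unit equals U+00EF.
    auto unchanged = s.byCodeUnit.filter!(c => c != '\u00EF');
    assert(unchanged.equal(s.byCodeUnit));

    // Decoding explicitly restores the old behavior.
    assert(s.byDchar.filter!(c => c != '\u00EF').equal("nave"));
}
```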
So, there _are_ a few functions which would stop working the same
way, potentially silently, if we just made it so that front
didn't autodecode anymore. However, in general, because Phobos
almost always special-cases strings, calls to Phobos functions
probably wouldn't need to change in most cases, and when they
do, a call to byDchar would restore the old behavior. But of
course, we'd want to do the transition in a way that didn't
result in silent behavioral changes that would break code, even
though in most cases, it wouldn't matter, because most code
will be operating on ASCII strings even if the strings
themselves contain Unicode - e.g.
unicodeString.find(asciiString) is far more common than
unicodeString.find(otherUnicodeString).
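
As a rough illustration of the common case, searching a string containing Unicode for an ASCII substring gives the same answer whether or not anything is decoded. This sketch assumes byCodeUnit models the non-decoding behavior; the strings are made up for the example.

```d
import std.algorithm.comparison : equal;
import std.algorithm.searching : find;
import std.utf : byCodeUnit;

void main()
{
    // Haystack contains a non-ASCII character (U+00E9), needle is plain ASCII.
    string haystack = "caf\u00E9 au lait";

    // Today's call on strings.
    assert(haystack.find("au") == "au lait");

    // The same search on raw code units finds the same position, because
    // an ASCII byte never occurs inside a multi-byte UTF-8 sequence.
    auto onUnits = haystack.byCodeUnit.find("au".byCodeUnit);
    assert(onUnits.equal("au lait".byCodeUnit));
}
```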
I suspect that the code that's at the greatest risk is code
that checks for is(Unqual!(ElementType!Range) == dchar) to
operate on strings and wrapper ranges around strings, since it
would then only match the cases where byDchar had been used. In
general though, the code that's going to run into the most
trouble is user code that contains range-based functions
similar to what you might find in Phobos rather than code
that's simply using the Phobos functions like startsWith and
find - i.e. if you're writing range-based code that worries
about doing stuff like special-casing strings or which
specifically needs to operate on code points, then you're going
to have to make changes, whereas to a great extent, if all
you're doing is passing strings to Phobos functions, your code
will tend to just work.
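
A small sketch of why that kind of check is fragile. yieldsCodePoints is a hypothetical user-code trait built around the check quoted above; byCodeUnit is used here as an assumed stand-in for a string range without auto-decoding.

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

// Hypothetical trait of the kind user code might use to detect
// "strings and wrappers around strings".
enum bool yieldsCodePoints(R) =
    isInputRange!R && is(Unqual!(ElementType!R) == dchar);

void main()
{
    // Today a plain string passes, because front auto-decodes to dchar.
    static assert(yieldsCodePoints!string);

    // A range of code units does not pass, so without auto-decoding
    // a plain string would presumably stop matching as well.
    static assert(!yieldsCodePoints!(typeof("abc".byCodeUnit)));

    // Only ranges that were explicitly decoded would still match.
    static assert(yieldsCodePoints!(typeof("abc".byDchar)));
}
```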
To actually see what the impact would be, we'd have to just
change Phobos, I think, and then see what the impact was on
user code. It could be surprising how much or how little it
affects things, though in most cases, I expect that it'll mean
that code will just work. And if we really wanted to do that,
we could create a version flag that turned off autodecoding and
version the changes in Phobos appropriately to see what we got.
In many cases, if we simply made sure that Phobos functions
which special-cased strings didn't use front directly but
instead didn't care whether they were operating on ranges of
char, wchar, or dchar, then we wouldn't even need to version
anything (e.g. find could easily be made to work that way if it
doesn't already), but some functions (like filter) would need
to be versioned differently.
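
What such versioning might look like, as a sketch only: NoAutodecode is a hypothetical version identifier (it would be passed as -version=NoAutodecode), and firstElement is a made-up function standing in for a Phobos function that has to expose elements to the caller.

```d
import std.range.primitives : front;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Hypothetical identifier; nothing in Phobos defines it.
version (NoAutodecode)
{
    // Experimental behavior: expose code units.
    auto firstElement(string s) { return s.byCodeUnit.front; }
}
else
{
    // Current behavior: expose decoded code points.
    auto firstElement(string s) { return s.front; }
}

void main()
{
    alias E = Unqual!(typeof(firstElement("abc")));

    version (NoAutodecode)
        static assert(is(E == char));
    else
        static assert(is(E == dchar));

    // ASCII input gives the same value under either version.
    assert(firstElement("abc") == 'a');
}
```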
So, maybe what we need to do to start is to just go through
Phobos and make as many functions as possible not care about
whether they're dealing with strings as ranges of char, wchar,
or dchar. And at least then, we'd minimize how much code would
have to be versioned differently if we were to test out getting
rid of autodecoding with versioning.
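
For illustration, a function written so that it doesn't care about the character width might look like this. countSpaces is a hypothetical helper, not a Phobos function; it only ever looks at code units, so it would behave the same with or without auto-decoding.

```d
import std.traits : isSomeString;
import std.utf : byCodeUnit;

// Hypothetical helper: counts ASCII spaces in any string type.
// An ASCII code unit never appears inside a multi-unit UTF-8 or UTF-16
// sequence, so no decoding is needed for this to be correct.
size_t countSpaces(S)(S s)
if (isSomeString!S)
{
    size_t n = 0;
    foreach (unit; s.byCodeUnit) // char, wchar, or dchar code units
    {
        if (unit == ' ')
            ++n;
    }
    return n;
}

void main()
{
    assert(countSpaces("a \u00E9 b") == 2);  // string
    assert(countSpaces("a \u00E9 b"w) == 2); // wstring
    assert(countSpaces("a \u00E9 b"d) == 2); // dstring
}
```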
- Jonathan M Davis
This sounds like a good starting point for a transition plan. One
important thing, though, would be to do some benchmarking with
and without autodecoding, to see if it really boosts performance
in a way that would justify the transition.