On 10/24/2012 01:07 PM, Jonathan M Davis wrote:
On Wednesday, October 24, 2012 12:42:59 mist wrote:
On Tuesday, 23 October 2012 at 17:36:53 UTC, Simen Kjaeraas wrote:
On 2012-10-23, 19:21, mist wrote:
Hm, and all Phobos functions should operate on narrow strings
as if they were not random-accessible? I am thinking about
something like commonPrefix from std.algorithm, which operates
on code points for strings.
Preferably, yes. If there are performance (or other) benefits
from operating on code units, and it's just as safe, then
operating on code units is ok.
Probably I don't understand it fully, but the D approach has always
been "safe first, fast with some additional syntax". Back to
commonPrefix and take:
==========================
import std.stdio, std.traits, std.algorithm, std.range;

void main()
{
    auto beer = "Пиво";
    auto r1 = beer.take(2);
    auto pony = "Пони";
    auto r2 = commonPrefix(beer, pony);
    writeln(r1);
    writeln(r2);
}
==========================
The first one returns 2 symbols. The second one returns 3 code units
and a broken string. There is no way such a by-default inconsistency
in the standard library is understandable to a newbie.
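To make the code unit versus code point gap in the example above concrete, here is a minimal sketch. It does not try to reproduce what commonPrefix returned for narrow strings at the time of the post. Instead it compares the raw UTF-8 bytes through std.string.representation, which is my own stand-in for a by-code-unit comparison. The helper isValidUtf8 is a hypothetical name introduced only for this illustration.

```d
import std.algorithm : commonPrefix, equal;
import std.range : take, walkLength;
import std.string : representation;
import std.utf : validate, UTFException;

// Hypothetical helper (not part of Phobos): true if s is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string beer = "Пиво";
    string pony = "Пони";

    // Each of these Cyrillic letters takes two UTF-8 code units.
    assert(beer.length == 8);      // code units
    assert(beer.walkLength == 4);  // code points

    // take iterates by code point, so it yields two whole letters.
    assert(beer.take(2).equal("Пи"));

    // Comparing the raw bytes instead: the shared prefix is 3 code units,
    // because the second letters differ only in their trailing byte.
    auto units = commonPrefix(beer.representation, pony.representation);
    assert(units.length == 3);

    // Slicing the string at that length cuts a letter in half.
    assert(!isValidUtf8(beer[0 .. 3]));
    assert(isValidUtf8(beer[0 .. 2]));
}
```

The two results disagree because one operation counts decoded characters and the other counts array elements of the same string.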
We don't really have much choice here. As long as strings are arrays of code
units, it wouldn't work to treat them as ranges of their elements, because
that would be a complete disaster for unicode. You'd be operating on code
units rather than code points, which is almost always wrong.
There are plenty of cases where it makes no difference, or iterating by
code point is harmful, or just as incorrect.
str.filter!(a=>a!='x'); // works for all str iterated by
// code point or by code unit
string x = str.filter!(a=>a!='x').array;// only works in the latter case
dstring s = "ÅA";
dstring g = s.filter!(a=>a!='A').array;
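A self-contained sketch of the two snippets above, under stated assumptions. The helpers dropX and dropA are hypothetical names of mine. The second case assumes the "Å" is spelled in decomposed form ('A' followed by U+030A COMBINING RING ABOVE), which the escape sequence makes explicit. Whether the original literal was decomposed cannot be told from the text.

```d
import std.algorithm : filter;
import std.array : array;
import std.conv : to;

// Hypothetical helper: remove every 'x' while iterating by code point.
string dropX(string str)
{
    // filter sees a string as a range of dchar, so array collects UTF-32
    // data, which cannot be assigned to a string without re-encoding.
    auto decoded = str.filter!(a => a != 'x').array;
    static assert(is(typeof(decoded) : const(dchar)[]));
    static assert(!is(typeof(decoded) : string));
    return decoded.to!string;  // explicit transcoding back to UTF-8
}

// Hypothetical helper: remove every plain 'A' code point from UTF-32 text.
dstring dropA(dstring s)
{
    return s.filter!(a => a != 'A').array.to!dstring;
}

void main()
{
    assert(dropX("Пxиво") == "Пиво");

    // Assumption: decomposed spelling, i.e. 'A' + U+030A, then a plain 'A'.
    dstring s = "A\u030AA"d;
    dstring g = dropA(s);

    // Both 'A' code points are gone, including the base letter of the
    // first character, so only an orphaned combining mark is left.
    assert(g == "\u030A"d);
}
```

In the second case iterating by code point gives no more protection than iterating by code unit would, since the unit that matters to the user is the combined character.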
Pretty much the
only way to really solve the problem as long as strings are arrays with all of
the normal array operations is for the std.range traits (hasLength,
hasSlicing, etc.) and the range functions for arrays in std.array (e.g. front,
popFront, etc.) to treat strings as ranges of code points (dchar), which is
what they do. The result _is_ confusing, but as long as strings are arrays of
code units like they are now, to do anything else would result in incorrect
behavior.
It would result in by-code-unit behavior.
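For reference, the behavior of the std.range traits and the array range primitives described in the quoted paragraph can be checked with a few assertions. This is only a sketch of how narrow strings are presented to range-based code, not a statement about how things should be.

```d
import std.range.primitives;
import std.traits : isNarrowString;

// Narrow strings are ranges of dchar without length, slicing or
// random access, as far as the range traits are concerned.
static assert(isNarrowString!string && !isNarrowString!dstring);
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!wstring == dchar));
static assert(!hasLength!string);
static assert(!hasSlicing!string);
static assert(!isRandomAccessRange!string);
static assert(isBidirectionalRange!string);

// UTF-32 strings keep the full array capabilities.
static assert(hasLength!dstring);
static assert(isRandomAccessRange!dstring);

void main()
{
    string s = "Пиво";

    // The range primitive decodes a whole code point ...
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 'П');

    // ... while plain indexing still returns a single code unit.
    assert(s[0] == 0xD0);

    // popFront drops both code units of the first letter.
    s.popFront();
    assert(s.length == 6);
    assert(s.front == 'и');
}
```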
There just isn't a good solution given what strings currently are in
the language itself.
Andrei's suggestion would work if Walter could be talked into it, but that
doesn't look like it's going to happen. And making it so that strings are
structs which hold arrays of code units could work, but without language
support, it's likely to have major issues. String literals would have to
become the struct type, which could cause issues with calling C functions, and
the code breakage would be _way_ larger than with Andrei's suggestion, since
arrays of code units would no longer be strings at all.
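A rough sketch of what a library-only struct holding an array of code units might look like, to show where the friction mentioned above comes from. CodePointString and its members are hypothetical names. This is not Andrei's suggestion and not an actual proposal from the thread, only an illustration under the assumption that the struct exposes its contents as a forward range of dchar.

```d
import std.range.primitives : ElementType, isForwardRange, walkLength;
import std.utf : decode, stride;

// Hypothetical wrapper: stores UTF-8 code units, iterates by code point.
struct CodePointString
{
    string units;  // the underlying array of code units

    @property bool empty() { return units.length == 0; }

    @property dchar front()
    {
        size_t index = 0;
        return decode(units, index);  // decode the first code point
    }

    void popFront()
    {
        // Skip all code units that make up the first code point.
        units = units[stride(units, 0) .. $];
    }

    @property CodePointString save() { return this; }
}

static assert(isForwardRange!CodePointString);
static assert(is(ElementType!CodePointString == dchar));

void main()
{
    // Without language support a literal has to be wrapped explicitly.
    auto s = CodePointString("Пиво");
    assert(s.walkLength == 4);    // code points
    assert(s.units.length == 8);  // code units

    // Code that expects a plain array (a C binding, for example) has to
    // reach through to the units member.
    string raw = s.units;
    assert(raw[0] == 0xD0);
}
```

Every existing function that takes a string would need either an overload or an explicit unwrap like the last lines, which gives a sense of the breakage such a change would cause.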
> ...
You realize that the proposed solution is that arrays of code units
would no longer be arrays of code units?