Re: Proposal for fixing dchar ranges

John Colvin Tue, 11 Mar 2014 02:04:29 -0700

On Monday, 10 March 2014 at 22:15:34 UTC, Steven Schveighoffer wrote:

On Mon, 10 Mar 2014 17:46:23 -0400, John Colvin wrote:
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:
I proposed this inside the long "major performance problem with std.array.front" thread, and I've also proposed it before, a long time ago.
But it seems to be getting no attention buried in that thread, not even negative attention :)
An idea to fix the problems I see with char[] being treated specially by phobos: introduce an actual string type, with char[] as backing, that is a dchar range and that actually dictates the rules we want. Then, make the compiler use this type for literals.
e.g.:

struct string {
  immutable(char)[] representation;
  this(immutable(char)[] data) { representation = data; }
  ... // dchar range primitives
}

Then, a char[] array is simply an array of char.
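
To make the struct above concrete, here is a minimal sketch of what the elided dchar range primitives might look like with today's Phobos. It is only an illustration under stated assumptions, not the proposed implementation: the name `String` is hypothetical (chosen so it compiles alongside the existing `string` alias), and the choice of `std.utf.decode` and `std.utf.stride` for the primitives is mine.

```d
import std.utf : decode, stride;

// Hypothetical wrapper; `String` stands in for the proposed `string` type.
struct String
{
    immutable(char)[] representation;

    this(immutable(char)[] data) { representation = data; }

    // Range primitives over code points (dchar), not code units.
    @property bool empty() { return representation.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(representation, i); // decodes the first code point
    }

    void popFront()
    {
        // stride gives the number of code units in the first code point
        representation = representation[stride(representation, 0) .. $];
    }

    @property String save() { return this; }
}

void main()
{
    import std.range.primitives : ElementType, isForwardRange;

    static assert(isForwardRange!String);
    static assert(is(ElementType!String == dchar));

    // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    auto s = String("casse\u0301");

    size_t points;
    foreach (dchar c; s) ++points;

    assert(points == 6);                   // c, a, s, s, e, U+0301
    assert(s.representation.length == 7);  // the accent takes two code units

    // No opIndex is defined, so indexing is rejected at compile time.
    static assert(!__traits(compiles, s[4]));
}
```

Because the wrapper defines no opIndex and no alias this, the compiler rejects `s[4]`, while `s.representation[4]` remains available for code that really wants code units.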

points:
1. No more issues with foreach(c; "cassé"), it iterates via dchar
2. No more issues with "cassé"[4], it is a static compiler error.
3. No more awkward ASCII manipulation using ubyte[].
4. No more phobos schizophrenia saying char[] is not an array.
5. No more special casing char[] array templates to fool the compiler.
6. Any other special rules we come up with can be dictated by the library, and not ignored by the compiler.
Note, std.algorithm.copy(string1, mutablestring) will still decode/encode, but it's more explicit. It's EXPLICITLY a dchar range. Using std.algorithm.copy(string1.representation, mutablestring.representation) will avoid the issues.
I imagine only code that is currently UTF ignorant will break, and that code is easily 'fixed' by adding the 'representation' qualifier.
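
The copy-by-representation idea can already be tried with today's Phobos, where `std.string.representation` returns the code units as ubyte. The sketch below is only an approximation of the proposal: in the proposed scheme `.representation` would be a field yielding char data, and the variable names here are illustrative.

```d
import std.algorithm : copy;
import std.string : representation;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    string src = "casse\u0301";
    auto dst = new char[](src.length);

    // Copies raw code units; nothing is decoded or re-encoded.
    auto rest = copy(src.representation, dst.representation);

    assert(rest.length == 0); // the target was filled exactly
    assert(dst == src);
}
```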
-Steve
just to check I understand this fully:

in this new scheme, what would this do?

auto s = "cassé".representation;
foreach(i, c; s) write(i, ':', c, ' ');
writeln(s);

Currently - without the .representation - I get

0:c 1:a 2:s 3:s 4:e 5:̠6:`
cassé

or, to spell it out a bit more:
0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81
cassé
The plan is for foreach on s to iterate by char, and foreach on "cassé" to iterate by dchar.
What this means is the accent will be iterated separately from the e, and likely gets put onto the colon after 5. However, the half code units that have no meaning anywhere (xCC and x81) would not be iterated.
In your above code, using .representation would be equivalent to what it is now without .representation (i.e. over char), and without .representation would be equivalent to this on today's compiler (except faster):
foreach(i, dchar c; s)
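
The two iteration modes can be compared on today's compiler. This is a small sketch assuming the accent in "cassé" is a combining character (U+0301) following a plain 'e', which is what the xCC x81 bytes in the output above suggest; the counter names are illustrative.

```d
import std.string : representation;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    string s = "casse\u0301";

    // Default foreach on a string visits UTF-8 code units (char).
    size_t units;
    foreach (i, char c; s) ++units;

    // Asking for dchar makes the compiler decode code points instead;
    // the index is the code unit offset where each code point starts.
    size_t points;
    size_t lastIndex;
    foreach (i, dchar c; s)
    {
        ++points;
        lastIndex = i;
    }

    assert(units == 7);     // indices 0 through 6, as in the output above
    assert(points == 6);    // the accent is one code point, separate from 'e'
    assert(lastIndex == 5); // the accent starts at code unit 5

    // The accent's two code units, which are never seen separately by dchar.
    immutable(ubyte)[] raw = s.representation;
    assert(raw[5] == 0xCC && raw[6] == 0x81);
}
```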

-Steve


Awesome, let's do this :)
