On 5/26/2016 9:00 AM, Andrei Alexandrescu wrote:
My thesis: the D1 design decision to represent strings as char[] was disastrous
and probably one of the largest weaknesses of D1. The decision in D2 to use
immutable(char)[] for strings is a vast improvement but still has a number of
issues.

The mutable vs. immutable distinction has nothing to do with autodecoding.


On 05/12/2016 04:15 PM, Walter Bright wrote:
On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
2. Every time one wants an algorithm to work with both strings and
ranges, you wind up special casing the strings to defeat the
autodecoding, or to decode the ranges. Having to constantly special case
it makes for more special cases when plugging together components. These
issues often escape detection when unittesting because it is convenient
to unittest only with arrays.

This is a consequence of 1. It is at least partially fixable.

It's a consequence of autodecoding, not arrays.
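To make the special-casing concrete, roughly the shape such code ends up taking (countElements and countImpl are hypothetical names; isNarrowString and byCodeUnit are the Phobos facilities commonly used for this):

    import std.stdio : writeln;
    import std.traits : isNarrowString;
    import std.utf : byCodeUnit;

    size_t countElements(R)(R r)
    {
        static if (isNarrowString!R)
            return countImpl(r.byCodeUnit); // special case: defeat autodecoding
        else
            return countImpl(r);            // every other range goes straight through
    }

    size_t countImpl(R)(R r)
    {
        size_t n;
        foreach (e; r) ++n;
        return n;
    }

    void main()
    {
        writeln(countElements("héllo"));   // 6 code units
        writeln(countElements([1, 2, 3])); // 3 elements
    }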


4. Autodecoding is slow and has no place in high speed string processing.
I would agree only with the amendment "...if used naively", which is important.
Knowledge of how autodecoding works is a prerequisite for writing fast string
code in D.

Having written high speed string processing code in D that also deals with Unicode (i.e. Warp), I found that the only knowledge of autodecoding needed was how to keep it from happening. Autodecoding made the code slower than necessary in every case it was used. I found no place in Warp where autodecoding was desirable.
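For reference, a minimal sketch of the usual ways to keep autodecoding from happening (byCodeUnit and .representation are the standard Phobos escape hatches):

    import std.string : representation;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "hello wörld";

        foreach (char c; s) { }           // explicit char loop variable: no decoding
        foreach (c; s.byCodeUnit) { }     // code-unit range: no decoding, still composable
        foreach (b; s.representation) { } // immutable(ubyte)[]: no decoding, but no longer "text"
    }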


Also, little code should deal with one code unit or code point at a
time; instead, it should use standard library algorithms for searching, matching
etc.

That doesn't work so well. There always seems to be a need for custom string processing. Worse, when pipelining strings, autodecoding changes the element type to dchar, which then needs to be re-encoded into the result.
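Roughly what that type change looks like (the identity lambda is just for illustration):

    import std.algorithm : map;
    import std.array : array;
    import std.conv : to;

    void main()
    {
        string s = "abc";

        auto piped = s.map!(c => c);          // autodecoding: the element type is dchar
        static assert(is(typeof(piped.front) == dchar));

        auto d = s.map!(c => c).array;        // dchar[], four bytes per element
        static assert(is(typeof(d) == dchar[]));

        string t = s.map!(c => c).to!string;  // re-encoded back into UTF-8
    }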

The std.string algorithms I wrote all work much better (i.e. faster) without autodecoding, while maintaining proper Unicode support. In other words, autodecoding did not benefit the algorithms at all, and if the user is to use standard algorithms instead of custom ones, then autodecoding is not necessary.


When needed, iterating every code unit is trivially done through indexing.

This implies replacing pipelining with loops, and also falls apart if indexing is redone to index by code points.
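A rough contrast between the two styles (the function names are just for illustration); byCodeUnit from std.utf keeps the code-unit view composable without falling back to an index loop:

    import std.algorithm : count;
    import std.utf : byCodeUnit;

    size_t countSpacesByIndex(string s)
    {
        size_t n;
        foreach (i; 0 .. s.length)
            if (s[i] == ' ') ++n;       // indexing: code units, no decoding, but a loop
        return n;
    }

    size_t countSpacesPipelined(string s)
    {
        return s.byCodeUnit.count(' '); // still no decoding, and it stays a range
    }

    void main()
    {
        assert(countSpacesByIndex("a b c") == 2);
        assert(countSpacesPipelined("a b c") == 2);
    }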


Also allow me to point out that much of the slowdown can be addressed tactically.
The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore
easily speculated. We can and we should arrange code to minimize impact.

I.e. special case the code to avoid autodecoding.

The trouble is that the low level code cannot avoid autodecoding, as it happens before the low level code gets it. This is conceptually backwards, and winds up requiring every algorithm to special case strings, even when completely unnecessary. (The 'copy' algorithm is an example of utterly unnecessary decoding.)

When teaching people how to write algorithms, having to write each one twice, once for ranges and arrays and once as a specialization for strings, even when decoding is never necessary (such as for 'copy'), is embarrassing.
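To illustrate the 'copy' point, a rough sketch: naiveCopy below is a hypothetical, deliberately generic copy written only in terms of the range primitives. For a string source, front/popFront decode each character to dchar and put() re-encodes it, even though a plain code-unit copy would have sufficed, which is exactly why such algorithms end up with a string specialization.

    import std.range.primitives : empty, front, popFront, put;

    // hypothetical generic copy, written only against the range primitives
    void naiveCopy(Source, Target)(Source src, Target dst)
    {
        for (; !src.empty; src.popFront())
            put(dst, src.front);   // for strings: decode to dchar, re-encode into dst
    }

    void main()
    {
        string src = "héllo";
        auto buf = new char[src.length];
        naiveCopy(src, buf);       // writes through the slice into buf
        assert(buf == src);
    }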


5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead
of having the user wonder separately for each case. These uses don't need
decoding, and the standard library correctly doesn't involve it (or if it
currently does, it has a bug):

s.find("abc")
s.findSplit("abc")
s.findSplit('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at the code point level, even though internally
it may choose to use code units when admissible. Leaving such a decision to the
library seems like a wise thing to do.

Running my char[] through a pipeline and having it come out sometimes as char[] and sometimes dchar[] and sometimes ubyte[] is hidden and surprising behavior.
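Roughly what that looks like: the same string input surfaces with different types depending on which adapters touch it.

    import std.algorithm : filter;
    import std.array : array;
    import std.string : representation;
    import std.uni : isWhite;

    void main()
    {
        string s = "two words";

        auto a = s.filter!(c => !c.isWhite).array; // dchar[]: elements were autodecoded
        auto b = s.representation;                 // immutable(ubyte)[]: decoding opted out

        static assert(is(typeof(a) == dchar[]));
        static assert(is(typeof(b) == immutable(ubyte)[]));
    }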


6. Autodecoding has two choices when encountering invalid code units -
throw or produce an error dchar. Currently, it throws, meaning no
algorithms using autodecode can be made nothrow.
Agreed. This is probably the most glaring mistake. I think we should open a
discussion on fixing this everywhere in the stdlib, even at the cost of breaking
code.

A third option is to pass the invalid code units through unmolested, which won't work if autodecoding is used.
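A minimal sketch of the current throw-on-invalid behaviour (the invalid string is constructed via a cast purely for illustration): autodecoding hits the bad byte and throws UTFException, which is why code paths that autodecode cannot be nothrow.

    import std.exception : assertThrown;
    import std.range : front;
    import std.utf : UTFException;

    void main()
    {
        immutable(ubyte)[] raw = [0xFF, 0x61]; // 0xFF is never valid in UTF-8
        auto bad = cast(string) raw;

        assertThrown!UTFException(bad.front);  // autodecoding front() throws here
    }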


7. Autodecode cannot be used with unicode path/filenames, because it is
legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
out in the wild that pure Unicode is not universal - there's lots of
dirty Unicode that should remain unmolested, and autodecoding does not play
well with that.
If paths are not UTF-8, then they shouldn't have string type (instead use
ubyte[] etc). More on that below.

Requiring code units to be all 100% valid is not workable, nor is redoing them to be ubytes. More on that below.


8. In my work with UTF-8 streams, dealing with autodecode has caused me
considerably extra work every time. A convenient timesaver it ain't.
Objection. Vague.

Sorry I didn't log the time I spent on it.


9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
importing std.array one way or another, and then autodecode is there.
Turning off autodecoding is as easy as inserting .representation after any
string.

.representation changes the type to ubyte[]. All knowledge that this is a Unicode string then gets lost for the rest of the pipeline.
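Roughly what that loss looks like: after .representation the static type no longer says "UTF-8 text", so overloads and traits keyed on character types no longer see it.

    import std.string : representation;
    import std.traits : isSomeString;

    void main()
    {
        string s = "héllo";
        auto r = s.representation;

        static assert(is(typeof(r) == immutable(ubyte)[]));
        static assert( isSomeString!(typeof(s)));
        static assert(!isSomeString!(typeof(r))); // the "Unicode string" knowledge is gone
    }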


(Not to mention using indexing directly.)

Doesn't work if you're pipelining.


10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
benefit of being arrays in the first place.
First off, you always have the option with .representation. That's a great name
because it gives you the type used to represent the string - i.e. an array of
integers of a specific width.

I found .representation to be unworkable because it changed the type.
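For reference, a minimal check of point 10 as the range traits see it: a narrow string is not a random-access range, while its .representation view is, at the cost of the type change just mentioned.

    import std.range : isRandomAccessRange;
    import std.string : representation;

    void main()
    {
        string s = "héllo";
        static assert(!isRandomAccessRange!(typeof(s)));                // autodecoded view
        static assert( isRandomAccessRange!(typeof(s.representation))); // the ubyte[] view
    }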


11. Indexing an array produces different results than autodecoding,
another glaring special case.
This is a direct consequence of the fact that string is immutable(char)[] and
not a specific type. That error predates autodecoding.

Even if it is made a special type, the problem of what an index means will remain. Of course, indexing by code point is an O(n) operation, which I submit is surprising and shouldn't be supported as [i] even by a special type (for the same reason that indexing of linked lists is frowned upon). Giving up indexing means giving up efficient slicing, which would be a major downgrade for D.
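A small illustration of that mismatch: [] indexing returns code units, while the autodecoding range primitives return code points.

    import std.range : front, walkLength;

    void main()
    {
        string s = "éx";             // 'é' is two UTF-8 code units

        assert(s.length == 3);       // code units
        assert(s[0] == 0xC3);        // indexing: the first byte of 'é'
        assert(s.front == '\u00E9'); // autodecoding front: the code point for 'é'
        assert(s.walkLength == 2);   // code points
    }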


Overall, I think the one way to make real steps forward in improving string
processing in the D language is to give a clear answer of what char, wchar, and
dchar mean.

They mean code units. This is not ambiguous. Here is how a code unit is different from a ubyte:

A. I know you hate bringing up my personal experience, but here goes. I've programmed in C forever. In C, char is used for both small integers and characters. It's always been a source of confusion, and sometimes bugs, to conflate the two:

     struct S { char field; };

Which is it, a character or a small integer? I have to rely on reading the code. It's a definite improvement in D that they are distinguished, and I feel that improvement every time I have to deal with C/C++ code and see 'char' used as a small integer instead of a character.

B. Overloading distinguishes char from ubyte, and that's good. For example, writeln(T[]) produces different results for char[] and ubyte[], and this is unsurprising and expected. It "just works".
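A small example of that: the same byte values print as text through char[] and as integers through ubyte[].

    import std.stdio : writeln;

    void main()
    {
        char[]  text = ['h', 'i'];
        ubyte[] nums = [104, 105]; // the same byte values as 'h', 'i'

        writeln(text); // hi
        writeln(nums); // [104, 105]
    }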

C. More overloading:

      writeln('a');

Does anyone want that to print 97? Does anyone really want 'a' to be of type dchar? (The trouble with that is type inference when building up more complex types, as you'll wind up with hidden dchar[] if not careful. My experience with dchar[] is that it is almost never desirable, as it is too memory hungry.)
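The same contrast for a single element (97 being the code for 'a'):

    import std.stdio : writeln;

    void main()
    {
        writeln('a');             // a  -- a character
        writeln(cast(ubyte) 'a'); // 97 -- the small-integer reading
    }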
