On 5/26/2016 9:00 AM, Andrei Alexandrescu wrote:
My thesis: the D1 design decision to represent strings as char[] was disastrous
and probably one of the largest weaknesses of D1. The decision in D2 to use
immutable(char)[] for strings is a vast improvement but still has a number of
issues.

The mutable vs. immutable distinction has nothing to do with autodecoding.


On 05/12/2016 04:15 PM, Walter Bright wrote:
On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
2. Every time one wants an algorithm to work with both strings and
ranges, you wind up special casing the strings to defeat the
autodecoding, or to decode the ranges. Having to constantly special case
it makes for more special cases when plugging together components. These
issues often escape detection when unittesting because it is convenient
to unittest only with arrays.

This is a consequence of 1. It is at least partially fixable.

It's a consequence of autodecoding, not arrays.
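To make the special-casing concrete, roughly the shape such code ends up taking (countElements and countImpl are hypothetical names; isNarrowString and byCodeUnit are the Phobos facilities commonly used for this):

    import std.stdio : writeln;
    import std.traits : isNarrowString;
    import std.utf : byCodeUnit;

    size_t countElements(R)(R r)
    {
        static if (isNarrowString!R)
            return countImpl(r.byCodeUnit); // special case: defeat autodecoding
        else
            return countImpl(r);            // every other range goes straight through
    }

    size_t countImpl(R)(R r)
    {
        size_t n;
        foreach (e; r) ++n;
        return n;
    }

    void main()
    {
        writeln(countElements("héllo"));   // 6 code units
        writeln(countElements([1, 2, 3])); // 3 elements
    }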


4. Autodecoding is slow and has no place in high speed string processing.
I would agree only with the amendment "...if used naively", which is important.
Knowledge of how autodecoding works is a prerequisite for writing fast string
code in D.

Having written high speed string processing code in D that also deals with Unicode (i.e. Warp), I found that the only knowledge of autodecoding needed was how to keep it from happening. Autodecoding made the code slower than necessary in every case it was used. I found no place in Warp where autodecoding was desirable.
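For reference, a minimal sketch of the usual ways to keep autodecoding from happening (byCodeUnit and .representation are the standard Phobos escape hatches):

    import std.string : representation;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "hello wörld";

        foreach (char c; s) { }           // explicit char loop variable: no decoding
        foreach (c; s.byCodeUnit) { }     // code-unit range: no decoding, still composable
        foreach (b; s.representation) { } // immutable(ubyte)[]: no decoding, but no longer "text"
    }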


Also, little code should deal with one code unit or code point at a
time; instead, it should use standard library algorithms for searching, matching
etc.

That doesn't work so well. There always seems to be a need for custom string processing. Worse, when pipelining strings, autodecoding changes the element type to dchar, which then needs to be re-encoded into the result.
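Roughly what that type change looks like (the identity lambda is just for illustration):

    import std.algorithm : map;
    import std.array : array;
    import std.conv : to;

    void main()
    {
        string s = "abc";

        auto piped = s.map!(c => c);          // autodecoding: the element type is dchar
        static assert(is(typeof(piped.front) == dchar));

        auto d = s.map!(c => c).array;        // dchar[], four bytes per element
        static assert(is(typeof(d) == dchar[]));

        string t = s.map!(c => c).to!string;  // re-encoded back into UTF-8
    }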

The std.string algorithms I wrote all work much better (i.e. faster) without autodecoding, while maintaining proper Unicode support. In other words, autodecoding did not benefit the algorithms at all, and if the user is to use standard algorithms instead of custom ones, then autodecoding is not necessary.


When needed, iterating every code unit is trivially done through indexing.

This implies replacing pipelining with loops, and also falls apart if indexing is redone to index by code points.
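A rough contrast between the two styles (the function names are just for illustration); byCodeUnit from std.utf keeps the code-unit view composable without falling back to an index loop:

    import std.algorithm : count;
    import std.utf : byCodeUnit;

    size_t countSpacesByIndex(string s)
    {
        size_t n;
        foreach (i; 0 .. s.length)
            if (s[i] == ' ') ++n;       // indexing: code units, no decoding, but a loop
        return n;
    }

    size_t countSpacesPipelined(string s)
    {
        return s.byCodeUnit.count(' '); // still no decoding, and it stays a range
    }

    void main()
    {
        assert(countSpacesByIndex("a b c") == 2);
        assert(countSpacesPipelined("a b c") == 2);
    }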


Also allow me to point out that much of the slowdown can be addressed tactically.
The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore
easily speculated. We can and we should arrange code to minimize impact.

I.e. special case the code to avoid autodecoding.

The trouble is that the low level code cannot avoid autodecoding, as it happens before the low level code gets it. This is conceptually backwards, and winds up requiring every algorithm to special case strings, even when completely unnecessary. (The 'copy' algorithm is an example of utterly unnecessary decoding.)

When teaching people how to write algorithms, having to write each one twice, once for ranges and arrays and once as a specialization for strings, even when decoding is never necessary (such as for 'copy'), is embarrassing.
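To illustrate the 'copy' point, a rough sketch: naiveCopy below is a hypothetical, deliberately generic copy written only in terms of the range primitives. For a string source, front/popFront decode each character to dchar and put() re-encodes it, even though a plain code-unit copy would have sufficed, which is exactly why such algorithms end up with a string specialization.

    import std.range.primitives : empty, front, popFront, put;

    // hypothetical generic copy, written only against the range primitives
    void naiveCopy(Source, Target)(Source src, Target dst)
    {
        for (; !src.empty; src.popFront())
            put(dst, src.front);   // for strings: decode to dchar, re-encode into dst
    }

    void main()
    {
        string src = "héllo";
        auto buf = new char[src.length];
        naiveCopy(src, buf);       // writes through the slice into buf
        assert(buf == src);
    }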


5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead
of having the user wonder separately for each case. These uses don't need
decoding, and the standard library correctly doesn't involve it (or if it
currently does, it has a bug):

s.find("abc")
s.findSplit("abc")
s.findSplit('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at the code point level, even though internally
it may choose to use code units when admissible. Leaving such a decision to the
library seems like a wise thing to do.

Running my char[] through a pipeline and having it come out sometimes as char[] and sometimes dchar[] and sometimes ubyte[] is hidden and surprising behavior.
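Roughly what that looks like: the same string input surfaces with different types depending on which adapters touch it.

    import std.algorithm : filter;
    import std.array : array;
    import std.string : representation;
    import std.uni : isWhite;

    void main()
    {
        string s = "two words";

        auto a = s.filter!(c => !c.isWhite).array; // dchar[]: elements were autodecoded
        auto b = s.representation;                 // immutable(ubyte)[]: decoding opted out

        static assert(is(typeof(a) == dchar[]));
        static assert(is(typeof(b) == immutable(ubyte)[]));
    }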


6. Autodecoding has two choices when encountering invalid code units -
throw or produce an error dchar. Currently, it throws, meaning no
algorithms using autodecode can be made nothrow.
Agreed. This is probably the most glaring mistake. I think we should open a
discussion on fixing this everywhere in the stdlib, even at the cost of breaking
code.

A third option is to pass the invalid code units through unmolested, which won't work if autodecoding is used.
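A minimal sketch of the current throw-on-invalid behaviour (the invalid string is constructed via a cast purely for illustration): autodecoding hits the bad byte and throws UTFException, which is why code paths that autodecode cannot be nothrow.

    import std.exception : assertThrown;
    import std.range : front;
    import std.utf : UTFException;

    void main()
    {
        immutable(ubyte)[] raw = [0xFF, 0x61]; // 0xFF is never valid in UTF-8
        auto bad = cast(string) raw;

        assertThrown!UTFException(bad.front);  // autodecoding front() throws here
    }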


7. Autodecode cannot be used with unicode path/filenames, because it is
legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
out in the wild that pure Unicode is not universal - there's lots of
dirty Unicode that should remain unmolested, and autodecoding does not play
well with that.
If paths are not UTF-8, then they shouldn't have string type (instead use
ubyte[] etc). More on that below.

Requiring code units to be all 100% valid is not workable, nor is redoing them to be ubytes. More on that below.


8. In my work with UTF-8 streams, dealing with autodecode has caused me
considerably extra work every time. A convenient timesaver it ain't.
Objection. Vague.

Sorry I didn't log the time I spent on it.


9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
importing std.array one way or another, and then autodecode is there.
Turning off autodecoding is as easy as inserting .representation after any
string.

.representation changes the type to ubyte[]. All knowledge that this is a Unicode string then gets lost for the rest of the pipeline.
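Roughly what that loss looks like: after .representation the static type no longer says "UTF-8 text", so overloads and traits keyed on character types no longer see it.

    import std.string : representation;
    import std.traits : isSomeString;

    void main()
    {
        string s = "héllo";
        auto r = s.representation;

        static assert(is(typeof(r) == immutable(ubyte)[]));
        static assert( isSomeString!(typeof(s)));
        static assert(!isSomeString!(typeof(r))); // the "Unicode string" knowledge is gone
    }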


(Not to mention using indexing directly.)

Doesn't work if you're pipelining.


10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
benefit of being arrays in the first place.
First off, you always have the option with .representation. That's a great name
because it gives you the type used to represent the string - i.e. an array of
integers of a specific width.

I found .representation to be unworkable because it changed the type.
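For reference, a minimal check of point 10 as the range traits see it: a narrow string is not a random-access range, while its .representation view is, at the cost of the type change just mentioned.

    import std.range : isRandomAccessRange;
    import std.string : representation;

    void main()
    {
        string s = "héllo";
        static assert(!isRandomAccessRange!(typeof(s)));                // autodecoded view
        static assert( isRandomAccessRange!(typeof(s.representation))); // the ubyte[] view
    }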


11. Indexing an array produces different results than autodecoding,
another glaring special case.
This is a direct consequence of the fact that string is immutable(char)[] and
not a specific type. That error predates autodecoding.

Even if it is made a special type, the problem of what an index means will remain. Of course, indexing by code point is an O(n) operation, which I submit is surprising and shouldn't be supported as [i] even by a special type (for the same reason that indexing of linked lists is frowned upon). Giving up indexing means giving up efficient slicing, which would be a major downgrade for D.
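A small illustration of that mismatch: [] indexing returns code units, while the autodecoding range primitives return code points.

    import std.range : front, walkLength;

    void main()
    {
        string s = "éx";             // 'é' is two UTF-8 code units

        assert(s.length == 3);       // code units
        assert(s[0] == 0xC3);        // indexing: the first byte of 'é'
        assert(s.front == '\u00E9'); // autodecoding front: the code point for 'é'
        assert(s.walkLength == 2);   // code points
    }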


Overall, I think the one way to make real steps forward in improving string
processing in the D language is to give a clear answer of what char, wchar, and
dchar mean.

They mean code units. This is not ambiguous. Here is how a code unit is different from a ubyte:

A. I know you hate bringing up my personal experience, but here goes. I've programmed in C forever. In C, char is used for both small integers and characters. It's always been a source of confusion, and sometimes bugs, to conflate the two:

     struct S { char field; };

Which is it, a character or a small integer? I have to rely on reading the code. It's a definite improvement in D that they are distinguished, and I feel that improvement every time I have to deal with C/C++ code and see 'char' used as a small integer instead of a character.

B. Overloading distinguishes char from ubyte, and that's good. For example, writeln(T[]) produces different results for char[] and ubyte[], and this is unsurprising and expected. It "just works".
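A small example of that: the same byte values print as text through char[] and as integers through ubyte[].

    import std.stdio : writeln;

    void main()
    {
        char[]  text = ['h', 'i'];
        ubyte[] nums = [104, 105]; // the same byte values as 'h', 'i'

        writeln(text); // hi
        writeln(nums); // [104, 105]
    }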

C. More overloading:

      writeln('a');

Does anyone want that to print 97? Does anyone really want 'a' to be of type dchar? (The trouble with that is type inference when building up more complex types, as you'll wind up with hidden dchar[] if not careful. My experience with dchar[] is that it is almost never desirable, as it is too memory hungry.)
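The same contrast for a single element (97 being the code for 'a'):

    import std.stdio : writeln;

    void main()
    {
        writeln('a');             // a  -- a character
        writeln(cast(ubyte) 'a'); // 97 -- the small-integer reading
    }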
