https://issues.dlang.org/show_bug.cgi?id=14519
--- Comment #15 from Walter Bright <[email protected]> ---
(In reply to Vladimir Panteleev from comment #14)
> If I understand correctly, throwing Error instead of Exception will also
> solve the performance issues

It still allocates memory. But it's worth thinking about. Maybe assert()?

> Ditto, but the @nogc aspect can also be solved with the refcounted
> exceptions spec, which will fix the problem in general.

We'll see. That's still a ways off.

> > 2. Same thing. (Running normalization on passwords? What the hell?)
>
> I did not mean Unicode normalization - it was a joke (std.algorithm will
> "normalize" invalid UTF characters to the replacement character). But since
> .front on strings autodecodes, feeding a string to any generic range
> function in std.algorithm will cause auto-decoding (and thus, character
> substitution).

That can be fixed as I suggested.

> > The replacement char thing was not invented by me, it is commonplace as
> > users don't like their documents being wholly rejected for one or two bad
> > encodings.
>
> I know, I agree it's useful, but it needs to be opt-in.

Global opt-in for foreach is not feasible. However, one can add an algorithm
"validate" which throws on invalid UTF, and put that at the start of a
pipeline, as in:

    text.validate.A.B.C.D;

> > I know that many programs try to guess the encoding of random text they get.
> > Doing this by only reading a few characters, and assuming the rest, is a
> > strange method if one cares about the integrity of the data.
>
> I don't see how this is relevant, sorry.

You brought up guessing the encoding of XML text by reading the start of it:

"what if it was some 8-bit encoding that only LOOKED like valid UTF-8?"

> > Having to constantly re-sanitize data, at every step in the pipeline, is
> > going to make D programs uncompetitive speed-wise.
>
> I don't understand what you mean by this. You could say that any way to
> handle invalid UTF can be seen as a way of sanitizing data: there will
> always be a code path for what to do when invalid UTF is encountered. I
> would interpret "no sanitization" as not handling invalid UTF in any way
> (i.e. treating it in an undefined way).

If you have a pipeline A.B.C.D, and A throws on invalid UTF, then B.C.D are
never executed. But if A does not throw, then B.C.D are guaranteed to be
getting valid UTF, yet they still pay the penalty of the compiler thinking
they can allocate memory and throw.

--
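
A minimal sketch of the validate-first pipeline described in the comment above, not code from this thread. The names `validated` and `countSpaces` are hypothetical. It assumes that `std.utf.validate` throws `UTFException` on malformed input and that `std.utf.byCodeUnit` iterates without auto-decoding, which is what should let a stage behind the check be marked `nothrow @nogc`.

```d
import std.algorithm.searching : count;
import std.exception : assertThrown;
import std.utf : byCodeUnit, UTFException, validate;

/// Hypothetical helper: check the whole string once, then pass it through.
string validated(string text)
{
    validate(text); // std.utf.validate throws UTFException on invalid UTF
    return text;
}

/// Hypothetical downstream stage. byCodeUnit skips auto-decoding, so this
/// stage is expected to compile as nothrow @nogc.
size_t countSpaces(string text) nothrow @nogc
{
    return text.byCodeUnit.count!(c => c == ' ');
}

void main()
{
    // Valid input flows through the whole pipeline.
    assert("a b c".validated.countSpaces == 2);

    // A lone 0xFF byte is never valid UTF-8, so the first stage throws
    // and the later stage never runs.
    immutable(ubyte)[] raw = [0xFF];
    auto bad = cast(string) raw;
    assertThrown!UTFException(bad.validated.countSpaces);
}
```

Iterating the string directly instead of through `byCodeUnit` would go through auto-decoding, and the stage would then presumably no longer qualify as `nothrow`, which is the penalty the comment refers to.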
