https://issues.dlang.org/show_bug.cgi?id=14519
--- Comment #16 from Vladimir Panteleev <[email protected]> ---
(In reply to Walter Bright from comment #15)
> It still allocates memory. But it's worth thinking about. Maybe assert()?

Sure.

> > I did not mean Unicode normalization - it was a joke (std.algorithm will
> > "normalize" invalid UTF characters to the replacement character). But
> > since .front on strings autodecodes, feeding a string to any generic
> > range function in std.algorithm will cause auto-decoding (and thus,
> > character substitution).
>
> That can be fixed as I suggested.

Sorry, I'm not following. Which suggestion here will fix what in what way?

> Global opt-in for foreach is not feasible.

I agree - some libraries will expect one thing, and others another.

> However, one can add an algorithm "validate" which throws on invalid UTF,
> and put that at the start of a pipeline, as in:
>
> text.validate.A.B.C.D;

This is part of a solution. There also needs to be a way to ensure that
validate was called, which is the hard part.

> You brought up guessing the encoding of XML text by reading the start of
> it: "what if it was some 8-bit encoding that only LOOKED like valid UTF-8?"

No, that's not what I meant. UTF-8 and old 8-bit encodings (ISO 8859-*,
Windows-125*) both use the high bit of the byte to indicate non-ASCII
characters. Consider a program that expects a UTF-8 document but is actually
fed one in an 8-bit encoding: it is possible (although unlikely) that text
which is actually in an 8-bit encoding will be successfully interpreted as a
valid UTF-8 stream. Thus, invalid UTF-8 can indicate a problem with the
entire document, and not just with the immediate sequence of bytes.

> If you have a pipeline A.B.C.D, then A throws on invalid UTF, and B.C.D
> are never executed. But if A does not throw, then B.C.D are guaranteed to
> be getting valid UTF, but they still pay the penalty of the compiler
> thinking they can allocate memory and throw.
OK, so you're saying that we can somehow automatically remove the cost of
handling invalid UTF-8 if we know that the UTF-8 we're getting is valid? I
don't see how this would work, or how it would provide a noticeable benefit
in practice. Since the cost of removing a code path is negligible, I assume
you're talking about exception frames, but I still don't see how this
applies. Could you elaborate, or is this improvement just a theory for now?

Besides, won't A's output be a range of dchar, so B, C and D will not
autodecode with or without this change?
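To make the positions above concrete, here is a minimal D sketch using the existing std.utf.validate and std.utf.byUTF (the pipeline stages A, B, C, D from the discussion are placeholders, not real functions, so they are not shown). It demonstrates the three behaviors under discussion: validate throwing at the head of a pipeline, auto-decoding substituting U+FFFD, and an 8-bit-encoded document that nonetheless passes UTF-8 validation:

```d
import std.algorithm.comparison : equal;
import std.exception : assertThrown;
import std.utf : UTFException, byUTF, validate;

void main()
{
    // validate() at the head of a pipeline: it throws on invalid UTF-8,
    // so downstream stages are guaranteed well-formed input.
    string bad = "abc\xFFdef"; // 0xFF can never occur in valid UTF-8
    assertThrown!UTFException(validate(bad));

    // Auto-decoding paths instead substitute U+FFFD (the replacement
    // character) for invalid sequences -- the "normalization" joked
    // about above.
    assert(bad.byUTF!dchar.equal("abc\uFFFDdef"));

    // The ambiguity problem: the bytes 0xC3 0xA9 spell "Ã©" in
    // Windows-1252, but they are also the valid UTF-8 encoding of 'é'.
    // A UTF-8 validator cannot detect this kind of mojibake.
    immutable(ubyte)[] data = [0xC3, 0xA9];
    string s = cast(string) data;
    validate(s); // passes: it is well-formed UTF-8
    assert(s == "é");
}
```

Note that byUTF!dchar yields a range of dchar with replacement-character substitution already applied, which is why downstream generic algorithms would not re-decode it.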
