I made a commit that was meant to better document which functions throw in std.utf.

In doing so, I noticed that some of our functions are unsafe. For example:

import std.range;

string s = [0b1100_0000]; // 1 byte of a 2-byte sequence
s.popFront();             // Assertion error because of invalid
                          // slicing of s[2 .. $];

"pop" is nothrow, so throwing exception is out of the question, and the implementation seems to imply that "invalid unicode sequences are removed".

This is a bug, right?

--------
Things get more complicated if you take into account "partial invalidity". For example:

string s = [0b1100_0000, 'a', 'b'];

Here, the first byte is an invalid sequence on its own, since the second byte is not a continuation byte of the form 0b10XX_XXXX. What's more, byte 2 is itself a valid sequence. We do not detect this, though, and produce this output:
s.popFront(); => s == "b";
*Arguably*, the correct behavior would be:
s.popFront(); => s == "ab";
where only the single invalid first byte is removed.
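
A minimal sketch of that behavior, assuming we are willing to look at the continuation bytes before skipping (popFrontMinimalSkip is a hypothetical name, not an existing Phobos symbol):

void popFrontMinimalSkip(ref string s)
{
    immutable ubyte lead = cast(ubyte) s[0];
    size_t len = 1;
    if ((lead & 0b1110_0000) == 0b1100_0000)      len = 2;
    else if ((lead & 0b1111_0000) == 0b1110_0000) len = 3;
    else if ((lead & 0b1111_1000) == 0b1111_0000) len = 4;

    // Check every continuation byte; on any mismatch, drop only the lead byte.
    foreach (i; 1 .. len)
    {
        if (i >= s.length || (s[i] & 0b1100_0000) != 0b1000_0000)
        {
            len = 1;
            break;
        }
    }
    s = s[len .. $]; // with [0b1100_0000, 'a', 'b'] this leaves s == "ab"
}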

The problem is that doing this would be much more expensive, especially for a rare case. Worse yet, chances are you end up validating the same character again and again (and again).

--------
So here are my 2 questions:
1. Is there, or does anyone know of, a standardized "behavior to follow when decoding UTF with invalid code units"?

2. Do we even really support invalid UTF after we "leave" the std.utf.decode layer? E.g. do we simply assume that the string is valid?
