I made a commit that was meant to better document which functions throw in std.utf.

In doing so, I noticed that some of our functions are unsafe. For example:

import std.range;

string s = [0b1100_0000]; // 1 byte of a 2-byte sequence
s.popFront();             // Assertion error because of invalid
                          // slicing of s[2 .. $];

"pop" is nothrow, so throwing exception is out of the question, and the implementation seems to imply that "invalid unicode sequences are removed".

This is a bug, right?

--------
Things get more complicated if you take into account "partial invalidity". For example:

string s = [0b1100_0000, 'a', 'b'];

Here, the first byte is an invalid sequence on its own, since the second byte is not a continuation byte of the form 0b10XX_XXXX. What's more, byte 2 is itself a valid sequence. We do not detect this, though, and produce this output:
s.popFront(); => s == "b";
*Arguably*, the correct behavior would be:
s.popFront(); => s == "ab";
where only the single invalid first byte is removed.
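
A minimal sketch of that behavior, assuming we are willing to look at the continuation bytes before skipping (popFrontMinimalSkip is a hypothetical name, not an existing Phobos symbol):

void popFrontMinimalSkip(ref string s)
{
    immutable ubyte lead = cast(ubyte) s[0];
    size_t len = 1;
    if ((lead & 0b1110_0000) == 0b1100_0000)      len = 2;
    else if ((lead & 0b1111_0000) == 0b1110_0000) len = 3;
    else if ((lead & 0b1111_1000) == 0b1111_0000) len = 4;

    // Check every continuation byte; on any mismatch, drop only the lead byte.
    foreach (i; 1 .. len)
    {
        if (i >= s.length || (s[i] & 0b1100_0000) != 0b1000_0000)
        {
            len = 1;
            break;
        }
    }
    s = s[len .. $]; // with [0b1100_0000, 'a', 'b'] this leaves s == "ab"
}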

The problem is that doing this would be much more expensive, especially for a rare case. Worse yet, chances are you end up validating the same character again and again (and again).

--------
So here are my 2 questions:
1. Is there, or does anyone know of, a standardized "behavior to follow when decoding UTF with invalid code units"?

2. Do we even really support invalid UTF after we "leave" the std.utf.decode layer? E.g. do we simply assume that the string is valid?
