On Wednesday, 21 November 2012 at 18:25:56 UTC, Jonathan M Davis wrote:
On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
So here are my 2 questions:
1. Is there, or does anyone know of, a standardized "behavior to
follow when decoding utf with invalid codes"?

2. Do we even really support invalid UTF after we "leave" the
std.utf.decode layer? I.e., do we simply assume that the string is
valid?

We don't support invalid Unicode beyond providing ways to check for it and, in some cases, throwing if it's encountered. If you create a string with invalid Unicode, then you're shooting yourself in the foot, and you could get weird results. Some code checks for validity and will throw when it's given invalid Unicode (decode in particular does this), whereas some code will simply ignore the fact that it's invalid and move on (generally because it's not bothering to go to the effort of validating it). I believe that at the moment, the idea is that when the full decoding of a character occurs, a UTFException will be thrown if an invalid code point is encountered, whereas anything which only partially decodes characters (e.g. just figures out how large a code point is) may or may not throw. popFront used to throw but no longer does, in an effort to make it faster, letting decode be the one to throw (so front would still throw, but popFront wouldn't).

OK: I guess that makes sense. I kind of wish there were more of a documented "two-level" scheme, but that should be fine.
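To make the "two-level" idea concrete, here is a minimal sketch in Python (not D, and not the Phobos implementation). The names stride, decode and UTFError are hypothetical stand-ins chosen for illustration: the cheap level looks only at the lead byte and never throws, while the full level validates the whole sequence and raises on invalid input.

```python
class UTFError(ValueError):
    """Hypothetical stand-in for an exception such as D's UTFException."""


def stride(data, i):
    """Cheap level: sequence length judged from the lead byte alone.

    Continuation bytes are not inspected, and an invalid lead byte is
    simply treated as a sequence of length 1 (an assumption of this sketch).
    """
    b = data[i]
    if b < 0x80:
        return 1
    if 0xC2 <= b <= 0xDF:
        return 2
    if 0xE0 <= b <= 0xEF:
        return 3
    if 0xF0 <= b <= 0xF4:
        return 4
    return 1  # stray continuation byte or otherwise invalid lead byte


def decode(data, i):
    """Full level: decode one code point at index i, raising on bad input.

    Returns the decoded character and the index just past it.
    """
    n = stride(data, i)
    try:
        ch = data[i:i + n].decode("utf-8")
    except UnicodeDecodeError as e:
        raise UTFError("invalid UTF-8 sequence at index %d" % i) from e
    return ch, i + n
```

With this split, stride(b"\xc3\x28", 0) happily reports a length of 2 even though the second byte is not a continuation byte, whereas decode(b"\xc3\x28", 0) raises UTFError.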

I'm not aware of there being any standard way to deal with invalid Unicode, but I believe that popFront currently just treats invalid code points as being
of length 1.

- Jonathan M Davis

Well, popFront pops just 1 element only if the very first element is an invalid code unit; for a multi-byte sequence, it will not "see" whether a later element (e.g. the one at index 2) is invalid.

This kind of gives it a double-standard behavior, but I guess we have to draw a line somewhere.
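The double standard can be shown with a small Python sketch (again not D, and only an approximation of what popFront is described as doing in this thread). The function pop_front is a hypothetical name; it trusts the lead byte, falls back to a length of 1 when the lead byte itself is invalid, and never examines the bytes that follow. It assumes a non-empty input.

```python
def pop_front(data):
    """Drop the first "character" from a non-empty bytes object.

    The length is taken from the lead byte only; trailing bytes are
    never validated. An invalid lead byte counts as length 1.
    """
    lead = data[0]
    if lead < 0x80:
        n = 1
    elif lead >> 5 == 0b110:
        n = 2
    elif lead >> 4 == 0b1110:
        n = 3
    elif lead >> 3 == 0b11110:
        n = 4
    else:
        n = 1  # invalid lead byte: skip just this one element
    # Clamp so a truncated sequence at the end does not overrun.
    return data[min(n, len(data)):]


# Invalid first byte: only that byte is dropped.
rest_a = pop_front(b"\xffab")
# Valid-looking 3-byte lead, but the second byte ('(' = 0x28) is not a
# continuation byte: all three bytes are dropped anyway.
rest_b = pop_front(b"\xe2\x28\xa1z")
```

In the first case the remainder is b"ab"; in the second it is b"z", so the perfectly valid "(" is silently swallowed along with the broken sequence.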
