Re: Behavior of strings with invalid unicode...

monarch_dodra Sun, 25 Nov 2012 23:50:35 -0800

On Wednesday, 21 November 2012 at 18:25:56 UTC, Jonathan M Davis wrote:

On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
So here are my 2 questions:
1. Is there, or does anyone know of, a standardized "behavior to
follow when decoding utf with invalid codes"?
2. Do we even really support invalid UTF after we "leave" the
std.utf.decode layer? EG: We simply suppose that the string is
valid?
We don't support invalid unicode beyond providing ways to check for it and in some cases throwing if it's encountered. If you create a string with invalid unicode, then you're shooting yourself in the foot, and you could get weird results. Some code checks for validity and will throw when it's given invalid unicode (decode in particular does this), whereas some code will simply ignore the fact that it's invalid and move on (generally, because it's not bothering to go to the effort of validating it). I believe that at the moment, the idea is that when the full decoding of a character occurs, a UTFException will be thrown if an invalid code point is encountered, whereas anything which partially decodes characters (e.g. just figures out how large a code point is) may or may not throw. popFront used to throw but doesn't any longer in an effort to make it faster, letting decode be the one to throw (so front would still throw, but popFront wouldn't).

OK: I guess that makes sense. I kind of wish there were more of a documented "two-level" scheme, but that should be fine.

I'm not aware of there being any standard way to deal with invalid Unicode, but I believe that popFront currently just treats invalid code points as being of length 1.

- Jonathan M Davis

Well, popFront pops only 1 element if the very first element is an invalid code unit, but for multi-byte sequences it will not "see" whether the code unit at index 2 is invalid.

This kind of gives it a double-standard behavior, but I guess we have to draw a line somewhere.
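
To make the double standard concrete, here is a minimal D sketch. The helper names unitsPopped and frontThrows are made up for illustration and are not part of Phobos. The two inputs are hand-built invalid sequences: one with a bad lead byte, one with a valid lead byte followed by a byte that is not a continuation byte. Going by the behavior described in this thread, one would expect front to throw for both, while popFront drops 1 code unit for the first and 2 for the second. That is not guaranteed across Phobos versions, so the program prints what it observes rather than asserting it.

```d
import std.range.primitives : front, popFront;
import std.stdio : writeln;
import std.utf : UTFException;

/// Hypothetical helper: how many code units popFront removes from `s`.
size_t unitsPopped(const(char)[] s)
{
    immutable before = s.length;
    s.popFront(); // only advances the local slice
    return before - s.length;
}

/// Hypothetical helper: true if fully decoding the first code point
/// (which is what front does for narrow strings) throws UTFException.
bool frontThrows(const(char)[] s)
{
    try
    {
        auto c = s.front;
        return false;
    }
    catch (UTFException e)
    {
        return true;
    }
}

void main()
{
    // 0x80 is a continuation byte, so it cannot start a sequence.
    const(char)[] badLead = [cast(char) 0x80, 'a'];
    // 0xC3 announces a 2-byte sequence, but '(' is not a continuation byte.
    const(char)[] badTail = [cast(char) 0xC3, '(', 'a'];

    writeln("badLead: front throws = ", frontThrows(badLead),
            ", units popped = ", unitsPopped(badLead));
    writeln("badTail: front throws = ", frontThrows(badTail),
            ", units popped = ", unitsPopped(badTail));
}
```

If the second case really does pop 2 units, then popFront has silently skipped over the '(' even though front refuses to decode the same position, which is the inconsistency I mean.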
