On Thursday, 20 March 2014 at 22:51:27 UTC, monarch_dodra wrote:
> On Thursday, 20 March 2014 at 22:39:47 UTC, Walter Bright wrote:
>> Currently we do it by throwing a UTFException. This has problems:
>> 1. just about anything that deals with UTF cannot be made nothrow
>> 2. turns innocuous errors into major problems, such as DoS attack vectors
>> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>> One option to fix this is to treat invalid sequences as:
>> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>> 2. U+FFFD
>> I kinda like option 1.
>> What do you think?
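For a concrete picture of option 2, Python's built-in decoder already offers exactly that replacement behavior, which makes the trade-off easy to see (this is an illustration of the substitution strategy, not D code):

```python
# 0xFF can never appear in well-formed UTF-8, so this input is invalid.
bad = b"abc\xffdef"

# Strict decoding throws -- analogous to std.utf throwing UTFException today.
try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# Option 2: substitute U+FFFD for each invalid sequence and keep going.
# No exception, so the calling code could be nothrow.
replaced = bad.decode("utf-8", errors="replace")
print(replaced)  # 'abc\ufffddef'
```

The substitution variant turns a decode failure into ordinary data, which is what removes both the nothrow problem and the DoS angle: malformed input can no longer abort processing.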
> I had thought of this before, and had an idea along the lines of:
> 1. strings "inside" the program are always valid.
> 2. encountering invalid strings "inside" the program is an Error.
> 3. strings from the "outside" world must be validated before use.
> The advantage is *more* than just a nothrow guarantee, but also a
> performance guarantee in release. And it *is* a pretty sane approach
> to the problem:
> - User data: validate before use.
> - Internal data: if it's bad, your program is in a failure state.
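That validate-at-the-boundary split can be sketched like this (Python for illustration; the function names `ingest` and `internal_logic` are made up for the example):

```python
def ingest(raw: bytes) -> str:
    """Boundary: outside data is validated exactly once, on the way in.
    Invalid input is rejected here and only here."""
    return raw.decode("utf-8")  # strict decode: raises on bad sequences

def internal_logic(s: str) -> int:
    # "Inside" the program validity is an invariant, so this code never
    # needs to handle (or be allowed to throw) a UTF decoding error.
    return len(s)

print(internal_logic(ingest(b"h\xc3\xa9llo")))  # validated once at the edge
```

In release builds the inner functions could then skip validity checks entirely, which is where the performance guarantee comes from.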
I'm a fan of this approach, but when I wrote about it once, Timon
pointed out that it's rather trivial to end up with an invalid string
by slicing mid-code-point, so now I'm not so sure. I think I'm still
in favor of it, because if that happens you've obviously got a logic
error and your program isn't correct anyway (it's not a matter of bad
user input).
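Timon's slicing hazard is easy to reproduce. Here it is with Python byte strings (purely illustrative; the same thing happens in D because slicing a string addresses code units, not code points):

```python
s = "héllo".encode("utf-8")  # b'h\xc3\xa9llo' -- 'é' occupies two bytes

# The slice index lands in the middle of the two-byte sequence for 'é',
# so the result is no longer valid UTF-8 even though the original was.
chopped = s[:2]              # b'h\xc3'

try:
    chopped.decode("utf-8")
except UnicodeDecodeError:
    print("sliced string is invalid UTF-8")
```

So under the "internal strings are always valid" model, a mid-code-point slice is a bug in the program itself, not a data problem, which is why treating it as an Error rather than an Exception is at least defensible.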