On Thursday, 20 March 2014 at 22:51:27 UTC, monarch_dodra wrote:
> On Thursday, 20 March 2014 at 22:39:47 UTC, Walter Bright wrote:
>> Currently we do it by throwing a UTFException. This has problems:
>> 1. just about anything that deals with UTF cannot be made nothrow
>> 2. turns innocuous errors into major problems, such as DoS attack vectors
>> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>> One option to fix this is to treat invalid sequences as:
>> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>> 2. U+FFFD
>> I kinda like option 1.
>> What do you think?
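For a concrete picture of option 2, Python's built-in decoder already offers exactly that replacement behavior, which makes the trade-off easy to see (this is an illustration of the substitution strategy, not D code):

```python
# 0xFF can never appear in well-formed UTF-8, so this input is invalid.
bad = b"abc\xffdef"

# Strict decoding throws -- analogous to std.utf throwing UTFException today.
try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# Option 2: substitute U+FFFD for each invalid sequence and keep going.
# No exception, so the calling code could be nothrow.
replaced = bad.decode("utf-8", errors="replace")
print(replaced)  # 'abc\ufffddef'
```

The substitution variant turns a decode failure into ordinary data, which is what removes both the nothrow problem and the DoS angle: malformed input can no longer abort processing.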
> I had thought of this before, and had an idea along the lines of:
> 1. strings "inside" the program are always valid.
> 2. encountering invalid strings "inside" the program is an Error.
> 3. strings from the "outside" world must be validated before use.
> The advantage is *more* than just a nothrow guarantee, but also a
> performance guarantee in release. And it *is* a pretty sane approach
> to the problem:
> - User data: validate before use.
> - Internal data: if it's bad, your program is in a failure state.
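That validate-at-the-boundary split can be sketched like this (Python for illustration; the function names `ingest` and `internal_logic` are made up for the example):

```python
def ingest(raw: bytes) -> str:
    """Boundary: outside data is validated exactly once, on the way in.
    Invalid input is rejected here and only here."""
    return raw.decode("utf-8")  # strict decode: raises on bad sequences

def internal_logic(s: str) -> int:
    # "Inside" the program validity is an invariant, so this code never
    # needs to handle (or be allowed to throw) a UTF decoding error.
    return len(s)

print(internal_logic(ingest(b"h\xc3\xa9llo")))  # validated once at the edge
```

In release builds the inner functions could then skip validity checks entirely, which is where the performance guarantee comes from.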
I'm a fan of this approach, but when I wrote about it once, Timon
pointed out that it's rather trivial to end up with an invalid string
by slicing mid-code-point, so now I'm not so sure. I think I'm still
in favor of it, because if that happens you've obviously got a logic
error and your program isn't correct anyway (it's not a matter of bad
user input).
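Timon's slicing hazard is easy to reproduce. Here it is with Python byte strings (purely illustrative; the same thing happens in D because slicing a string addresses code units, not code points):

```python
s = "héllo".encode("utf-8")  # b'h\xc3\xa9llo' -- 'é' occupies two bytes

# The slice index lands in the middle of the two-byte sequence for 'é',
# so the result is no longer valid UTF-8 even though the original was.
chopped = s[:2]              # b'h\xc3'

try:
    chopped.decode("utf-8")
except UnicodeDecodeError:
    print("sliced string is invalid UTF-8")
```

So under the "internal strings are always valid" model, a mid-code-point slice is a bug in the program itself, not a data problem, which is why treating it as an Error rather than an Exception is at least defensible.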