On Thursday, 20 March 2014 at 22:51:27 UTC, monarch_dodra wrote:
On Thursday, 20 March 2014 at 22:39:47 UTC, Walter Bright wrote:
Currently we do it by throwing a UTFException. This has problems:

1. just about anything that deals with UTF cannot be made nothrow

2. turns innocuous errors into major problems, such as DOS attack vectors
http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

One option to fix this is to treat invalid sequences as:

1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

2. U+FFFD

I kinda like option 1.

What do you think?
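For comparison, option 2 is what Rust's standard library does (a sketch in Rust, since it illustrates both behaviors concisely): strict validation reports an error without throwing mid-decode, and lossy decoding substitutes U+FFFD for each invalid sequence.

```rust
fn main() {
    // Bytes containing an invalid UTF-8 sequence (0xFF can never appear in UTF-8).
    let bytes: &[u8] = b"abc\xFFdef";

    // Strict validation: returns an error value instead of throwing.
    assert!(std::str::from_utf8(bytes).is_err());

    // Lossy decoding: each invalid sequence becomes U+FFFD (option 2 above).
    let lossy = String::from_utf8_lossy(bytes);
    assert_eq!(lossy, "abc\u{FFFD}def");

    println!("{}", lossy);
}
```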

I had thought of this before, and had an idea along the lines of:
1. strings "inside" the program are always valid.
2. encountering invalid strings "inside" the program is an Error.
3. strings from the "outside" world must be validated before use.

The advantage is *more* than just a nothrow guarantee, but also a performance guarantee in release. And it *is* a pretty sane approach to the problem:
- User data: validate before use.
- Internal data: if it's bad, your program is in a failure state.

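This "validate at the boundary, trust internally" model is roughly how Rust's string types work, which makes for a concrete sketch: `&str` is guaranteed valid UTF-8, and conversion from raw bytes is an explicit, fallible step done exactly once at the boundary. The `ingest` function below is a hypothetical boundary helper, not part of any library.

```rust
use std::str;

// Hypothetical boundary function: raw input from the outside world must
// pass validation exactly once before it becomes an internal string.
fn ingest(raw: &[u8]) -> Result<&str, str::Utf8Error> {
    str::from_utf8(raw) // validation happens here, at the boundary
}

fn main() {
    // Valid external data crosses the boundary once...
    let ok = ingest(b"hello").expect("valid UTF-8");
    // ...and everything downstream may assume validity: no rechecking,
    // no decode exceptions, nothrow-friendly code.
    assert_eq!(ok.chars().count(), 5);

    // Invalid external data is rejected at the boundary instead of
    // blowing up deep inside the program.
    assert!(ingest(b"\xC3\x28").is_err()); // 0xC3 starts a 2-byte sequence, 0x28 is no continuation byte
}
```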

I'm a fan of this approach, but when I wrote about it once, Timon pointed out that it's rather trivial to get an invalid string by slicing mid-code point, so now I'm not so sure. I think I'm still in favor of it, because if that happens you've obviously got a logic error and your program isn't correct anyway (it's not a matter of bad user input).
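The mid-code-point slicing hazard Timon raised can be sketched in Rust as well: there, a byte-indexed slice that lands inside a multi-byte sequence panics (the logic-error view), while the checked `get` accessor refuses to produce an invalid string at all.

```rust
fn main() {
    let s = "héllo"; // 'é' occupies two bytes (0xC3 0xA9) in UTF-8

    // Byte index 2 falls in the middle of 'é', so it is not a char boundary.
    assert!(s.is_char_boundary(1));
    assert!(!s.is_char_boundary(2));

    // The checked slice returns None rather than yielding an invalid string;
    // the unchecked form `&s[1..2]` would panic at runtime instead.
    assert!(s.get(1..2).is_none());
    assert_eq!(s.get(0..1), Some("h"));
}
```

Either way, the slice can never silently flow onward as corrupt data, which matches the view that a mid-code-point slice is a program bug rather than a decoding condition to recover from.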
