On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu
wrote:
Here's a baseline: http://goo.gl/91vIGc. Destroy!
Andrei
Before we roll this out, could we discuss a strategy/guideline
in regards to detecting and handling invalid UTF sequences?
Having a fast "front" is fine and all, but if it means your
program asserting in release (or worse, silently corrupting
memory) just because the client was trying to read a bad text
file, I'm unsure this is acceptable.
I would strongly advise at least offering an option, possibly via
a template parameter, for choosing how decoding errors are
handled, similar to how Python handles decoding. Examples below
in Python 3.
b"\255".decode("utf-8", errors="strict")   # UnicodeDecodeError
b"\255".decode("utf-8", errors="replace")  # replacement character used
b"\255".decode("utf-8", errors="ignore")   # Empty string, invalid sequence removed.
All three strategies are useful from time to time. I mainly reach
for option three when I'm trying to get some text data out of
some old broken databases or similar.
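To make the comparison concrete, here is a small runnable sketch of the three policies applied to the same input. The helper name decode_with_policies and the sample bytes are just made up for illustration; only bytes.decode and its errors argument are actual Python.

```python
def decode_with_policies(data: bytes) -> dict:
    """Decode `data` as UTF-8 once per error policy and collect the outcomes.

    Hypothetical helper for illustration only; not part of any library.
    """
    results = {}
    for policy in ("strict", "replace", "ignore"):
        try:
            results[policy] = data.decode("utf-8", errors=policy)
        except UnicodeDecodeError as exc:
            # Only 'strict' raises; record where the invalid byte was found.
            results[policy] = f"error at byte {exc.start}"
    return results


if __name__ == "__main__":
    # 0xFF can never appear in well-formed UTF-8.
    for policy, outcome in decode_with_policies(b"abc\xffdef").items():
        print(f"{policy:8} -> {outcome!r}")
```

With that input, 'strict' reports the failure at byte offset 3, 'replace' yields "abc\ufffddef", and 'ignore' yields "abcdef". The point is that the caller picks the policy; the decoder itself never has to guess.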
We may consider leaving the error checking on in -release for the
'strict' decoding, but throwing an Error instead of an exception
so the function can be nothrow. This would prevent memory
corruption in release code. assert vs throw Error is up for
debate.