On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei

Before we roll this out, could we discuss a strategy/guideline with regard to detecting and handling invalid UTF sequences?

Having a fast "front" is fine and all, but if it means your program asserting in release (or worse, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable.

I would strongly advise at least offering an option, possibly via a template parameter, for turning error handling on or off, similar to how Python handles decoding. Examples below in Python 3.

b"\255".decode("utf-8", errors="strict") # UnicodeDecodeError
b"\255".decode("utf-8", errors="replace") # replacement character used
b"\255".decode("utf-8", errors="ignore") # Empty string, invalid sequence removed.

All three strategies are useful from time to time. I mainly reach for option three when I'm trying to get some text data out of some old broken databases or similar.
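To make the difference concrete, here is a minimal sketch of the three handlers applied to a longer input. The byte string and the helper name decode_with_policy are made up for illustration; only bytes.decode and its "strict", "replace" and "ignore" handlers come from Python itself.

```python
# Hypothetical input: valid ASCII around one stray byte. The octal escape
# \255 is 0xAD, a lone continuation byte that cannot start a UTF-8 sequence.
data = b"caf\255 au lait"


def decode_with_policy(raw: bytes, policy: str) -> str:
    """Decode raw as UTF-8 using one of the three standard error handlers.

    decode_with_policy is an illustrative wrapper, not a library function.
    """
    if policy not in ("strict", "replace", "ignore"):
        raise ValueError(f"unsupported policy: {policy!r}")
    return raw.decode("utf-8", errors=policy)


# "replace" keeps the position of the bad byte, marked with U+FFFD.
replaced = decode_with_policy(data, "replace")

# "ignore" drops the bad byte and keeps everything else.
ignored = decode_with_policy(data, "ignore")

# "strict" refuses the whole input.
try:
    decode_with_policy(data, "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The point is that the caller picks the policy per call site, which is roughly what a template parameter would give us.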

We may consider leaving the error checking on in -release for the 'strict' decoding, but throwing an Error instead of an Exception so the function can be nothrow. This would prevent memory corruption in release code. Whether to assert or throw an Error is up for debate.
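As a rough analogy in Python, where assert statements are stripped when running with -O, much like asserts in a D -release build: a strict "front" can validate with explicit checks that always run. This is only a sketch. The names InvalidUTFError and front are hypothetical, and the exception class merely stands in for the Error/Exception distinction, which Python does not have in the same form.

```python
from typing import Tuple


class InvalidUTFError(Exception):
    """Hypothetical stand-in for a D Error raised on malformed UTF-8."""


def front(data: bytes) -> Tuple[int, int]:
    """Return (code point, byte length) of the first UTF-8 sequence.

    Every check uses an explicit raise rather than assert, so validation
    stays active even when Python runs with -O.
    """
    if not data:
        raise InvalidUTFError("empty input")
    b0 = data[0]
    if b0 < 0x80:
        return b0, 1  # ASCII fast path
    if 0xC2 <= b0 <= 0xDF:
        length, cp, lowest = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        length, cp, lowest = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:
        length, cp, lowest = 4, b0 & 0x07, 0x10000
    else:
        raise InvalidUTFError(f"invalid lead byte 0x{b0:02X}")
    if len(data) < length:
        raise InvalidUTFError("truncated sequence")
    for b in data[1:length]:
        if b & 0xC0 != 0x80:
            raise InvalidUTFError(f"invalid continuation byte 0x{b:02X}")
        cp = (cp << 6) | (b & 0x3F)
    # Reject overlong encodings, surrogates, and values past U+10FFFF.
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise InvalidUTFError(f"invalid code point U+{cp:04X}")
    return cp, length
```

The ASCII path costs one comparison, so keeping the checks on should not hurt the common case much. Whether the failure should be catchable is the part that is still open.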
