On 25.08.2015 at 07:55, Martin Nowak wrote:
On Saturday, 22 August 2015 at 13:41:49 UTC, Sönke Ludwig wrote:
There is more than the actual call to validate(), such as writing
tests and making sure the surroundings work, adjusting the interface
and writing documentation. It's not *that* much work, but nonetheless
wasted work.

I also still think that this hasn't been a bad idea at all, because it
speeds up the most important use case: parsing JSON from a non-memory
source that has not yet been validated. I also very much like the idea
of making it a programming error to have invalid UTF stored in a
string, i.e. forcing the validation to happen before the cast from
bytes to chars.

Also see "utf/unicode should only be validated once"
https://issues.dlang.org/show_bug.cgi?id=14919

If combining lexing and validation is faster (why?), then a
ubyte-consuming interface should be available, though why couldn't that
be done by adding a lazy ubyte->char validator range to std.utf?
In any case, during lexing we should avoid autodecoding narrow strings,
since that only validates them redundantly.
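
To make the "lazy ubyte->char validator" idea concrete, here is a rough sketch in Python rather than D. The name validated_text is made up for illustration, and Python's standard incremental UTF-8 decoder stands in for whatever such a range would do internally. Bytes are only validated as the consumer pulls text out, and a multi-byte sequence may be split across input chunks.

```python
import codecs
from typing import Iterable, Iterator


def validated_text(chunks: Iterable[bytes]) -> Iterator[str]:
    """Lazily turn raw byte chunks into validated text (hypothetical helper).

    Raises UnicodeDecodeError as soon as a consumed chunk contains
    malformed UTF-8, or at the end if the input stops mid-sequence.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)  # incomplete trailing bytes are buffered
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # raises if input ends mid-sequence
    if tail:
        yield tail
```

Downstream code that receives the output never needs to validate again, which is the "validate once" property asked for in the issue linked above.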

The performance benefit comes from the fact that almost all of JSON is a subset of ASCII, so lexing the input implicitly validates it as correct UTF. The only places where actual multi-byte UTF sequences can occur are string literals, outside of escape sequences. Depending on the type of document, that can result in far fewer conditionals than a full validation of the input.
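
As a rough illustration of that argument (in Python, not the actual D lexer, and with made-up names such as lex_validate), the sketch below scans the input one code unit at a time. Outside string literals any byte >= 0x80 is simply a lexing error, and explicit UTF-8 checks run only for non-ASCII bytes inside string literals. It does not check JSON grammar; it only shows where validation work is spent.

```python
def _sequence_length(lead: int) -> int:
    """Length of a UTF-8 sequence from its lead byte, or 0 if invalid."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def lex_validate(data: bytes) -> int:
    """Scan JSON-like bytes; return how many bytes needed explicit UTF checks.

    Raises ValueError for non-ASCII bytes outside string literals and for
    malformed UTF-8 inside them.
    """
    checked = 0
    i, n = 0, len(data)
    in_string = False
    while i < n:
        b = data[i]
        if not in_string:
            if b >= 0x80:  # structural tokens are pure ASCII
                raise ValueError(f"non-ASCII byte outside string at {i}")
            in_string = b == 0x22  # '"' opens a string literal
            i += 1
        elif b == 0x5C:  # '\\': the escaped character must be ASCII
            if i + 1 >= n or data[i + 1] >= 0x80:
                raise ValueError(f"invalid escape at {i}")
            i += 2
        elif b == 0x22:  # closing quote
            in_string = False
            i += 1
        elif b < 0x80:  # plain ASCII inside a string: nothing to validate
            i += 1
        else:  # the only place where real UTF-8 validation happens
            length = _sequence_length(b)
            if length == 0:
                raise ValueError(f"invalid UTF-8 lead byte at {i}")
            data[i:i + length].decode("utf-8")  # strict; raises if malformed
            checked += length
            i += length
    if in_string:
        raise ValueError("unterminated string literal")
    return checked
```

For a document such as {"a": [1, 2, 3]} no byte takes the validation branch at all, whereas a separate validation pass would have to inspect every byte before lexing even starts.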

Autodecoding is avoided during lexing; everything happens at the code unit level.
