On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote:
> But why should UTF validation be the job of the lexer in the first place?

Because you want to save time; it is faster to integrate validation. The most likely usage scenario is receiving REST data over HTTP, which needs validation anyway.

Well, so then I agree with Andrei… array of bytes it is. ;-)

> added as a separate proxy range. But if we end up going for validating in the lexer, it would indeed be enough to validate inside strings, because the rest of the grammar assumes a subset of ASCII.

Not assumes, but defines! :-)

If you have to validate UTF before lexing, then you will end up needlessly scanning lots of ASCII if the file contains lots of non-string data or comes from an encoder that only emits pure ASCII.

If you want "plugin" validation of strings, then you also need to differentiate string types so that the user can select which fields should be plain ASCII, UTF-8, numbers, IDs, etc. Otherwise the user ends up doing double validation (the lexer has to scan past bytes >0x7F until the string terminator anyway).

The advantage of integrated validation is that you can use 16-byte SIMD registers on the buffer.

I presume you can load 16 bytes and do a bitwise AND on the MSBs, then match against the string terminator, and carefully combine these to do UTF validation, escape scanning, and string-end scanning simultaneously. A bit tricky, of course.

> At least no UTF validation is needed. Since all non-ASCII characters will always be composed of bytes >0x7F, a sequence \uXXXX can be assumed to be valid wherever in the string it occurs, and all other bytes that don't belong to an escape sequence are just passed through as-is.

You cannot assume a \uXXXX escape to be valid if you convert it, e.g. it may encode an unpaired surrogate.
