On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote:
> But why should UTF validation be the job of the lexer in the first place?

Because you want to save time; it is faster to integrate validation. The most likely usage scenario is receiving REST data over HTTP, which needs validation anyway.

Well, so then I agree with Andrei… array of bytes it is. ;-)

> added as a separate proxy range. But if we end up going for validating in the lexer, it would indeed be enough to validate inside strings, because the rest of the grammar assumes a subset of ASCII.

Not assumes, but defines! :-)

If you have to validate UTF before lexing, then you will end up needlessly scanning lots of ASCII if the file contains lots of non-string data or comes from an encoder that only emits pure ASCII.

If you want "plugin" validation of strings, then you also need to differentiate string types so that the user can select which fields should be plain ASCII, UTF-8, numbers, IDs, etc. Otherwise the user ends up doing double validation (the lexer has to scan past bytes >0x7F until the string terminator anyway).

The advantage of integrated validation is that you can use 16-byte SIMD registers on the buffer.

I presume you can load 16 bytes and do a bitwise AND on the MSBs, then match against the string terminator, and carefully combine these to do UTF validation, escape scanning, and string-end scanning simultaneously. A bit tricky, of course.

> At least no UTF validation is needed. Since all non-ASCII characters will always be composed of bytes >0x7F, a sequence \uXXXX can be assumed to be valid wherever in the string it occurs, and all other bytes that don't belong to an escape sequence are just passed through as-is.

You cannot assume a \uXXXX escape to be valid if you convert it, e.g. it may encode an unpaired surrogate.
