On 25.08.2014 22:51, "Ola Fosheim Grøstad" <[email protected]> wrote:
> On Monday, 25 August 2014 at 20:35:32 UTC, Sönke Ludwig wrote:
>> BTW, JSON is *required* to be UTF encoded anyway as per RFC 7159,
>> which is another argument for just letting the lexer assume valid UTF.
>
> The lexer cannot assume valid UTF since the client might be a rogue,
> but it can just bail out if the lookahead isn't JSON? So UTF
> validation is limited to strings.
But why should UTF validation be the job of the lexer in the first
place? D's "string" type is defined to be UTF-8 anyway, so a lexer that
takes a "string" as input would be free to assume valid UTF-8. I agree
with Walter there that validation/conversion should be provided as a
separate proxy range. But if we do end up validating in the lexer, it
would indeed be enough to validate inside strings, because the rest of
the grammar only uses a subset of ASCII.
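To make that concrete, here is a rough byte-level sketch (Python for illustration, not D, and not the actual lexer under discussion; names, structure, and the lack of error recovery are all mine): outside of string literals, every byte of well-formed JSON is in the ASCII range, so the scanner never has to decode UTF-8 between tokens.

```python
# Illustrative only: JSON structure outside strings is pure ASCII,
# so a byte-level scanner needs no UTF-8 decoding between tokens.
STRUCTURAL = set(b'{}[]:,')
WHITESPACE = set(b' \t\r\n')

def lex_structure(data: bytes):
    """Yield (kind, raw_bytes) tokens; only string bodies may
    contain bytes > 0x7F. No error recovery, sketch quality."""
    i = 0
    while i < len(data):
        b = data[i]
        if b in WHITESPACE:
            i += 1
        elif b in STRUCTURAL:
            yield ('punct', data[i:i+1]); i += 1
        elif b == 0x22:  # '"': scan the raw string body without decoding
            j = i + 1
            while data[j] != 0x22:                # unterminated -> IndexError
                j += 2 if data[j] == 0x5C else 1  # skip '\' + escaped char
            yield ('string', data[i:j+1]); i = j + 1
        elif b < 0x80:  # numbers, true/false/null: all ASCII
            j = i
            while (j < len(data) and data[j] < 0x80
                   and data[j] not in STRUCTURAL
                   and data[j] not in WHITESPACE):
                j += 1
            yield ('scalar', data[i:j]); i = j
        else:
            raise ValueError("non-ASCII byte outside a string literal")
```

Note how any byte > 0x7F outside a string is simply rejected, while inside a string it is skipped over untouched; UTF-8 continuation bytes are always >= 0x80, so they can never be mistaken for the closing quote or a backslash.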
> You have to parse the strings because of the \uXXXX escapes of
> course, so some basic validation is unavoidable?
At least no UTF validation is needed for that. Since non-ASCII
characters in UTF-8 are always composed entirely of bytes >0x7F, an
ASCII escape sequence \uXXXX can never start in the middle of a
multi-byte sequence, so it can be treated as valid wherever it occurs
in the string; all other bytes that don't belong to an escape sequence
are just passed through as-is. But I guess full validation of string
content could be another useful option, along with "ignore escapes" for
cases where you want to avoid decode-encode scenarios (like for a
proxy, or if you store pre-escaped Unicode in a database).