On 26.08.2014 10:24, "Ola Fosheim Grøstad" <[email protected]> wrote:
On Tuesday, 26 August 2014 at 07:51:04 UTC, Sönke Ludwig wrote:
That's true. So the ideal solution would be to *assume* UTF-8 when the
input is char based and to *validate* if the input is "numeric".

I think you should validate that JSON strings are UTF-8 encoded even if you
allow illegal Unicode values. Basically ensuring that a byte >0x7F is
followed by the right number of continuation bytes, so you don't get >0x7F
as the last byte in a string, etc.

I think this is a misunderstanding. What I mean is that if the input range passed to the lexer is char/wchar/dchar based, the lexer should assume that the input is well-formed UTF. After all, this is how D strings are defined.

When on the other hand a ubyte/ushort/uint range is used, the lexer should validate all string literals.
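A minimal sketch of how that distinction could look, dispatching on the range's element type (the `lexString` name and the bodies are illustrative, not the proposal's actual code):

```d
import std.range : ElementType;
import std.traits : isSomeChar;

void lexString(Input)(ref Input input)
{
    static if (isSomeChar!(ElementType!Input))
    {
        // char/wchar/dchar input: assume well-formed UTF, as D strings
        // are defined to be, and just scan for the closing quote
    }
    else
    {
        // ubyte/ushort/uint input: validate each string literal while
        // scanning, rejecting malformed sequences
    }
}
```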


Well, that's something that's definitely out of the scope of this
proposal. Definitely an interesting direction to pursue, though.

Maybe the interface/code structure is or could be designed so that the
implementation could later be version()'ed to SIMD where possible.

I guess that shouldn't be an issue. From the outside, it's just a generic range that is passed in, and internally it's always possible to add special cases for array inputs. If someone else wants to play around with this idea, we could of course also integrate it right away; it's just that I personally don't have the time to go to the extreme here.

You cannot assume \u… to be valid if you convert it.

I meant "X" to stand for a hex digit. The point was just that you
don't have to worry about interacting in a bad way with UTF sequences
when you find "\uXXXX".

When you convert "\uXXXX" to UTF-8 bytes, is it then validated as a
legal code point? I guess it is not necessary.

What is validated is that it forms valid UTF-16 surrogate pairs, and those are converted to a single dchar instead (if applicable). This is necessary, because otherwise the lexer would produce invalid UTF-8 for valid inputs. Apart from that, the value is used verbatim as a dchar.
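For reference, combining a surrogate pair parsed from two consecutive \uXXXX escapes into a single dchar follows the standard UTF-16 formula; a sketch (illustrative, not the lexer's actual code):

```d
// Combine a UTF-16 surrogate pair into a single code point.
// High surrogates are 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF.
dchar combineSurrogates(wchar high, wchar low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low  >= 0xDC00 && low  <= 0xDFFF);
    return cast(dchar)(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00));
}
```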


Btw, I believe rapidJSON achieves high speed by converting strings in
situ, so that if the prefix is escape free it just converts in place
when it hits the first escape. Thus avoiding some moving.

The same is true for this lexer, at least for array inputs. It currently just stores a slice of the string literal in all cases and lazily decodes it on first access. While doing that, it first skips any escape-free prefix and returns a slice if the whole string is free of escape sequences.
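In rough outline (hypothetical name, array input only), that lazy strategy amounts to something like:

```d
import std.array : appender;
import std.string : indexOf;

// Return the raw slice unchanged when it contains no escapes; otherwise
// copy the escape-free prefix and decode the rest into a new buffer.
string unescape(string raw)
{
    auto idx = raw.indexOf('\\');
    if (idx < 0) return raw; // escape-free: no allocation, just the slice

    auto result = appender!string();
    result.put(raw[0 .. idx]); // copy the escape-free prefix verbatim
    // ... decode escape sequences in raw[idx .. $] into result ...
    return result.data;
}
```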
