On 26.08.2014 10:24, "Ola Fosheim Grøstad" <[email protected]> wrote:
On Tuesday, 26 August 2014 at 07:51:04 UTC, Sönke Ludwig wrote:
That's true. So the ideal solution would be to *assume* UTF-8 when the
input is char based and to *validate* if the input is "numeric".

I think you should validate that JSON strings are UTF-8 encoded even if you
allow illegal Unicode values. Basically ensuring that a byte >0x7F is
followed by the right number of continuation bytes, so you don't get >0x7F
as the last byte in a string, etc.

I think this is a misunderstanding. What I mean is that if the input range passed to the lexer is char/wchar/dchar based, the lexer should assume that the input is well-formed UTF. After all, this is how D strings are defined.

When on the other hand a ubyte/ushort/uint range is used, the lexer should validate all string literals.
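A minimal sketch of how that distinction could look, dispatching on the range's element type (the `lexString` name and the bodies are illustrative, not the proposal's actual code):

```d
import std.range : ElementType;
import std.traits : isSomeChar;

void lexString(Input)(ref Input input)
{
    static if (isSomeChar!(ElementType!Input))
    {
        // char/wchar/dchar input: assume well-formed UTF, as D strings
        // are defined to be, and just scan for the closing quote
    }
    else
    {
        // ubyte/ushort/uint input: validate each string literal while
        // scanning, rejecting malformed sequences
    }
}
```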


Well, that's something that's definitely out of the scope of this
proposal. Definitely an interesting direction to pursue, though.

Maybe the interface/code structure is or could be designed so that the
implementation could later be version()'ed to SIMD where possible.

I guess that shouldn't be an issue. From the outside, it's just a generic range that is passed in, and internally it's always possible to add special cases for array inputs. If someone else wants to play around with this idea, we could of course also integrate it right away; it's just that I personally don't have the time to go to the extreme here.

You cannot assume \u… to be valid if you convert it.

I meant "X" to stand for a hex digit. The point was just that you
don't have to worry about interacting in a bad way with UTF sequences
when you find "\uXXXX".

When you convert "\uXXXX" to UTF-8 bytes, is it then validated as a
legal code point? I guess it is not necessary.

What is validated is that it forms valid UTF-16 surrogate pairs, and those are converted to a single dchar instead (if applicable). This is necessary, because otherwise the lexer would produce invalid UTF-8 for valid inputs. Apart from that, the value is used verbatim as a dchar.
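For reference, combining a surrogate pair parsed from two consecutive \uXXXX escapes into a single dchar follows the standard UTF-16 formula; a sketch (illustrative, not the lexer's actual code):

```d
// Combine a UTF-16 surrogate pair into a single code point.
// High surrogates are 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF.
dchar combineSurrogates(wchar high, wchar low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low  >= 0xDC00 && low  <= 0xDFFF);
    return cast(dchar)(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00));
}
```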


Btw, I believe rapidJSON achieves high speed by converting strings in
situ, so that if the prefix is escape free it just converts in place
when it hits the first escape. Thus avoiding some moving.

The same is true for this lexer, at least for array inputs. It currently just stores a slice of the string literal in all cases and lazily decodes it on first access. While doing that, it first skips any escape-free prefix and returns a slice if the whole string is free of escape sequences.
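In rough outline (hypothetical name, array input only), that lazy strategy amounts to something like:

```d
import std.array : appender;
import std.string : indexOf;

// Return the raw slice unchanged when it contains no escapes; otherwise
// copy the escape-free prefix and decode the rest into a new buffer.
string unescape(string raw)
{
    auto idx = raw.indexOf('\\');
    if (idx < 0) return raw; // escape-free: no allocation, just the slice

    auto result = appender!string();
    result.put(raw[0 .. idx]); // copy the escape-free prefix verbatim
    // ... decode escape sequences in raw[idx .. $] into result ...
    return result.data;
}
```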
