This. And a decoded JSON string is never longer than its encoded form: JSON escapes are always at least as long as the character they stand for. Even in the case where someone sends a surrogate pair (legal in JSON), it's encoded as \uXXXX\uXXXX, twelve bytes that decode to four bytes of UTF-8. In short, it's absolutely possible to create a pull parser that never allocates, even for decoding. As proof, I've done it before. :-p
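
A minimal sketch of that in-place decoding idea, in D. This is an illustration only, not the parser mentioned above: the names decodeJSONInPlace, hex4 and putUtf8 are made up, and the code assumes the input is the contents of a well-formed JSON string (no surrounding quotes, no lone surrogates). Because every escape is at least as long as what it decodes to, the write cursor can never overtake the read cursor.

```d
/// Parses four hex digits (assumed valid) into a UTF-16 code unit.
uint hex4(const(char)[] s)
{
    uint v = 0;
    foreach (ch; s[0 .. 4])
    {
        uint d = ch >= '0' && ch <= '9' ? ch - '0'
               : ch >= 'a' && ch <= 'f' ? ch - 'a' + 10
               : ch - 'A' + 10;
        v = v * 16 + d;
    }
    return v;
}

/// Writes code point c as UTF-8 at the start of dst; returns the byte count.
size_t putUtf8(char[] dst, uint c)
{
    if (c < 0x80)
    {
        dst[0] = cast(char) c;
        return 1;
    }
    if (c < 0x800)
    {
        dst[0] = cast(char)(0xC0 | (c >> 6));
        dst[1] = cast(char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000)
    {
        dst[0] = cast(char)(0xE0 | (c >> 12));
        dst[1] = cast(char)(0x80 | ((c >> 6) & 0x3F));
        dst[2] = cast(char)(0x80 | (c & 0x3F));
        return 3;
    }
    dst[0] = cast(char)(0xF0 | (c >> 18));
    dst[1] = cast(char)(0x80 | ((c >> 12) & 0x3F));
    dst[2] = cast(char)(0x80 | ((c >> 6) & 0x3F));
    dst[3] = cast(char)(0x80 | (c & 0x3F));
    return 4;
}

/// Hypothetical helper: decodes JSON escapes by overwriting buf and returns
/// the decoded prefix of buf. No memory is allocated.
char[] decodeJSONInPlace(char[] buf)
{
    size_t r = 0, w = 0; // read and write cursors, with w <= r at all times
    while (r < buf.length)
    {
        if (buf[r] != '\\')
        {
            buf[w++] = buf[r++]; // ordinary byte: copy it down
            continue;
        }
        char esc = buf[r + 1];
        r += 2; // consume the backslash and the escape letter
        if (esc == 'u')
        {
            uint c = hex4(buf[r .. r + 4]);
            r += 4;
            if (c >= 0xD800 && c <= 0xDBFF)
            {
                // High surrogate: the low half follows as another \uXXXX.
                uint lo = hex4(buf[r + 2 .. r + 6]);
                r += 6;
                c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
            }
            w += putUtf8(buf[w .. $], c);
            continue;
        }
        switch (esc)
        {
            case 'n': buf[w++] = '\n'; break;
            case 't': buf[w++] = '\t'; break;
            case 'r': buf[w++] = '\r'; break;
            case 'b': buf[w++] = '\b'; break;
            case 'f': buf[w++] = '\f'; break;
            default:  buf[w++] = esc;  break; // \" \\ and \/ map to themselves
        }
    }
    return buf[0 .. w];
}

unittest
{
    // The .dup is only test setup; the decoder itself does not allocate.
    char[] buf = `a\u00e9\ud83d\ude00\n`.dup;
    auto s = decodeJSONInPlace(buf);
    assert(s == "a\u00e9\U0001F600\n");
    assert(s.ptr is buf.ptr && s.length < buf.length);
}
```

The unittest block runs when the file is compiled with -unittest. The result is a slice of the caller's buffer, so it is only valid until that buffer is reused.
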
On Feb 9, 2012, at 3:07 AM, Johannes Pfau wrote:
> On Wed, 08 Feb 2012 20:49:48 -0600, "Robert Jacques" wrote:
>
>> On Wed, 08 Feb 2012 02:12:57 -0600, Johannes Pfau wrote:
>>> On Tue, 07 Feb 2012 20:44:08 -0500, "Jonathan M Davis" wrote:
>>>> On Tuesday, February 07, 2012 00:56:40 Adam D. Ruppe wrote:
>>>>> On Monday, 6 February 2012 at 23:47:08 UTC, Jonathan M Davis
>> [snip]
>>>
>>> Using ranges of dchar directly can be horribly inefficient in some
>>> cases; you'll need at least some kind of buffered dchar range. Some
>>> std.json replacement code tried to use only dchar ranges and had to
>>> reassemble strings character by character using Appender. That sucks,
>>> especially if you're only interested in a small part of the data and
>>> don't care about the rest.
>>> So for pull/sax parsers: Use buffering, return strings (better:
>>> w/d/char[]) as slices to that buffer. If the user needs to keep a
>>> string, he can still copy it. (String decoding should also be done
>>> on-demand only).
>>
>> Speaking as the one proposing said Json replacement, I'd like to
>> point out that JSON strings != UTF strings: manual conversion is
>> required some of the time. And I use appender as a dynamic buffer in
>> exactly the manner you suggest. There's even an option to use a
>> string cache to minimize total memory usage. (Hmm... that
>> functionality should probably be re-factored out and made into its
>> own utility) That said, I do end up doing a bunch of useless encodes
>> and decodes, so I'm going to special case those away and add slicing
>> support for strings. wstrings and dstrings will still need to be
>> converted, as currently Json values only accept strings and therefore
>> Json tokens also only support strings. As a potential user of the
>> sax/pull interface, would you prefer the extra clutter of special side
>> channels for zero-copy wstrings and dstrings?
>
> Regarding wstrings and dstrings: Well, JSON seems to be UTF-8 in almost
> all cases, so it's not that important. But I think it should be
> possible to use templates to implement identical parsers for d/w/strings.
>
> Regarding the use of Appender: Long text ahead ;-)
>
> I think pull parsers should really be as fast as possible and low-level.
> For easy-to-use high-level stuff there's always DOM, and a safe,
> high-level serialization API should be implemented based on the
> PullParser as well. The serialization API would read only the requested
> data, skipping the rest:
> ----------------
> struct Data
> {
>     string link;
> }
> auto data = unserialize!Data(json);
> ----------------
>
> So in the PullParser we should
> avoid memory allocation whenever possible; I think we can even avoid it
> completely:
>
> I think dchar ranges are just the wrong input type for parsers; parsers
> should use buffered ranges or streams (which would be basically the
> same). We could then wrap real dchar ranges in a generic
> BufferedRange. This BufferedRange could use a static buffer, so
> there's no need to allocate anything.
>
> The pull parser should return slices to the original string (if the
> input is a string) or slices to the Range/Stream's buffer.
> Of course, such a slice is only valid till the pull parser is called
> again. The slice also wouldn't be decoded yet. And a slice string could
> only be as long as the buffer, but I don't think this is an issue; a
> 512KB buffer can already store 524288 characters.
>
> If the user wants to keep a string, he should really do
> decodeJSONString(data).idup. There's a little more opportunity for
> optimization: As long as a decoded JSON string is never longer than
> the encoded one (I don't know if it is), we could have a decodeJSONString
> function which overwrites the original buffer --> no memory allocation.
>
> If that's not the case, decodeJSONString has to allocate iff the
> decoded string is different.
> So we need a function which always returns
> the decoded string as a safe-to-keep copy, and a function which returns
> the decoded string as a slice if the decoded string is
> the same as the original.
>
> An example:
>
> string json = `
> {
>     "link":"http://www.google.com",
>     "useless_data":"lorem ipsum",
>     "more":{
>         "not interested":"yes"
>     }
> }`;
>
> Now I'm only interested in the link. It should be possible to parse that
> with zero memory allocations:
>
> auto parser = Parser(json);
> parser.popFront();
> while(!parser.empty)
> {
>     if(parser.front.type == KEY
>         && tempDecodeJSON(parser.front.value) == "link")
>     {
>         parser.popFront();
>         assert(!parser.empty && parser.front.type == VALUE);
>         return decodeJSON(parser.front.value); //Should return a slice
>     }
>     //Skip everything else;
>     parser.popFront();
> }
>
> tempDecodeJSON returns a decoded string, which (usually) isn't safe to
> store (it can/should be a slice to the internal buffer; here it's a
> slice to the original string, so it could be stored, but there's no
> guarantee). In this case, the call to tempDecodeJSON could even be left
> out, as we only search for "link", which contains no escapes and so
> doesn't need decoding.
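
The quoted loop relies on a Parser type that doesn't exist yet, so here is a rough, self-contained D sketch of the same lookup done directly on the input string. It is only an approximation of the idea, under some assumptions: the names findStringValue, rawString and skipSpace are invented for illustration, the input is assumed to be well-formed JSON, nesting depth is ignored (as in the quoted loop), and the key is compared in its raw, still-escaped form, which is fine for keys like "link" that contain no escapes. The returned value is a slice of the input, so nothing is allocated.

```d
/// Returns the index of the first non-whitespace character at or after pos.
size_t skipSpace(string s, size_t pos)
{
    while (pos < s.length
        && (s[pos] == ' ' || s[pos] == '\t' || s[pos] == '\n' || s[pos] == '\r'))
        ++pos;
    return pos;
}

/// text[pos] must be an opening quote. Returns the raw (still escaped)
/// contents of that string token as a slice of text and moves pos past
/// the closing quote.
string rawString(string text, ref size_t pos)
{
    size_t start = ++pos;
    while (text[pos] != '"')
        pos += text[pos] == '\\' ? 2 : 1; // step over escapes such as \"
    string contents = text[start .. pos];
    ++pos; // consume the closing quote
    return contents;
}

/// Hypothetical helper: finds the first key equal to `key` whose value is a
/// string and returns that value as a slice of json, or null if none exists.
string findStringValue(string json, string key)
{
    size_t pos = 0;
    while (pos < json.length)
    {
        if (json[pos] != '"')
        {
            ++pos; // skip braces, commas, numbers and whitespace
            continue;
        }
        string token = rawString(json, pos);
        size_t next = skipSpace(json, pos);
        if (next >= json.length || json[next] != ':')
            continue; // a string value, not a key
        if (token != key)
        {
            pos = next + 1; // not the key we want; keep scanning
            continue;
        }
        next = skipSpace(json, next + 1);
        if (next < json.length && json[next] == '"')
            return rawString(json, next);
        pos = next; // key matched but its value is not a string
    }
    return null;
}

unittest
{
    string json = `{
        "link":"http://www.google.com",
        "useless_data":"lorem ipsum",
        "more":{ "not interested":"yes" }
    }`;
    string link = findStringValue(json, "link");
    assert(link == "http://www.google.com");
    // The result points into the input: it is a slice, not a copy.
    assert(link.ptr >= json.ptr && link.ptr < json.ptr + json.length);
    assert(findStringValue(json, "more") is null);
}
```

A value found this way is still in its escaped form; it would need a decoding pass before use if it contained escapes, and an .idup if it had to outlive the input buffer.
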
