On Sat, Oct 23, 2010 at 8:33 PM, Terry Laurenzo <t...@laurenzo.org> wrote:
>>
>> I'm still going to write up a proposed grammar that takes these items into
>> account - just ran out of time tonight.
>>
>
> The binary format I was thinking of is here:
>
> http://github.com/tlaurenzo/pgjson/blob/master/pgjson/shared/include/json/jsonbinary.h
> This was just a quick brain dump and I haven't done a lot of diligence on
> verifying it, but I think it should be more compact than most JSON text
> payloads and quick to iterate over/update in sibling traversal order vs
> depth-first traversal which is what we would get out of JSON text.
> Thoughts?
> Terry

It doesn't do particularly well on my previous example of [1,2,3]. It
comes out slightly shorter on ["a","b","c"] and better if the strings
need any escaping. I don't think the float4 and float8 formats are
very useful; how could you be sure that the output was going to look
the same as the input? Or alternatively that dumping a particular
object to text and reloading it will produce the same internal
representation? I think it would be simpler to represent integers
using a string of digits; that way you can be assured of going from
text -> binary -> text without change.

Perhaps it would be enough to define the high two bits as follows:

  00 = array
  01 = object
  10 = string
  11 = number/true/false/null

The next 2 bits specify how the length is stored:

  00 = remaining 4 bits store a length of up to 15 bytes
  01 = remaining 4 bits + 1 additional byte store a 12-bit length of
       up to 4K
  10 = remaining 4 bits + 2 additional bytes store a 20-bit length of
       up to 1MB
  11 = 4 additional bytes store a full 32-bit length word

Then, the array, object, and string representations can work as you've
specified them. Anything else can be represented by itself, or
perhaps we should say that numbers represent themselves and
true/false/null are represented by a 1-byte sequence, t/f/n (or
perhaps we could define 111100{00,01,10} to mean those values, since
there's no obvious reason for the low bits to be non-zero if a 4-byte
length word ostensibly follows).

So [1,2,3] = 06 C1 '1' C1 '2' C1 '3' and ["a","b","c"] = 06 81 'a' 81
'b' 81 'c'

(I am still worried about the serialization/deserialization overhead
but that's a different issue.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
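
For illustration, here is a minimal Python sketch of the header layout proposed in the message above. It is not taken from pgjson; all function and constant names are hypothetical. It assumes that multi-byte lengths are stored most-significant bits first (the message does not specify a byte order), that an array's length counts the bytes of its encoded elements, and that numbers are stored as their text digits. It reproduces the two byte sequences given in the message.

```python
# Hypothetical sketch of the proposed header byte: 2 type bits, 2 length-format
# bits, then 4 bits that hold the length (or its high bits) where applicable.
ARRAY, OBJECT, STRING, SCALAR = 0b00, 0b01, 0b10, 0b11


def encode_header(type_tag: int, length: int) -> bytes:
    """Pack a type tag and payload length into 1, 2, 3, or 5 bytes."""
    if not 0 <= length <= 0xFFFFFFFF:
        raise ValueError("length does not fit in 32 bits")
    top = type_tag << 6
    if length < (1 << 4):    # 00: length fits in the low 4 bits
        return bytes([top | (0b00 << 4) | length])
    if length < (1 << 12):   # 01: 4 bits + 1 extra byte (12-bit length)
        return bytes([top | (0b01 << 4) | (length >> 8), length & 0xFF])
    if length < (1 << 20):   # 10: 4 bits + 2 extra bytes (20-bit length)
        return bytes([top | (0b10 << 4) | (length >> 16),
                      (length >> 8) & 0xFF, length & 0xFF])
    # 11: low 4 bits unused, full 32-bit length follows (big-endian assumed)
    return bytes([top | (0b11 << 4)]) + length.to_bytes(4, "big")


def encode_scalar(text: str) -> bytes:
    """Numbers (and, in this sketch, true/false/null) stored as their text."""
    payload = text.encode("ascii")
    return encode_header(SCALAR, len(payload)) + payload


def encode_string(value: str) -> bytes:
    """Strings stored as unescaped UTF-8 bytes."""
    payload = value.encode("utf-8")
    return encode_header(STRING, len(payload)) + payload


def encode_array(encoded_items: list) -> bytes:
    """Array length is assumed to be the byte length of the encoded elements."""
    body = b"".join(encoded_items)
    return encode_header(ARRAY, len(body)) + body


numbers = encode_array([encode_scalar(d) for d in "123"])
strings = encode_array([encode_string(s) for s in "abc"])
print(numbers.hex())  # 06c131c132c133
print(strings.hex())  # 06816181628163
```

Because numbers are kept as digit strings, decoding simply returns the stored text, which is what gives the text -> binary -> text round trip without change.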