On Thursday, September 29, 2011, Riyad Kalla <[email protected]> wrote:
> DISCLAIMER: This looks long, but reads quickly (I hope). If you are in a
> rush, just check the last 2 sections and see if it sounds interesting.
>
>
> Hi everybody. I am new to the list, but a big fan of Couch and I have
> been working on something I wanted to share with the group.
>
> My apologies if this isn't the right venue or list etiquette... I wasn't
> really sure where to start with this conversation.
>
>
> Background
> =====================
> With the help of the JSON spec community I've been finalizing a
> universal, binary JSON format specification that offers 1:1 compatibility
> with JSON.
>
> The full spec is here (http://ubjson.org/) and the quick list of types is
> here (http://ubjson.org/type-reference/). Differences with existing specs
> and "why" are all addressed on the site in the first few sections.
>
> The goals of the specification were, first, to maintain 1:1 compatibility
> with JSON (no custom data structures - like what caused BSON to be
> rejected in Issue #702); second, to be as simple to work with as regular
> JSON (no complex data structures or encoding/decoding algorithms to
> implement); and last, to be smaller than compacted JSON and faster to
> generate and parse.
>
> Using a test doc that I see Filipe reference in a few of his issues
> (http://friendpaste.com/qdfyId8w1C5vkxROc5Thf) I get the following sizes:
>
> * Compacted JSON: 3,861 bytes
> * Univ. Binary JSON: 3,056 bytes (20% smaller)
>
> In some other sample data (e.g. the jvm-serializers sample data) I see a
> 27% reduction, with a typical range of 20-30%.
>
> While these compression levels are average, the data is written out in an
> unmolested format that is optimized for read speed (no scanning for null
> terminators) and criminally simple to work with. (win-win)
>
> I added more clarifying information about compression characteristics in
> the "Size Requirements" section of the spec for anyone interested.
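>
> To make the "simple to work with" claim concrete, here is a minimal
> encoder sketch in Python. The layout (a one-byte type marker, then a
> length where one is needed, then the payload) matches the spec's general
> shape, but the exact marker bytes below are illustrative only - see
> http://ubjson.org/type-reference/ for the authoritative codes:
>
>     import struct
>
>     def encode(value):
>         """Encode a JSON-compatible Python value into UBJSON-style bytes.
>         Marker bytes are illustrative, not the official spec codes."""
>         if value is None:
>             return b'Z'
>         if value is True:          # bools before ints: bool is an int subtype
>             return b'T'
>         if value is False:
>             return b'F'
>         if isinstance(value, int):
>             return b'I' + struct.pack('>i', value)   # 32-bit big-endian int
>         if isinstance(value, float):
>             return b'D' + struct.pack('>d', value)   # IEEE-754 double
>         if isinstance(value, str):
>             data = value.encode('utf-8')
>             # Length-prefixed: a reader jumps ahead by the length instead
>             # of scanning for a terminator - the read-speed point above.
>             return b'S' + struct.pack('>i', len(data)) + data
>         if isinstance(value, list):
>             payload = b''.join(encode(v) for v in value)
>             return b'A' + struct.pack('>i', len(value)) + payload
>         if isinstance(value, dict):
>             out = b'O' + struct.pack('>i', len(value))
>             for k, v in value.items():
>                 out += encode(str(k)) + encode(v)
>             return out
>         raise TypeError('not a JSON-compatible value: %r' % (value,))
>
>     print(len(encode({"ok": True, "seq": 42, "name": "couch"})))
>
> The decoder is the mirror image: read one marker byte, read a fixed-size
> length if the type calls for one, then slice - no state machine, no
> escaping rules.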
>
>
> Motivation
> ======================
> I've been following the discussions surrounding a native binary JSON
> format for the core CouchDB file (Issue #1092), which transformed into
> keeping the format and utilizing Google's Snappy (Issue #1120) to provide
> what looks to be roughly a 40-50% reduction in file size at the cost of
> running the compression/decompression on every read/write.
>
> I realize that in light of the HTTP transport and JSON encoding/decoding
> cycle in CouchDB, the Snappy compression cycles are a very small part of
> the total time the server spends working.
>
> I found this all interesting, but like I said, I realized up to this
> point that Snappy wasn't any form of bottleneck, and the big compression
> wins server side were great, so I had nothing to contribute to the
> conversation.
>
>
> Catalyst
> ======================
> This past week I watched Tim Anglade's presentation (http://goo.gl/LLucD)
> and started to foam at the mouth when I saw his slides where he skipped
> the JSON encode/decode cycle server-side, generated straight from binary
> on disk into MessagePack, and got some phenomenal speedups from the
> server: http://i.imgscalr.com/XKqXiLusT.png
>
> I pinged Tim to see what the chances of adding Univ Binary JSON support
> were, and he seemed amenable to the idea as long as I could hand him an
> Erlang or Ruby impl (unfortunately, I am not familiar with either).
>
>
> ah-HA! moment
> ======================
> Today it occurred to me that if CouchDB were able to (at the cost of 20%
> more disk space than it is using with Snappy enabled, but still 20%
> *less* than before Snappy was integrated) use the Universal Binary JSON
> format as its native storage format, AND support for serving replies
> using the same format was added (a la Tim's work), this would allow
> CouchDB to (theoretically) reply to queries by pulling bytes off disk (or
> memory) and immediately streaming them back to the caller with no
> intermediary step at all (no Snappy decompress, no Erlang decode, no JSON
> encode).
>
> Given that the Univ Binary JSON spec is standard, easy to parse and
> simple to convert back to JSON, adding support for it seemed more
> consistent with Couch's motto of ease and simplicity than, say,
> MessagePack or Protobuf, which provide better compression but at the cost
> of more complex formats and data types that have no analog in JSON.
>
> I don't know the intricacies of Couch's internals; if that is wrong and
> some Erlang manipulation of the data would still be required, I believe
> it would still be faster to pull the data off disk in the Univ Binary
> JSON format, decode to Erlang native types and then reply, while skipping
> the Snappy decompression step.
>
> If it *would* be possible to stream it back untouched directly from disk,
> that seems like an enhancement that could potentially speed up CouchDB by
> at least an order of magnitude.
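>
> To make that concrete, here is a sketch of the fast path, again in
> Python. The media type, the store, and the handler names are all made up
> for illustration - Couch's internals are Erlang and look nothing like
> this - but it shows the shape of the zero-transcoding reply:
>
>     import json
>
>     UBJSON_MIME = 'application/ubjson'  # assumed media type, not official
>
>     # Stand-in for the on-disk b-tree: docid -> bytes already stored as
>     # UBJSON, so there is no Snappy step on the way out.
>     STORE = {'doc1': b'...ubjson bytes...'}
>
>     def decode_ubjson(raw):
>         """Placeholder for a real UBJSON decoder."""
>         return {'placeholder': True}
>
>     def handle_get(docid, accept):
>         raw = STORE[docid]              # bytes straight off disk
>         if accept == UBJSON_MIME:
>             # Fast path: the stored bytes ARE the reply body -
>             # no decompress, no Erlang decode, no JSON encode.
>             return UBJSON_MIME, raw
>         # Compatibility path for plain-JSON clients: decode, re-encode.
>         return 'application/json', json.dumps(decode_ubjson(raw))
>
> Note that even the compatibility path skips one step (the Snappy
> decompress), which is the weaker claim two paragraphs up.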
>
>
> Conclusion
> =======================
> I would appreciate any feedback on this idea from you guys with a lot
> more knowledge of the internals.
>
> I have no problem if this is a horrible idea and never going to happen; I
> just wanted to try and contribute something back.
>
>
> Thank you all for reading.
>
> Best wishes,
> Riyad

what is universal in something new? - benoit
