DISCLAIMER: This looks long, but reads quickly (I hope). If you are in a rush, just check the last 2 sections and see if it sounds interesting.
Hi everybody. I am new to the list, but a big fan of Couch, and I have been working on something I wanted to share with the group. My apologies if this isn't the right venue or list etiquette... I wasn't really sure where to start with this conversation.

Background
=====================
With the help of the JSON spec community I've been finalizing a universal, binary JSON format specification that offers 1:1 compatibility with JSON. The full spec is here (http://ubjson.org/) and the quick list of types is here (http://ubjson.org/type-reference/). Differences from existing specs, and the "why", are all addressed on the site in the first few sections.

The goals of the specification were, first, to maintain 1:1 compatibility with JSON (no custom data structures - like the ones that caused BSON to be rejected in Issue #702); second, to be as simple to work with as regular JSON (no complex data structures or encoding/decoding algorithms to implement); and last, to be smaller than compacted JSON and faster to generate and parse.

Using a test doc that I see Filipe reference in a few of his issues (http://friendpaste.com/qdfyId8w1C5vkxROc5Thf) I get the following compression:

* Compacted JSON: 3,861 bytes
* Univ. Binary JSON: 3,056 bytes (20% smaller)

In some other sample data (e.g. the jvm-serializers sample data) I see a 27% reduction, with a typical range of 20-30%. While these compression levels are modest, the data is written out in an unmolested format that is optimized for read speed (no scanning for null terminators) and criminally simple to work with (win-win). There is a rough sketch of the encoding idea just before the "ah-HA! moment" section below. I added more clarifying information about compression characteristics in the "Size Requirements" section of the spec for anyone interested.

Motivation
======================
I've been following the discussions surrounding a native binary JSON format for the core CouchDB file (Issue #1092), which transformed into keeping the format and utilizing Google's Snappy (Issue #1120) to provide what looks to be roughly a 40-50% reduction in file size, at the cost of running the compression/decompression on every read/write. I realize that in light of the HTTP transport and JSON encoding/decoding cycle in CouchDB, the Snappy compression cycles are a very small part of the total time the server spends working. I found this all interesting, but like I said, I realized that Snappy wasn't any form of bottleneck and the big compression wins server side were great, so up to this point I had nothing to contribute to the conversation.

Catalyst
======================
This past week I watched Tim Anglade's presentation (http://goo.gl/LLucD) and started to foam at the mouth when I saw his slides where he skipped the JSON encode/decode cycle server-side and generated MessagePack straight from the binary on disk, getting some phenomenal speedups from the server: http://i.imgscalr.com/XKqXiLusT.png

I pinged Tim to see what the chances were of adding Univ Binary JSON support, and he seemed amenable to the idea as long as I could hand him an Erlang or Ruby impl (unfortunately, I am not familiar with either).
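For anyone who doesn't want to click through the spec, here is a toy Python sketch of the general shape of the format: a one-byte type marker, then fixed-width size fields, then raw payload bytes, so a reader never scans for terminators. To be clear, this is NOT a conforming UBJSON encoder; the marker letters and length widths are simplified assumptions of mine. The real spec picks the smallest numeric/length representations, which is where the 20-30% savings over compacted JSON come from, so this toy (which always uses wide fixed sizes) won't actually beat compacted JSON on a tiny example.

import json
import struct

# Toy encoder: 1-byte marker + fixed-width sizes + raw payload.
# Marker letters and widths are simplified assumptions, not the
# official ones from http://ubjson.org/type-reference/.
def encode(value):
    if value is None:
        return b"Z"
    if value is True:
        return b"T"
    if value is False:
        return b"F"
    if isinstance(value, int):
        return b"I" + struct.pack(">q", value)              # 8-byte big-endian int
    if isinstance(value, float):
        return b"D" + struct.pack(">d", value)              # 8-byte float
    if isinstance(value, str):
        data = value.encode("utf-8")
        return b"S" + struct.pack(">I", len(data)) + data   # length-prefixed, no terminator
    if isinstance(value, list):
        return b"[" + b"".join(encode(v) for v in value) + b"]"
    if isinstance(value, dict):
        pairs = b"".join(encode(k) + encode(v) for k, v in value.items())
        return b"{" + pairs + b"}"
    raise TypeError("unsupported type: %r" % type(value))

doc = {"_id": "0001", "active": True, "score": 97.5, "tags": ["db", "json"]}
print(len(encode(doc)), "bytes binary,",
      len(json.dumps(doc, separators=(",", ":")).encode("utf-8")), "bytes compacted JSON")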
ah-HA! moment
======================
Today it occurred to me that if CouchDB were able to use the Universal Binary JSON format as its native storage format (at the cost of 20% more disk space than it is using with Snappy enabled, but still 20% *less* than before Snappy was integrated), AND support were added for serving replies in the same format (a la Tim's work), this would allow CouchDB to (theoretically) reply to queries by pulling bytes off disk (or memory) and immediately streaming them back to the caller with no intermediary step at all: no Snappy decompress, no Erlang decode, no JSON encode. There is a rough sketch of what that reply path could look like in the P.S. below.

Given that the Univ Binary JSON spec is standard, easy to parse and simple to convert back to JSON, adding support for it seemed more consistent with Couch's motto of ease and simplicity than, say, MessagePack or Protobuf, which provide better compression but at the cost of more complex formats and data types that have no analog in JSON.

I don't know the intricacies of Couch's internals; if that assumption is wrong and some Erlang manipulation of the data would still be required, I believe it would still be faster to pull the data off disk in the Univ Binary JSON format, decode it to Erlang native types and then reply, skipping the Snappy decompression step. If it *would* be possible to stream it back untouched directly from disk, that seems like an enhancement that could potentially speed up CouchDB by as much as an order of magnitude.

Conclusion
=======================
I would appreciate any feedback on this idea from those of you with a lot more knowledge of the internals. I have no problem if this is a horrible idea and never going to happen; I just wanted to try and contribute something back.

Thank you all for reading.

Best wishes,
Riyad
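P.S. For anyone who wants a concrete picture of "no intermediary step", here is a rough, purely hypothetical Python sketch of the reply path. This is not how CouchDB's internals are structured (and I can't write it in Erlang myself); the names, the storage/json_fallback parameters and the "application/ubjson" media type are all invented for illustration.

# Hypothetical reply path, assuming documents are stored on disk as UBJSON.
# "storage" stands in for the doc store (doc id -> stored bytes) and
# "json_fallback" for whatever existing code converts UBJSON back to JSON.
def handle_get(doc_id, accept_header, storage, json_fallback):
    raw = storage[doc_id]                      # bytes exactly as stored on disk
    if "application/ubjson" in accept_header:  # made-up media type for the example
        # Fast path: no Snappy decompress, no Erlang decode, no JSON encode.
        # The stored bytes *are* the response body.
        return "application/ubjson", raw
    # Plain-JSON clients still work; one conversion step, still no Snappy pass.
    return "application/json", json_fallback(raw)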
