On Thursday, September 29, 2011, Riyad Kalla <[email protected]> wrote:
> DISCLAIMER: This looks long, but reads quickly (I hope). If you are in a
> rush, just check the last 2 sections and see if it sounds interesting.
>
>
> Hi everybody. I am new to the list, but a big fan of Couch and I have
> been working on something I wanted to share with the group.
>
> My apologies if this isn't the right venue or list etiquette... I wasn't
> really sure where to start with this conversation.
>
>
> Background
> =====================
> With the help of the JSON spec community I've been finalizing a
> universal, binary JSON format specification that offers 1:1 compatibility
> with JSON.
>
> The full spec is here (http://ubjson.org/) and the quick list of types is
> here (http://ubjson.org/type-reference/). Differences with existing specs
> and "why" are all addressed on the site in the first few sections.
>
> The goals of the specification were, first, to maintain 1:1 compatibility
> with JSON (no custom data structures - like what caused BSON to be
> rejected in Issue #702); second, to be as simple to work with as regular
> JSON (no complex data structures or encoding/decoding algorithms to
> implement); and last, to be smaller than compacted JSON and faster to
> generate and parse.
>
> Using a test doc that I see Filipe reference in a few of his issues
> (http://friendpaste.com/qdfyId8w1C5vkxROc5Thf) I get the following sizes:
>
> * Compacted JSON: 3,861 bytes
> * Univ. Binary JSON: 3,056 bytes (20% smaller)
>
> In some other sample data (e.g. the jvm-serializers sample data) I see a
> 27% reduction, with a typical range of 20-30%.
>
> While these compression levels are average, the data is written out in an
> unmolested format that is optimized for read speed (no scanning for null
> terminators) and criminally simple to work with. (win-win)
>
> I added more clarifying information about compression characteristics in
> the "Size Requirements" section of the spec for anyone interested.
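>
> To make the "simple to work with" claim concrete, here is a minimal
> encoder sketch in Python. The layout (a one-byte type marker, then a
> length where one is needed, then the payload) matches the spec's general
> shape, but the exact marker bytes below are illustrative only - see
> http://ubjson.org/type-reference/ for the authoritative codes:
>
>     import struct
>
>     def encode(value):
>         """Encode a JSON-compatible Python value into UBJSON-style bytes.
>         Marker bytes are illustrative, not the official spec codes."""
>         if value is None:
>             return b'Z'
>         if value is True:          # bools before ints: bool is an int subtype
>             return b'T'
>         if value is False:
>             return b'F'
>         if isinstance(value, int):
>             return b'I' + struct.pack('>i', value)   # 32-bit big-endian int
>         if isinstance(value, float):
>             return b'D' + struct.pack('>d', value)   # IEEE-754 double
>         if isinstance(value, str):
>             data = value.encode('utf-8')
>             # Length-prefixed: a reader jumps ahead by the length instead
>             # of scanning for a terminator - the read-speed point above.
>             return b'S' + struct.pack('>i', len(data)) + data
>         if isinstance(value, list):
>             payload = b''.join(encode(v) for v in value)
>             return b'A' + struct.pack('>i', len(value)) + payload
>         if isinstance(value, dict):
>             out = b'O' + struct.pack('>i', len(value))
>             for k, v in value.items():
>                 out += encode(str(k)) + encode(v)
>             return out
>         raise TypeError('not a JSON-compatible value: %r' % (value,))
>
>     print(len(encode({"ok": True, "seq": 42, "name": "couch"})))
>
> The decoder is the mirror image: read one marker byte, read a fixed-size
> length if the type calls for one, then slice - no state machine, no
> escaping rules.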
>
>
> Motivation
> ======================
> I've been following the discussions surrounding a native binary JSON
> format for the core CouchDB file (Issue #1092), which transformed into
> keeping the format and utilizing Google's Snappy (Issue #1120) to provide
> what looks to be roughly a 40-50% reduction in file size at the cost of
> running the compression/decompression on every read/write.
>
> I realize that in light of the HTTP transport and JSON encoding/decoding
> cycle in CouchDB, the Snappy compression cycles are a very small part of
> the total time the server spends working.
>
> I found this all interesting, but like I said, I realized up to this
> point that Snappy wasn't any form of bottleneck, and the big compression
> wins server side were great, so I had nothing to contribute to the
> conversation.
>
>
> Catalyst
> ======================
> This past week I watched Tim Anglade's presentation (http://goo.gl/LLucD)
> and started to foam at the mouth when I saw his slides where he skipped
> the JSON encode/decode cycle server-side, generated straight from binary
> on disk into MessagePack, and got some phenomenal speedups from the
> server: http://i.imgscalr.com/XKqXiLusT.png
>
> I pinged Tim to see what the chances of adding Univ Binary JSON support
> were, and he seemed amenable to the idea as long as I could hand him an
> Erlang or Ruby impl (unfortunately, I am not familiar with either).
>
>
> ah-HA! moment
> ======================
> Today it occurred to me that if CouchDB were able to (at the cost of 20%
> more disk space than it is using with Snappy enabled, but still 20%
> *less* than before Snappy was integrated) use the Universal Binary JSON
> format as its native storage format, AND support for serving replies
> using the same format was added (a la Tim's work), this would allow
> CouchDB to (theoretically) reply to queries by pulling bytes off disk (or
> memory) and immediately streaming them back to the caller with no
> intermediary step at all (no Snappy decompress, no Erlang decode, no JSON
> encode).
>
> Given that the Univ Binary JSON spec is standard, easy to parse and
> simple to convert back to JSON, adding support for it seemed more
> consistent with Couch's motto of ease and simplicity than, say,
> MessagePack or Protobuf, which provide better compression but at the cost
> of more complex formats and data types that have no analog in JSON.
>
> I don't know the intricacies of Couch's internals; if that is wrong and
> some Erlang manipulation of the data would still be required, I believe
> it would still be faster to pull the data off disk in the Univ Binary
> JSON format, decode to Erlang native types and then reply, while skipping
> the Snappy decompression step.
>
> If it *would* be possible to stream it back untouched directly from disk,
> that seems like an enhancement that could potentially speed up CouchDB by
> at least an order of magnitude.
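>
> To make that concrete, here is a sketch of the fast path, again in
> Python. The media type, the store, and the handler names are all made up
> for illustration - Couch's internals are Erlang and look nothing like
> this - but it shows the shape of the zero-transcoding reply:
>
>     import json
>
>     UBJSON_MIME = 'application/ubjson'  # assumed media type, not official
>
>     # Stand-in for the on-disk b-tree: docid -> bytes already stored as
>     # UBJSON, so there is no Snappy step on the way out.
>     STORE = {'doc1': b'...ubjson bytes...'}
>
>     def decode_ubjson(raw):
>         """Placeholder for a real UBJSON decoder."""
>         return {'placeholder': True}
>
>     def handle_get(docid, accept):
>         raw = STORE[docid]              # bytes straight off disk
>         if accept == UBJSON_MIME:
>             # Fast path: the stored bytes ARE the reply body -
>             # no decompress, no Erlang decode, no JSON encode.
>             return UBJSON_MIME, raw
>         # Compatibility path for plain-JSON clients: decode, re-encode.
>         return 'application/json', json.dumps(decode_ubjson(raw))
>
> Note that even the compatibility path skips one step (the Snappy
> decompress), which is the weaker claim two paragraphs up.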
>
>
> Conclusion
> =======================
> I would appreciate any feedback on this idea from you guys with a lot
> more knowledge of the internals.
>
> I have no problem if this is a horrible idea and never going to happen; I
> just wanted to try and contribute something back.
>
>
> Thank you all for reading.
>
> Best wishes,
> Riyad

what is universal in something new? - benoit
