DISCLAIMER: This looks long, but reads quickly (I hope). If you are in a rush, just check the last 2 sections and see if it sounds interesting.
Hi everybody. I am new to the list, but a big fan of Couch, and I have been working on something I wanted to share with the group. My apologies if this isn't the right venue or list etiquette... I wasn't really sure where to start with this conversation.

Background
=====================
With the help of the JSON spec community I've been finalizing a universal, binary JSON format specification that offers 1:1 compatibility with JSON. The full spec is here (http://ubjson.org/) and the quick list of types is here (http://ubjson.org/type-reference/). Differences from existing specs, and the "why", are all addressed on the site in the first few sections.

The goals of the specification were, first, to maintain 1:1 compatibility with JSON (no custom data structures - like the ones that caused BSON to be rejected in Issue #702); second, to be as simple to work with as regular JSON (no complex data structures or encoding/decoding algorithms to implement); and last, to be smaller than compacted JSON and faster to generate and parse.

Using a test doc that I see Filipe reference in a few of his issues (http://friendpaste.com/qdfyId8w1C5vkxROc5Thf) I get the following compression:

* Compacted JSON: 3,861 bytes
* Univ. Binary JSON: 3,056 bytes (20% smaller)

In some other sample data (e.g. the jvm-serializers sample data) I see a 27% reduction, with a typical range of 20-30%. While these compression levels are modest, the data is written out in an unmolested format that is optimized for read speed (no scanning for null terminators) and criminally simple to work with (win-win). There is a rough sketch of the encoding idea just before the "ah-HA! moment" section below. I added more clarifying information about compression characteristics in the "Size Requirements" section of the spec for anyone interested.

Motivation
======================
I've been following the discussions surrounding a native binary JSON format for the core CouchDB file (Issue #1092), which transformed into keeping the format and utilizing Google's Snappy (Issue #1120) to provide what looks to be roughly a 40-50% reduction in file size, at the cost of running the compression/decompression on every read/write. I realize that in light of the HTTP transport and JSON encoding/decoding cycle in CouchDB, the Snappy compression cycles are a very small part of the total time the server spends working. I found this all interesting, but like I said, I realized that Snappy wasn't any form of bottleneck and the big compression wins server side were great, so up to this point I had nothing to contribute to the conversation.

Catalyst
======================
This past week I watched Tim Anglade's presentation (http://goo.gl/LLucD) and started to foam at the mouth when I saw his slides where he skipped the JSON encode/decode cycle server-side and generated MessagePack straight from the binary on disk, getting some phenomenal speedups from the server: http://i.imgscalr.com/XKqXiLusT.png

I pinged Tim to see what the chances were of adding Univ Binary JSON support, and he seemed amenable to the idea as long as I could hand him an Erlang or Ruby impl (unfortunately, I am not familiar with either).
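For anyone who doesn't want to click through the spec, here is a toy Python sketch of the general shape of the format: a one-byte type marker, then fixed-width size fields, then raw payload bytes, so a reader never scans for terminators. To be clear, this is NOT a conforming UBJSON encoder; the marker letters and length widths are simplified assumptions of mine. The real spec picks the smallest numeric/length representations, which is where the 20-30% savings over compacted JSON come from, so this toy (which always uses wide fixed sizes) won't actually beat compacted JSON on a tiny example.

import json
import struct

# Toy encoder: 1-byte marker + fixed-width sizes + raw payload.
# Marker letters and widths are simplified assumptions, not the
# official ones from http://ubjson.org/type-reference/.
def encode(value):
    if value is None:
        return b"Z"
    if value is True:
        return b"T"
    if value is False:
        return b"F"
    if isinstance(value, int):
        return b"I" + struct.pack(">q", value)              # 8-byte big-endian int
    if isinstance(value, float):
        return b"D" + struct.pack(">d", value)              # 8-byte float
    if isinstance(value, str):
        data = value.encode("utf-8")
        return b"S" + struct.pack(">I", len(data)) + data   # length-prefixed, no terminator
    if isinstance(value, list):
        return b"[" + b"".join(encode(v) for v in value) + b"]"
    if isinstance(value, dict):
        pairs = b"".join(encode(k) + encode(v) for k, v in value.items())
        return b"{" + pairs + b"}"
    raise TypeError("unsupported type: %r" % type(value))

doc = {"_id": "0001", "active": True, "score": 97.5, "tags": ["db", "json"]}
print(len(encode(doc)), "bytes binary,",
      len(json.dumps(doc, separators=(",", ":")).encode("utf-8")), "bytes compacted JSON")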
ah-HA! moment
======================
Today it occurred to me that if CouchDB were able to use the Universal Binary JSON format as its native storage format (at the cost of 20% more disk space than it is using with Snappy enabled, but still 20% *less* than before Snappy was integrated), AND support were added for serving replies in the same format (a la Tim's work), this would allow CouchDB to (theoretically) reply to queries by pulling bytes off disk (or memory) and immediately streaming them back to the caller with no intermediary step at all: no Snappy decompress, no Erlang decode, no JSON encode. There is a rough sketch of what that reply path could look like in the P.S. below.

Given that the Univ Binary JSON spec is standard, easy to parse and simple to convert back to JSON, adding support for it seemed more consistent with Couch's motto of ease and simplicity than, say, MessagePack or Protobuf, which provide better compression but at the cost of more complex formats and data types that have no analog in JSON.

I don't know the intricacies of Couch's internals; if that assumption is wrong and some Erlang manipulation of the data would still be required, I believe it would still be faster to pull the data off disk in the Univ Binary JSON format, decode it to Erlang native types and then reply, skipping the Snappy decompression step. If it *would* be possible to stream it back untouched directly from disk, that seems like an enhancement that could potentially speed up CouchDB by as much as an order of magnitude.

Conclusion
=======================
I would appreciate any feedback on this idea from those of you with a lot more knowledge of the internals. I have no problem if this is a horrible idea and never going to happen; I just wanted to try and contribute something back.

Thank you all for reading.

Best wishes,
Riyad
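P.S. For anyone who wants a concrete picture of "no intermediary step", here is a rough, purely hypothetical Python sketch of the reply path. This is not how CouchDB's internals are structured (and I can't write it in Erlang myself); the names, the storage/json_fallback parameters and the "application/ubjson" media type are all invented for illustration.

# Hypothetical reply path, assuming documents are stored on disk as UBJSON.
# "storage" stands in for the doc store (doc id -> stored bytes) and
# "json_fallback" for whatever existing code converts UBJSON back to JSON.
def handle_get(doc_id, accept_header, storage, json_fallback):
    raw = storage[doc_id]                      # bytes exactly as stored on disk
    if "application/ubjson" in accept_header:  # made-up media type for the example
        # Fast path: no Snappy decompress, no Erlang decode, no JSON encode.
        # The stored bytes *are* the response body.
        return "application/ubjson", raw
    # Plain-JSON clients still work; one conversion step, still no Snappy pass.
    return "application/json", json_fallback(raw)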
