Yes, I meant the "row header index". What I'm doing is storing an object (e.g. 
UserProfile) that is read or written as a whole (a user updates their details 
on a single page in the UI). I serialize that object into binary JSON using the 
SMILE format, then compress it with Snappy on the client side. So as far as 
Cassandra is concerned, it's just storing a byte[].
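To make the write path concrete, here is a minimal sketch of that encode/decode round trip, assuming jackson-dataformat-smile and snappy-java on the classpath; the class and method names are illustrative, not from the original post:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;
import org.xerial.snappy.Snappy;

import java.io.IOException;
import java.util.Map;

// Hypothetical codec: object -> SMILE (binary JSON) -> Snappy-compressed
// byte[] on write, and the reverse on read. All CPU-heavy work happens on
// the client; Cassandra only ever sees the final byte[].
public class ProfileBlobCodec {
    private static final ObjectMapper SMILE = new ObjectMapper(new SmileFactory());

    // Serialize and compress a whole object for storage in a single column.
    public static byte[] encode(Object profile) throws IOException {
        byte[] smile = SMILE.writeValueAsBytes(profile); // binary JSON (SMILE)
        return Snappy.compress(smile);                   // client-side compression
    }

    // Decompress and deserialize the stored blob back into an object.
    public static <T> T decode(byte[] blob, Class<T> type) throws IOException {
        byte[] smile = Snappy.uncompress(blob);
        return SMILE.readValue(smile, type);
    }
}
```

The resulting byte[] is what gets handed to the Cassandra client as an opaque column value.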

On the client side, I'm using cassandra-cli with a custom type that knows how 
to turn a byte[] into JSON text and back. The only issue is CASSANDRA-4081, 
where "assume" doesn't work with custom types. Once CASSANDRA-4081 is fixed, 
I'll get the best of both worlds.

Also, the advantages of this approach vs. Thrift-based super column families are:

1. Saving extra CPU on the Cassandra nodes, since serialization/deserialization 
and compression/decompression happen on the client nodes, where there is plenty 
of idle CPU time.

2. Saving network bandwidth, since I'm sending a compressed byte[] over the wire.


-- Drew



On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:

> On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian <d...@venarc.com> wrote:
>>> I think this is a much better approach because that gives you the
>>> ability to update or retrieve just parts of objects efficiently,
>>> rather than making column values just blobs with a bunch of special
>>> case logic to introspect them.  Which feels like a big step backwards
>>> to me.
>> 
>> Unless your access pattern involves reading/writing the whole document each 
>> time. In that case you're better off serializing the whole document and 
>> storing it in a column as a byte[] without incurring the overhead of column 
>> indexes. Right?
> 
> Hmm, not sure what you're thinking of there.
> 
> If you mean the "index" that's part of the row header for random
> access within a row, then no, serializing to byte[] doesn't save you
> anything.
> 
> If you mean secondary indexes, don't declare any if you don't want any. :)
> 
> Just telling C* to store a byte[] *will* be slightly lighter-weight
> than giving it named columns, but we're talking negligible compared to
> the overhead of actually moving the data on or off disk in the first
> place.  Not even close to being worth giving up being able to deal
> with your data from standard tools like cqlsh, IMO.
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
