Re: Document storage

Drew Kutcharian Thu, 29 Mar 2012 11:05:51 -0700

Hi Ben,

Sure, there's nothing really to it, but I'll email it to you. As for why I'm 
using Snappy in the type instead of sstable_compression: when you set 
sstable_compression, the compression happens on the Cassandra nodes, and I see 
two advantages with my approach:


1. Saving CPU on the Cassandra nodes, since compression/decompression can 
easily be done on the client nodes, where there is plenty of idle CPU time.

2. Saving network bandwidth, since you're sending a compressed byte[] over the 
wire.

One thing to note about my approach is that when I define the schema in 
Cassandra, I define the columns as byte[] and not my custom type and I do all 
the conversion on the client side.
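
A minimal sketch of the client-side conversion described above, not the actual 
custom type from this thread. It is written in Python using only the standard 
library, so `json` stands in for Jackson's SMILE format and `zlib` stands in 
for Snappy (both substitutions are assumptions made for illustration). The 
function names are hypothetical. The column itself is assumed to be a plain 
bytes column, as noted above.

```python
import json
import zlib


def encode_document(doc):
    """Serialize a document and compress it on the client before sending."""
    # Assumption: json stands in for SMILE, zlib stands in for Snappy.
    raw = json.dumps(doc, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)


def decode_document(blob):
    """Reverse encode_document: decompress first, then parse."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def is_valid(blob):
    """Validate by attempting a full decode; no exception means valid."""
    try:
        decode_document(blob)
        return True
    except (zlib.error, ValueError):
        # ValueError covers both JSON parse errors and bad UTF-8.
        return False


# The compressed bytes are what would be written to the column.
doc = {"user": "drew", "tags": ["cassandra"] * 50}
blob = encode_document(doc)
restored = decode_document(blob)
```

Because the server only ever sees opaque bytes in this arrangement, both the 
compression cost and the validation happen on the client.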

-- Drew




On Mar 29, 2012, at 12:04 AM, Ben McCann wrote:

> Sounds awesome Drew.  Mind sharing your custom type?  I just wrote a basic
> JSON type and did the validation the same way you did, but I don't have any
> SMILE support yet.  It seems that if your type were committed to the
> Cassandra codebase then the issue you ran into of the CLI only supporting
> built-in types would no longer be a problem for you (though fixing the
> issue anyway would be good and I voted for it).  Btw, any reason you
> compress it with Snappy yourself instead of just setting sstable_compression
> to SnappyCompressor
> <http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression> and
> letting Cassandra do that part?
> 
> -Ben
> 
> 
> On Wed, Mar 28, 2012 at 11:28 PM, Drew Kutcharian <d...@venarc.com> wrote:
> 
>> I'm actually doing something almost the same. I serialize my objects into
>> byte[] using Jackson's SMILE format, then compress it using Snappy then
>> store the byte[] in Cassandra. I actually created a simple Cassandra Type
>> for this but I hit a wall with cassandra-cli:
>> 
>> https://issues.apache.org/jira/browse/CASSANDRA-4081
>> 
>> Please vote on the JIRA if you are interested.
>> 
>> Validation is pretty simple, you just need to read the value and parse it
>> using Jackson; if you don't get any exceptions, your JSON/SMILE is valid ;)
>> 
>> -- Drew
>> 
>> 
>> 
>> On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:
>> 
>>> I don't imagine sort is a meaningful operation on JSON data.  As long as
>>> the sorting is consistent I would think that should be sufficient.
>>> 
>>> 
>>> On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo <edlinuxg...@gmail.com>
>>> wrote:
>>> 
>>>> Some work I did stores JSON blobs in columns. The question on JSON
>>>> type is how to sort it.
>>>> 
>>>> On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
>>>> <jeremy.hanna1...@gmail.com> wrote:
>>>>> I don't speak for the project, but you might give it a day or two for
>>>>> people to respond and/or perhaps create a jira ticket.  Seems like that's a
>>>>> reasonable data type that would get some traction - a json type.  However,
>>>>> what would validation look like?  That's one of the main reasons there are
>>>>> the data types and validators, in order to validate on insert.
>>>>> 
>>>>> On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
>>>>> 
>>>>>> Any thoughts?  I'd like to submit a patch, but only if it will be
>>>>>> accepted.
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>> On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann <b...@benmccann.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I was wondering if it would be interesting to add some type of
>>>>>>> document-oriented data type.
>>>>>>> 
>>>>>>> I've found it somewhat awkward to store document-oriented data in
>>>>>>> Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
>>>>>>> store it, but Cassandra cannot differentiate it from any other string or
>>>>>>> byte array.  However, if my column validation_class could be a JsonType
>>>>>>> that would allow tools to potentially do more interesting introspection on
>>>>>>> the column value.  E.g. bug 3647
>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-3647> calls for supporting
>>>>>>> arbitrarily nested "documents" in CQL.  Running a
>>>>>>> query against the JSON column in Pig is possible as well, but again in this
>>>>>>> use case it would be helpful to be able to encode in column metadata that
>>>>>>> the column is stored as JSON.  For debugging, running nightly reports, etc.
>>>>>>> it would be quite useful compared to the opaque string and byte array types
>>>>>>> we have today.  JSON is appealing because it would be easy to implement.
>>>>>>> Something like Thrift or Protocol Buffers would actually be interesting
>>>>>>> since they would be more space efficient.  However, they would also be a
>>>>>>> bit more difficult to implement because of the extra typing information
>>>>>>> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
>>>>>>> storing JSON is not too inefficient.
>>>>>>> 
>>>>>>> Would there be interest in adding a JsonType?  I could look at putting a
>>>>>>> patch together.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> 
>>
