Re: Document storage
> Just telling C* to store a byte[] *will* be slightly lighter-weight than giving it named columns, but we're talking negligible compared to the overhead of actually moving the data on or off disk in the first place.

Hm - but isn't this exactly the point? You don't want to move data off disk. But decomposing into columns will lead to more of that:

- The total amount of serialized data is (in most cases a lot) larger than the protobuffed / compressed version.
- If you do selective updates, the document will be scattered over multiple SSTables, and if you do sliced reads you can't optimize them. The single-column version, by contrast, automatically supersedes older versions when updated, so most reads will hit only one SSTable.

All these reads make up the hot dataset. If it fits the page cache, you're fine. If it doesn't, you need to buy more iron.

I really could not resist, because your statement seems to be contrary to all our tests / learnings.

Cheers,
Daniel

From the dev list:

Re: Document storage

On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com wrote:

>> I think this is a much better approach because that gives you the ability to update or retrieve just parts of objects efficiently, rather than making column values just blobs with a bunch of special-case logic to introspect them. Which feels like a big step backwards to me.
>
> Unless your access pattern involves reading/writing the whole document each time. In that case you're better off serializing the whole document and storing it in a column as a byte[] without incurring the overhead of column indexes. Right?

Hmm, not sure what you're thinking of there. If you mean the index that's part of the row header for random access within a row, then no, serializing to byte[] doesn't save you anything. If you mean secondary indexes, don't declare any if you don't want any. :)

Just telling C* to store a byte[] *will* be slightly lighter-weight than giving it named columns, but we're talking negligible compared to the overhead of actually moving the data on or off disk in the first place. Not even close to being worth giving up being able to deal with your data from standard tools like cqlsh, IMO.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
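Daniel's size argument can be made concrete with a small sketch (not from the thread; the document and helper names are illustrative, and zlib stands in for whatever compressor a real deployment would use). It compares a single compressed blob against the decomposed layout, where every leaf value carries its full column path as overhead:

```python
import json
import zlib

# A hypothetical document, stored either as one compressed blob
# or decomposed into one column per field.
doc = {
    "firstName": "ben",
    "skills": ["java", "javascript", "html"],
    "education": {"school": "cmu", "major": "computer science"},
}

# Single-column approach: one serialized, compressed payload.
blob = zlib.compress(json.dumps(doc).encode("utf-8"))

def column_size(path, value):
    # Each decomposed column stores its full path plus its value.
    return len("/".join(map(str, path)).encode("utf-8")) + len(str(value).encode("utf-8"))

def decomposed_size(node, path=()):
    # Sum the on-disk footprint of every leaf stored as its own column.
    if isinstance(node, dict):
        return sum(decomposed_size(v, path + (k,)) for k, v in node.items())
    if isinstance(node, list):
        return sum(decomposed_size(v, path + (i,)) for i, v in enumerate(node))
    return column_size(path, node)

print(len(blob), decomposed_size(doc))
```

The blob also round-trips losslessly (`json.loads(zlib.decompress(blob)) == doc`), which is the property the single-column approach relies on.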
Re: Document storage
Do we also need to consider the client API? If we don't adjust Thrift, the client just gets bytes back, right? The client is then on their own to marshal them back into a structure. In this case, it seems like we would want to choose a standard that is efficient and for which there are common libraries; Protobuf seems to fit the bill here. Or do we pass back some other structure? (Native lists/maps? JSON strings?)

Do we ignore sorting/comparators? (Similar to Solr, I'm not sure people have defined a good sort for multi-valued items.)

-brian

Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/

On 3/30/12 12:01 PM, Daniel Doubleday daniel.double...@gmx.net wrote:

> Hm - but isn't this exactly the point? You don't want to move data off disk. But decomposing into columns will lead to more of that.
Re: Document storage
On Fri, Mar 30, 2012 at 6:01 PM, Daniel Doubleday daniel.double...@gmx.net wrote:

> But decomposing into columns will lead to more of that:
> - Total amount of serialized data is (in most cases a lot) larger than protobuffed / compressed version

At least with sstable compression, I would expect the difference not to be too big in practice.

> - If you do selective updates the document will be scattered over multiple ssts plus if you do sliced reads you can't optimize reads as opposed to the single column version that when updated is automatically superseding older versions so most reads will hit only one sst

But if you need to do selective updates, then a blob just doesn't work, so that comparison is moot.

Now, I don't think anyone pretended that you should never use blobs (whether that's protobuffed, jsoned, ...). If you don't need selective updates and having something as compact as possible on disk makes an important difference for you, sure, use blobs. The only argument is that you can already do that without any change to the core.

What we are saying is that for the case where you care more about schema flexibility (being able to do selective updates, to index on some subpart, etc.), we think that something like the map and list idea of CASSANDRA-3647 will probably be a more natural fit for the current CQL API.

--
Sylvain
Re: Document storage
> If you don't need selective updates and having something as compact as possible on disk makes an important difference for you, sure, use blobs. The only argument is that you can already do that without any change to the core.

The thing that we can't do today without changes to the core is index on subparts of some document format like Protobuf/JSON/etc. If Cassandra were to understand one of these formats, it could remove the need for manual management of an index.

On Fri, Mar 30, 2012 at 10:23 AM, Sylvain Lebresne sylv...@datastax.com wrote:

> What we are saying is that for the case where you care more about schema flexibility (being able to do selective updates, to index on some subpart, etc.), we think that something like the map and list idea of CASSANDRA-3647 will probably be a more natural fit for the current CQL API.
Re: Document storage
I'm actually doing something almost the same. I serialize my objects into byte[] using Jackson's SMILE format, compress that with Snappy, then store the byte[] in Cassandra. I actually created a simple Cassandra type for this, but I hit a wall with cassandra-cli: https://issues.apache.org/jira/browse/CASSANDRA-4081 Please vote on the JIRA if you are interested.

Validation is pretty simple: you just read the value and parse it with Jackson; if you don't get any exceptions, your JSON/SMILE is valid ;)

-- Drew

On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:

> I don't imagine sort is a meaningful operation on JSON data. As long as the sorting is consistent, I would think that should be sufficient.

On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

> Some work I did stores JSON blobs in columns. The question on a JSON type is how to sort it.

On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote:

> I don't speak for the project, but you might give it a day or two for people to respond and/or perhaps create a JIRA ticket. Seems like that's a reasonable data type that would get some traction - a JSON type. However, what would validation look like? That's one of the main reasons there are the data types and validators: in order to validate on insert.

On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:

> Any thoughts? I'd like to submit a patch, but only if it will be accepted.
>
> Thanks,
> Ben

On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann b...@benmccann.com wrote:

> Hi, I was wondering if it would be interesting to add some type of document-oriented data type. I've found it somewhat awkward to store document-oriented data in Cassandra today. I can make a JSON/Protobuf/Thrift, serialize it, and store it, but Cassandra cannot differentiate it from any other string or byte array. However, if my column validation_class could be a JsonType, that would allow tools to potentially do more interesting introspection on the column value. E.g. bug 3647 (https://issues.apache.org/jira/browse/CASSANDRA-3647) calls for supporting arbitrarily nested documents in CQL. Running a query against the JSON column in Pig is possible as well, but again in this use case it would be helpful to be able to encode in column metadata that the column is stored as JSON. For debugging, running nightly reports, etc. it would be quite useful compared to the opaque string and byte array types we have today. JSON is appealing because it would be easy to implement. Something like Thrift or Protocol Buffers would actually be interesting since they would be more space efficient. However, they would also be a bit more difficult to implement because of the extra typing information they provide. I'm hoping that with Cassandra 1.0's addition of compression, storing JSON is not too inefficient. Would there be interest in adding a JsonType? I could look at putting a patch together.
>
> Thanks,
> Ben
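Drew's serialize-compress-validate pipeline can be sketched in a few lines. This is not his actual code: the standard library's json and zlib stand in for Jackson SMILE and Snappy, and the function names are made up for illustration. The validation step is exactly his suggestion: try to parse, and treat any exception as invalid.

```python
import json
import zlib

def pack(obj):
    # Serialize then compress, mirroring the SMILE + Snappy pipeline.
    return zlib.compress(json.dumps(obj).encode("utf-8"))

def unpack(blob):
    # Decompress then parse, reversing pack().
    return json.loads(zlib.decompress(blob).decode("utf-8"))

def is_valid(blob):
    # Drew's validation idea: parse the value; any exception means invalid.
    try:
        unpack(blob)
        return True
    except (zlib.error, ValueError, UnicodeDecodeError):
        return False

record = {"id": 42, "tags": ["a", "b"]}
blob = pack(record)
assert unpack(blob) == record
assert is_valid(blob)
assert not is_valid(b"not-a-compressed-payload")
```

The byte[] produced by `pack` is what would be stored in the column value.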
Re: Document storage
Sounds awesome, Drew. Mind sharing your custom type? I just wrote a basic JSON type and did the validation the same way you did, but I don't have any SMILE support yet. It seems that if your type were committed to the Cassandra codebase, then the issue you ran into of the CLI only supporting built-in types would no longer be a problem for you (though fixing the issue anyway would be good, and I voted for it).

Btw, any reason you compress with Snappy yourself instead of just setting sstable_compression to SnappyCompressor (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression) and letting Cassandra do that part?

-Ben

On Wed, Mar 28, 2012 at 11:28 PM, Drew Kutcharian d...@venarc.com wrote:

> I'm actually doing something almost the same. I serialize my objects into byte[] using Jackson's SMILE format, compress that with Snappy, then store the byte[] in Cassandra.
Re: Document storage
Is there a reason you would prefer a JsonType over CASSANDRA-3647? It would seem the only thing a JSON type offers you is validation. 3647 takes it much further by deconstructing a JSON document using composite columns to flatten the document out, with the ability to access and update portions of the document (as well as reconstruct it).

On Wed, Mar 28, 2012 at 11:58 AM, Ben McCann b...@benmccann.com wrote:

> Hi, I was wondering if it would be interesting to add some type of document-oriented data type. I've found it somewhat awkward to store document-oriented data in Cassandra today.

--
http://twitter.com/tjake
Re: Document storage
Could you explain further how I would use CASSANDRA-3647? There's still very little documentation on composite columns, and it was not clear to me whether they could be used to store document-oriented data. Say, for example, that I had a document like:

user: {
  firstName: 'ben',
  skills: ['java', 'javascript', 'html'],
  education: {
    school: 'cmu',
    major: 'computer science'
  }
}

How would I flatten this to be stored, and then reconstruct the document?

On Thu, Mar 29, 2012 at 5:44 AM, Jake Luciani jak...@gmail.com wrote:

> Is there a reason you would prefer a JsonType over CASSANDRA-3647? It would seem the only thing a JSON type offers you is validation. 3647 takes it much further by deconstructing a JSON document using composite columns to flatten the document out, with the ability to access and update portions of the document (as well as reconstruct it).
Re: Document storage
Ben,

You can create a materialized path for each field in the document:

{
  [user, firstName]: ben,
  [user, skills, TimeUUID]: java,
  [user, skills, TimeUUID]: javascript,
  [user, skills, TimeUUID]: html,
  [user, education, school]: cmu,
  [user, education, major]: computer science
}

This way each field can be updated independently, and you can take sub-document slices with queries such as "give me everything under user/skills".

Rick

On Thursday, March 29, 2012 at 7:27 AM, Ben McCann wrote:

> Could you explain further how I would use CASSANDRA-3647? There's still very little documentation on composite columns, and it was not clear to me whether they could be used to store document-oriented data.
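Rick's materialized-path layout can be sketched as a pair of pure functions, one flattening a nested document into (composite path, value) columns and one rebuilding it. This is an illustration, not code from the thread: a list index stands in for the TimeUUID component so the example stays deterministic, and the function names are made up.

```python
def flatten(doc, path=()):
    # Flatten a nested document into (composite_path, value) pairs.
    cols = []
    if isinstance(doc, dict):
        for key, value in doc.items():
            cols.extend(flatten(value, path + (key,)))
    elif isinstance(doc, list):
        # Rick uses a TimeUUID per list element; a list index stands in here.
        for i, value in enumerate(doc):
            cols.extend(flatten(value, path + (i,)))
    else:
        cols.append((path, doc))
    return cols

def unflatten(cols):
    # Rebuild the nested document from its composite columns.
    root = {}
    for path, value in cols:
        node = root
        for part, _ in zip(path, path[1:]):
            node = node.setdefault(part, {})
        node[path[-1]] = value

    def fix(node):
        # Convert integer-keyed dicts (former lists) back into lists.
        if isinstance(node, dict):
            if node and all(isinstance(k, int) for k in node):
                return [fix(node[k]) for k in sorted(node)]
            return {k: fix(v) for k, v in node.items()}
        return node

    return fix(root)

user = {
    "firstName": "ben",
    "skills": ["java", "javascript", "html"],
    "education": {"school": "cmu", "major": "computer science"},
}
cols = flatten(user)
assert unflatten(cols) == user
# Sub-document slice: everything under user/skills.
skills = [v for path, v in cols if path[0] == "skills"]
```

The slice at the end is the analogue of a column-range read over the composite prefix; each individual column could likewise be overwritten without touching the rest of the document.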
RE: Document storage
It's not clear what 3647 actually is; there is no code attached and no real example in it.

Aside from that, the reason this would be useful to me (if we could get indexing of attributes working) is that I already have my data in JSON/Thrift/Protobuf. Depending on how large the data is, it isn't trivial to break it up into columns to insert and re-assemble from columns to read. Also, until we get multiple slice range reads, I can't read two different structures out of one row without getting all the other stuff between them, unless there are only two columns and I read them using column names, not slices.

As it is right now, I have to maintain custom indexes on all my attributes to be able to put Protobufs into columns and get some searching on them. It would be nice if I could drop all my custom indexing code and just tell Cassandra, "hey, index column.attr1.subattr2".

-Jeremiah

From: Jake Luciani [jak...@gmail.com]
Sent: Thursday, March 29, 2012 7:44 AM
To: dev@cassandra.apache.org
Subject: Re: Document storage

> Is there a reason you would prefer a JsonType over CASSANDRA-3647? It would seem the only thing a JSON type offers you is validation.
Re: Document storage
> Would there be interest in adding a JsonType?

What about checking that data inserted into a JsonType is valid JSON? How would you do it, and would the overhead be something we are concerned about, especially if the JSON string is large?
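Tyler's two questions, how to validate and what it costs, can be sketched with the standard-library parser. This is illustrative only: `json.loads` stands in for whatever parser a real JsonType validator would use, and the payload is a made-up large-ish document to get a rough feel for the overhead.

```python
import json
import time

def validate_json(raw_bytes):
    # Accept the insert only if the payload parses as JSON.
    try:
        json.loads(raw_bytes.decode("utf-8"))
        return True
    except (ValueError, UnicodeDecodeError):
        return False

# A deliberately large-ish payload, to probe validation cost.
payload = json.dumps({"items": list(range(10000))}).encode("utf-8")

start = time.perf_counter()
ok = validate_json(payload)
elapsed = time.perf_counter() - start

assert ok
assert not validate_json(b"{truncated")
print(f"validated {len(payload)} bytes in {elapsed * 1000:.2f} ms")
```

The cost scales with payload size, since validation is a full parse; whether that overhead matters on the insert path is exactly the concern Tyler raises.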
Re: Document storage
Creating materialized paths may well be a possible solution. If that were the solution the community agreed upon, then I would like it to be a standardized and well-documented best practice. I asked how to store a list of values on the user list (http://www.mail-archive.com/user@cassandra.apache.org/msg21274.html) and no one suggested [fieldName, TimeUUID]: fieldValue. It would be a huge pain right now to create materialized paths like this for each of my objects, so client library support would definitely be needed. And the client libraries should agree: if Astyanax and lazyboy both add support for materialized paths and I write an object to Cassandra with Astyanax, then I should be able to read it back with lazyboy. The benefit of using JSON/SMILE is that it's very clear that there's exactly one way to serialize and deserialize the data, and it's very easy. It's not clear to me that this is true for materialized paths.

On Thu, Mar 29, 2012 at 8:21 AM, Tyler Patterson tpatter...@datastax.com wrote:

> What about checking that data inserted into a JsonType is valid JSON? How would you do it, and would the overhead be something we are concerned about, especially if the JSON string is large?
Re: Document storage
The issue with these super complex types is that to do anything useful with them, you would need either scanners or coprocessors. As it stands right now, complex data like JSON is fairly opaque to Cassandra. Getting Cassandra to natively speak protobufs or whatever flavor-of-the-week serialization framework is hip right now would make the codebase very large. How is that field sorted? How is it indexed? This is starting to go very far against the schema-less NoSQL grain. Where does this end? Users wanting to store binary XML, index it, and feed Cassandra XPath queries?
Re: Document storage
On Thu, Mar 29, 2012 at 9:57 AM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote: It's not clear what 3647 actually is; there is no code attached, and no real example in it. Aside from that, the reason this would be useful to me (if we could get indexing of attributes working) is that I already have my data in JSON/Thrift/ProtoBuf. Depending how large the data is, it isn't trivial to break it up into columns to insert, and re-assemble into columns to read.

I don't understand the problem. Assuming Cassandra support for maps and lists, I could write a Python module that takes json (or thrift, or protobuf) objects and splits them into Cassandra rows by fields in a couple hours. I'm pretty sure this is essentially what Brian's REST api for Cassandra does now. I think this is a much better approach because that gives you the ability to update or retrieve just parts of objects efficiently, rather than making column values just blobs with a bunch of special case logic to introspect them. Which feels like a big step backwards to me.

-- Jonathan Ellis, Project Chair, Apache Cassandra, co-founder of DataStax, the source for professional Cassandra support, http://www.datastax.com
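[Editor's sketch: the "couple hours" Python module Jonathan describes might look roughly like the following. The dotted-path column naming and the function name are illustrative assumptions, not an agreed convention.]

```python
import json

def flatten(doc, prefix=""):
    # Split a decoded JSON object into (column_name, value) pairs,
    # joining nested field names with "." -- one column per leaf field.
    cols = []
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            cols.extend(flatten(value, name + "."))
        else:
            cols.append((name, value))
    return cols

doc = json.loads('{"name": "ben", "address": {"city": "SF", "zip": "94103"}}')
print(flatten(doc))
# -> [('name', 'ben'), ('address.city', 'SF'), ('address.zip', '94103')]
```

Each pair would then become a named column in the object's row, which is what makes partial reads and updates possible.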
RE: Document storage
But it isn't special case logic. The current AbstractType and indexing of abstract types would, for the most part, already support this; someone just has to write the code for a JSONType or ProtoBufType. The problem isn't writing the code to break objects up, the problem is encode/decode time. Encode/decode to Thrift is already a significant portion of the write timeline, and adding object-to-column encode/decode on top of that makes it even longer. For a read-heavy load that wants the JSON/proto served to clients as-is, an increase in the write timeline to parse/index the blob is probably acceptable, so that you don't pay the re-assembly penalty every time you hit the database for that object. But once we get multi-range slicing, I think the break-it-up-into-multiple-columns approach will be best for most people.

That is the other problem I have with doing the break-into-columns thing right now: I either have to use Super Columns and lose indexing (so why did I break them up?), or I can't get multiple objects at once without pulling a huge slice from o1 start to o5 end and then throwing away the majority of the data I pulled back that doesn't belong to o1 and o5.

-Jeremiah
Re: Document storage
Hi Ben, Sure, there's nothing really to it, but I'll email it to you. As for why I'm using Snappy on the type instead of sstable_compression: when you set sstable_compression, the compression happens on the Cassandra nodes, and I see two advantages with my approach:
1. Saving extra CPU usage on the Cassandra nodes, since compression/decompression can easily be done on the client nodes, where there is plenty of idle CPU time.
2. Saving network bandwidth, since you're sending over a compressed byte[].
One thing to note about my approach is that when I define the schema in Cassandra, I define the columns as byte[] and not my custom type, and I do all the conversion on the client side. -- Drew

On Mar 29, 2012, at 12:04 AM, Ben McCann wrote: Sounds awesome Drew. Mind sharing your custom type? I just wrote a basic JSON type and did the validation the same way you did, but I don't have any SMILE support yet. It seems that if your type were committed to the Cassandra codebase, then the issue you ran into of the CLI only supporting built-in types would no longer be a problem for you (though fixing the issue anyway would be good, and I voted for it). Btw, any reason you compress it with Snappy yourself instead of just setting sstable_compression to SnappyCompressor (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression) and letting Cassandra do that part? -Ben

On Wed, Mar 28, 2012 at 11:28 PM, Drew Kutcharian d...@venarc.com wrote: I'm actually doing something almost the same. I serialize my objects into byte[] using Jackson's SMILE format, then compress it using Snappy, then store the byte[] in Cassandra. I actually created a simple Cassandra type for this, but I hit a wall with cassandra-cli: https://issues.apache.org/jira/browse/CASSANDRA-4081 Please vote on the JIRA if you are interested.
Validation is pretty simple: you just need to read the value and parse it using Jackson; if you don't get any exceptions, your JSON/SMILE is valid ;) -- Drew

On Mar 28, 2012, at 9:28 PM, Ben McCann wrote: I don't imagine sort is a meaningful operation on JSON data. As long as the sorting is consistent, I would think that should be sufficient.

On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Some work I did stores JSON blobs in columns. The question on a JSON type is how to sort it.

On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: I don't speak for the project, but you might give it a day or two for people to respond and/or perhaps create a jira ticket. Seems like that's a reasonable data type that would get some traction - a json type. However, what would validation look like? That's one of the main reasons there are the data types and validators, in order to validate on insert.
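[Editor's sketch: the parse-to-validate approach Drew describes (Jackson in his Java setup) is shown here with Python's stdlib json module standing in for Jackson; the function name is hypothetical.]

```python
import json

def validate_json(raw: bytes) -> bool:
    # Validate by parsing: if the parser raises, reject the value.
    # This is the whole trick -- no schema needed, just a well-formedness check.
    try:
        json.loads(raw)
        return True
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False

print(validate_json(b'{"a": 1}'))   # well-formed
print(validate_json(b'{"a": '))     # truncated, rejected
```

The overhead question from the thread remains: this is a full parse of the value on every insert, so cost grows linearly with document size.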
Re: Document storage
I agree with Edward here: the simpler we keep the core, the better. I think all the ser/deser and conversions should happen on the client side. -- Drew
Re: Document storage
I think this is a much better approach because that gives you the ability to update or retrieve just parts of objects efficiently, rather than making column values just blobs with a bunch of special case logic to introspect them. Which feels like a big step backwards to me.

Unless your access pattern involves reading/writing the whole document each time. In that case you're better off serializing the whole document and storing it in a column as a byte[] without incurring the overhead of column indexes. Right?
Re: Document storage
On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com wrote: Unless your access pattern involves reading/writing the whole document each time. In that case you're better off serializing the whole document and storing it in a column as a byte[] without incurring the overhead of column indexes. Right?

Hmm, not sure what you're thinking of there. If you mean the index that's part of the row header for random access within a row, then no, serializing to byte[] doesn't save you anything. If you mean secondary indexes, don't declare any if you don't want any. :)

Just telling C* to store a byte[] *will* be slightly lighter-weight than giving it named columns, but we're talking negligible compared to the overhead of actually moving the data on or off disk in the first place. Not even close to being worth giving up being able to deal with your data from standard tools like cqlsh, IMO.
Re: Document storage
Yes, I meant the row header index. What I have done is that I'm storing an object (i.e. a UserProfile) where you read or write it as a whole (a user updates their user details in a single page in the UI). So I serialize that object into binary JSON using the SMILE format, then compress it using Snappy on the client side. As far as Cassandra cares, it's storing a byte[]. On the client side, I'm using cassandra-cli with a custom type that knows how to turn a byte[] into JSON text and back. The only issue was CASSANDRA-4081, where assume doesn't work with custom types. If CASSANDRA-4081 gets fixed, I'll get the best of both worlds.

Also, the advantages of this vs. the Thrift-based Super Column families are:
1. Saving extra CPU usage on the Cassandra nodes, since serialize/deserialize and compression/decompression happen on the client nodes, where there is plenty of idle CPU time.
2. Saving network bandwidth, since I'm sending over a compressed byte[].

-- Drew
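[Editor's sketch: a stdlib-only version of the client-side pipeline Drew describes, with json and zlib standing in for the SMILE and Snappy he actually uses; function names are illustrative.]

```python
import json
import zlib

# Whole-document pipeline: serialize, then compress, on the client.
# Cassandra only ever sees the resulting opaque byte[] in one column.

def to_column_value(doc: dict) -> bytes:
    return zlib.compress(json.dumps(doc).encode("utf-8"))

def from_column_value(blob: bytes) -> dict:
    return json.loads(zlib.decompress(blob).decode("utf-8"))

profile = {"user": "drew", "email": "d...@venarc.com"}
blob = to_column_value(profile)          # what gets written to Cassandra
assert from_column_value(blob) == profile  # round-trips on read
```

The trade-off the thread debates is visible here: one column means one read/write for the whole document, but no way to fetch or update a single field server-side.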
Re: Document storage
Jonathan, I asked Brian about his REST API (https://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas9C8Us) and he said he does not take the json objects and split them, because the client libraries do not agree on implementations. This was exactly my concern as well with this solution. I would be perfectly happy to do it this way instead of using JSON if it were standardized. The reason I suggested JSON is that it is standardized. As far as I can tell, Cassandra doesn't support maps and lists in a standardized way today, which is the root of my problem. -Ben
Re: Document storage
On Thu, Mar 29, 2012 at 2:06 PM, Ben McCann b...@benmccann.com wrote: As far as I can tell, Cassandra doesn't support maps and lists in a standardized way today, which is the root of my problem.

I'm pretty serious about adding those for 1.2, for what that's worth. (If you want to jump in and help code that up, so much the better.)
Re: Document storage
Jonathan, I was actually going to take this up with Nate McCall a few weeks back. I think it might make sense to get the client development community together (Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.). I agree whole-heartedly that it shouldn't go into the database, for all the reasons you point out. If we can all decide on some standards for data storage (e.g. composite types), indexing strategies, etc., we can provide higher-level functions through the client libraries and also provide interoperability between them (without bloating Cassandra). CCing Nate. Nate, thoughts? I wouldn't mind coordinating/facilitating the conversation if we know who should be involved. -brian

Brian O'Neill, Lead Architect, Software Development, Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406, p: 215.588.6024, blog: http://weblogs.java.net/blog/boneill42/, blog: http://brianoneill.blogspot.com/
Re: Document storage
Thanks Jonathan. The only reason I suggested JSON was because it already has support for lists. Native support for lists in Cassandra would more than satisfy me. Are there any existing proposals or a bug I can follow? I'm not familiar with the Cassandra codebase, so I'm not entirely sure how helpful I can be, but I'd certainly be interested in taking a look to see what's required. -Ben
Re: Document storage
Jonathan, We store JSON as our column values. I'd love to see support for maps and lists. If I get some time this weekend, I'll take a look to see what is required. It doesn't seem like it would be that hard. -brian

On 3/29/12 3:18 PM, Jonathan Ellis jbel...@gmail.com wrote: I'm pretty serious about adding those for 1.2, for what that's worth. (If you want to jump in and help code that up, so much the better.)
Re: Document storage
I kind of hijacked https://issues.apache.org/jira/browse/CASSANDRA-3647 (Sylvain suggests we start with (non-nested) lists, maps, and sets; I agree that this is a great 80/20 approach to the problem), but we could split it out to another ticket.

On Thu, Mar 29, 2012 at 2:24 PM, Ben McCann b...@benmccann.com wrote:

Thanks Jonathan. The only reason I suggested JSON was that it already has support for lists. Native support for lists in Cassandra would more than satisfy me. Are there any existing proposals or a bug I can follow? I'm not familiar with the Cassandra codebase, so I'm not entirely sure how helpful I can be, but I'd certainly be interested in taking a look to see what's required.

-Ben

On Thu, Mar 29, 2012 at 12:19 PM, Brian O'Neill b...@alumni.brown.edu wrote:

Jonathan, I was actually going to take this up with Nate McCall a few weeks back. I think it might make sense to get the client development community together (Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.). I agree wholeheartedly that it shouldn't go into the database, for all the reasons you point out. If we can all decide on some standards for data storage (e.g. composite types), indexing strategies, etc., we can provide higher-level functions through the client libraries and also provide interoperability between them (without bloating Cassandra). CCing Nate. Nate, thoughts? I wouldn't mind coordinating/facilitating the conversation if we know who should be involved.

-brian

Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/

On 3/29/12 3:06 PM, Ben McCann b...@benmccann.com wrote:

Jonathan, I asked Brian about his REST API (https://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas9C8Us) and he said he does not take the JSON objects and split them, because the client libraries do not agree on implementations. This was exactly my concern with this solution as well. I would be perfectly happy to do it this way instead of using JSON if it were standardized; the reason I suggested JSON is that it is standardized. As far as I can tell, Cassandra doesn't support maps and lists in a standardized way today, which is the root of my problem.

-Ben

On Thu, Mar 29, 2012 at 11:30 AM, Drew Kutcharian d...@venarc.com wrote:

Yes, I meant the row header index. What I have done is that I'm storing an object (e.g. a UserProfile) which you read or write as a whole (a user updates their user details in a single page in the UI). So I serialize that object into binary JSON using the SMILE format, then compress it using Snappy on the client side, so as far as Cassandra cares it's storing a byte[]. On the client side, I'm using cassandra-cli with a custom type that knows how to turn a byte[] into JSON text and back. The only issue was CASSANDRA-4081, where "assume" doesn't work with custom types. If CASSANDRA-4081 gets fixed, I'll get the best of both worlds. The advantages of this vs. the Thrift-based super column families are: 1. Saving extra CPU usage on the Cassandra nodes, since serialization/deserialization and compression/decompression happen on the client nodes, where there is plenty of idle CPU time. 2. Saving network bandwidth, since I'm sending a compressed byte[] over the wire.

-- Drew
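Drew's client-side pipeline (serialize the whole document, compress it, and hand Cassandra an opaque byte[]) can be sketched as follows. This is an illustrative stand-in, not his actual code: stdlib json replaces SMILE and zlib replaces Snappy, since both substitutes ship with Python; the function names are ours.

```python
import json
import zlib


def serialize_profile(profile: dict) -> bytes:
    """Serialize then compress on the client, so Cassandra only ever
    stores an opaque byte[].  (json/zlib stand in for SMILE/Snappy.)"""
    return zlib.compress(json.dumps(profile, sort_keys=True).encode("utf-8"))


def deserialize_profile(blob: bytes) -> dict:
    """Reverse the pipeline: decompress, then parse back into an object."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

The trade-off is exactly the one debated above: CPU and bandwidth are saved, but the stored value is opaque to the database and to tools like cqlsh.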
Re: Document storage
Cool. How were you thinking we should store the data? As a standardized composite column (e.g., potentially a list as [fieldName, TimeUUID]: fieldValue, and a set as [fieldName, fieldValue]:)? Or as a new column type?

On Thu, Mar 29, 2012 at 12:35 PM, Jonathan Ellis jbel...@gmail.com wrote:

I kind of hijacked https://issues.apache.org/jira/browse/CASSANDRA-3647 (Sylvain suggests we start with (non-nested) lists, maps, and sets; I agree that this is a great 80/20 approach to the problem), but we could split it out to another ticket.
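Ben's bracket notation above (a list element stored as a [fieldName, TimeUUID]: fieldValue composite column) can be sketched as follows. This is one possible reading of his proposal, not an agreed standard; the column is modeled as a ((field, uuid), value) tuple, and the time-based UUID component gives list elements a stable insertion order.

```python
import uuid


def list_append_column(field_name: str, value: str):
    """Encode one list element as a composite column: the column *name*
    is the (field_name, TimeUUID) pair, the column *value* is the element."""
    return (field_name, uuid.uuid1()), value


def read_list(columns, field_name: str):
    """Slice one field's elements out of a row's composite columns,
    ordered by the TimeUUID component (i.e., insertion order)."""
    ordered = sorted(columns, key=lambda col: col[0][1].time)
    return [value for (name, ts), value in ordered if name == field_name]
```

Because the composite name is unique per element, appending never overwrites an earlier element, which is what makes this layout behave like a list rather than a single overwritten blob.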
Document storage
Hi, I was wondering if it would be interesting to add some type of document-oriented data type. I've found it somewhat awkward to store document-oriented data in Cassandra today. I can build a JSON/Protobuf/Thrift object, serialize it, and store it, but Cassandra cannot differentiate it from any other string or byte array. However, if my column's validation_class could be a JsonType, that would allow tools to potentially do more interesting introspection on the column value. E.g., bug 3647 (https://issues.apache.org/jira/browse/CASSANDRA-3647) calls for supporting arbitrarily nested documents in CQL. Running a query against the JSON column in Pig is possible as well, but again, in this use case it would be helpful to be able to encode in the column metadata that the column is stored as JSON. For debugging, running nightly reports, etc., it would be quite useful compared to the opaque string and byte array types we have today.

JSON is appealing because it would be easy to implement. Something like Thrift or Protocol Buffers would also be interesting, since they would be more space efficient; however, they would be a bit more difficult to implement because of the extra typing information they provide. I'm hoping that with Cassandra 1.0's addition of compression, storing JSON is not too inefficient.

Would there be interest in adding a JsonType? I could look at putting a patch together.

Thanks,
Ben
Re: Document storage
Any thoughts? I'd like to submit a patch, but only if it will be accepted.

Thanks,
Ben
Re: Document storage
I don't speak for the project, but you might give it a day or two for people to respond, and/or perhaps create a JIRA ticket. A JSON type seems like a reasonable data type that would get some traction. However, what would validation look like? That's one of the main reasons the data types and validators exist: to validate on insert.
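To Jeremy's question, the natural answer is that a JsonType's validator would simply reject any value that is not well-formed UTF-8 JSON. Cassandra's real validators are Java AbstractType subclasses; the sketch below only illustrates the validation logic itself, in stdlib Python, with a function name of our own choosing.

```python
import json


def validate_json_column(value: bytes) -> None:
    """What a JsonType's validate-on-insert could do: accept well-formed
    UTF-8 JSON, raise on anything else.  Returns None on success, mirroring
    the validate-or-throw convention of Cassandra's column validators."""
    try:
        json.loads(value.decode("utf-8"))
    except (UnicodeDecodeError, ValueError) as exc:
        raise ValueError("invalid JSON for JsonType column: %s" % exc)
```

Validation this cheap happens once per write, which is negligible next to the cost of the write itself, yet it is what lets tools like cqlsh and Pig trust the column's contents later.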
Re: Document storage
Sounds interesting to me. I looked into adding protocol buffer support at one point, and it didn't look like it would be too much work. The tricky part was that I also wanted to add indexing support for attributes of the inserted protocol buffers. That looked a little trickier, but still not impossible. Other stuff came up, though, and I never got around to actually writing any code. JSON support would be nice, especially if you figured out how to get built-in indexing of the attributes inside the JSON to work =).

-Jeremiah
Re: Document storage
On Wed, Mar 28, 2012 at 6:59 PM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote:

JSON support would be nice, especially if you figured out how to get built-in indexing of the attributes inside the JSON to work =).

Also, for whatever it's worth, it should be trivial to add support for Smile, the binary JSON serialization (http://wiki.fasterxml.com/SmileFormatSpec), since its logical data structure is pure JSON, with no extensions or subsetting. The main Java implementation is by the Jackson project, but there is also a C codec (https://github.com/pierre/libsmile) and prototypes for PHP and Ruby bindings as well. For all data it's a bit faster and a bit more compact: about 30% for individual items, and more (40-70%) for data sequences, due to optional back-referencing. JSON and Smile can be auto-detected from the first 4 bytes or so, reliably and efficiently, so one should be able to add this either transparently or explicitly. One could even transcode things on the fly: store as Smile, expose filtered results as JSON (and accept JSON, or both). This could reduce storage cost while keeping the benefits of a flexible data format.

-+ Tatu +-
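The auto-detection Tatu describes works because Smile payloads start with a fixed header: the bytes ':' ')' '\n' followed by a version/flags byte, whereas JSON text starts (after optional whitespace) with a small set of characters. A minimal sketch of that heuristic, with a function name of our own:

```python
def detect_format(data: bytes) -> str:
    """Guess whether a payload is Smile or JSON from its leading bytes."""
    # Smile's 4-byte header begins with ":)\n" (0x3A 0x29 0x0A), then a
    # version/flags byte -- a sequence that cannot begin a JSON document.
    if len(data) >= 4 and data[:3] == b":)\n":
        return "smile"
    # JSON text starts, after optional whitespace, with one of a few
    # characters: an object, array, string, or number.
    first = data.lstrip(b" \t\r\n")[:1]
    if first in (b"{", b"[", b'"', b"-") or first.isdigit():
        return "json"
    return "unknown"
```

This is what makes the transparent transcoding idea practical: a server (or client library) can accept either encoding on the same column without any out-of-band flag.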
Re: Document storage
Some work I did stores JSON blobs in columns. The question for a JSON type is how to sort it.
Re: Document storage
I don't imagine sort is a meaningful operation on JSON data. As long as the sorting is consistent, I would think that should be sufficient.

On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

Some work I did stores JSON blobs in columns. The question for a JSON type is how to sort it.
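Ben's "consistent, even if not meaningful" ordering can be obtained by comparing canonical serializations: re-serialize each document with sorted keys and fixed separators, then compare the resulting bytes. A sketch under that assumption (the helper names are ours, and key order is the only JSON equivalence this canonicalization normalizes):

```python
import json


def canonical_bytes(doc: str) -> bytes:
    """Deterministic re-serialization: sorted keys, fixed separators.
    Documents that differ only in key order or whitespace map to the
    same byte string."""
    return json.dumps(json.loads(doc),
                      sort_keys=True, separators=(",", ":")).encode("utf-8")


def compare_json(a: str, b: str) -> int:
    """Total, consistent ordering over JSON documents: -1, 0, or 1."""
    ca, cb = canonical_bytes(a), canonical_bytes(b)
    return (ca > cb) - (ca < cb)
```

The resulting order is lexicographic over canonical bytes, which is arbitrary as a semantic ordering, but it is total and stable, which is all a comparator-driven store needs.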