Re: Document storage

2012-03-30 Thread Daniel Doubleday
 Just telling C* to store a byte[] *will* be slightly lighter-weight
 than giving it named columns, but we're talking negligible compared to
 the overhead of actually moving the data on or off disk in the first
 place. 
Hm - but isn't this exactly the point? You don't want to move data off disk.
But decomposing into columns will lead to more of that:

- The total amount of serialized data is (in most cases a lot) larger than the
protobuffed / compressed version.
- If you do selective updates, the document will be scattered over multiple
SSTables; and if you do sliced reads you can't optimize them, as opposed to the
single-column version, where an update automatically supersedes older versions,
so most reads will hit only one SSTable.

All these reads make up the hot dataset. If it fits in the page cache you're fine. If
it doesn't, you need to buy more iron.

I really could not resist replying, because your statement seems to be contrary to all
our tests / learnings.

Cheers,
Daniel
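For a rough illustration of the size argument above, here is a sketch using Python's json and zlib as stand-ins for the protobuf-plus-compression path Daniel describes (the toy document, the column-naming scheme, and the codec choices are all assumptions; real ratios depend on your data):

```python
import json
import zlib

# One document, stored two ways (json + zlib stand in for the
# protobuf + compression pipeline; ratios depend on real data).
doc = {"firstName": "ben",
       "skills": ["java", "javascript", "html"],
       "education": {"school": "cmu", "major": "computer science"}}

# Single-column version: the whole document as one compressed blob.
blob = zlib.compress(json.dumps(doc).encode("utf-8"))

# Decomposed version: every field becomes a named column, so each
# column name and each value is stored separately on disk.
columns = {
    "firstName": b"ben",
    "skills:0": b"java",
    "skills:1": b"javascript",
    "skills:2": b"html",
    "education:school": b"cmu",
    "education:major": b"computer science",
}
column_bytes = sum(len(k.encode("utf-8")) + len(v) for k, v in columns.items())

print("blob:", len(blob), "bytes; columns:", column_bytes, "bytes")

# A whole-document read of the blob is a single-column fetch that
# round-trips cleanly.
assert json.loads(zlib.decompress(blob)) == doc
```

On a toy document the difference is small; Daniel's point is that on real data the per-column name and timestamp overhead adds up, and selective updates spread the columns across SSTables.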

From dev list:

Re: Document storage
On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com wrote:
 I think this is a much better approach because that gives you the
 ability to update or retrieve just parts of objects efficiently,
 rather than making column values just blobs with a bunch of special
 case logic to introspect them.  Which feels like a big step backwards
 to me.

 Unless your access pattern involves reading/writing the whole document each 
 time. In
that case you're better off serializing the whole document and storing it in a 
column as a
byte[] without incurring the overhead of column indexes. Right?

Hmm, not sure what you're thinking of there.

If you mean the index that's part of the row header for random
access within a row, then no, serializing to byte[] doesn't save you
anything.

If you mean secondary indexes, don't declare any if you don't want any. :)

Just telling C* to store a byte[] *will* be slightly lighter-weight
than giving it named columns, but we're talking negligible compared to
the overhead of actually moving the data on or off disk in the first
place.  Not even close to being worth giving up being able to deal
with your data from standard tools like cqlsh, IMO.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com



Re: Document storage

2012-03-30 Thread Brian O'Neill

Do we also need to consider the client API?
If we don't adjust Thrift, the client just gets bytes back, right?
The client is on their own to marshal them back into a structure.  In that
case, it seems like we would want to choose a standard that is efficient
and for which there are common libraries.  Protobuf seems to fit the bill
here.

Or do we pass back some other structure?  (Native lists/maps? JSON
strings?)

Do we ignore sorting/comparators?
(similar to SOLR, I'm not sure people have defined a good sort for
multi-valued items)

-brian

 
Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024 | blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



On 3/30/12 12:01 PM, Daniel Doubleday daniel.double...@gmx.net wrote:

 Just telling C* to store a byte[] *will* be slightly lighter-weight
 than giving it named columns, but we're talking negligible compared to
 the overhead of actually moving the data on or off disk in the first
 place. 
Hm - but isn't this exactly the point? You don't want to move data off
disk.
But decomposing into columns will lead to more of that:

- Total amount of serialized data is (in most cases a lot) larger than
protobuffed / compressed version
- If you do selective updates the document will be scattered over
multiple ssts plus if you do sliced reads you can't optimize reads as
opposed to the single column version that when updated is automatically
superseding older versions so most reads will hit only one sst

All these reads make up the hot dataset. If it fits in the page cache you're
fine. If it doesn't, you need to buy more iron.

Really could not resist because your statement seems to be contrary to
all our tests / learnings.

Cheers,
Daniel

From dev list:

Re: Document storage
On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com wrote:
 I think this is a much better approach because that gives you the
 ability to update or retrieve just parts of objects efficiently,
 rather than making column values just blobs with a bunch of special
 case logic to introspect them.  Which feels like a big step backwards
 to me.

 Unless your access pattern involves reading/writing the whole document
each time. In
that case you're better off serializing the whole document and storing it
in a column as a
byte[] without incurring the overhead of column indexes. Right?

Hmm, not sure what you're thinking of there.

If you mean the index that's part of the row header for random
access within a row, then no, serializing to byte[] doesn't save you
anything.

If you mean secondary indexes, don't declare any if you don't want any. :)

Just telling C* to store a byte[] *will* be slightly lighter-weight
than giving it named columns, but we're talking negligible compared to
the overhead of actually moving the data on or off disk in the first
place.  Not even close to being worth giving up being able to deal
with your data from standard tools like cqlsh, IMO.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com





Re: Document storage

2012-03-30 Thread Sylvain Lebresne
On Fri, Mar 30, 2012 at 6:01 PM, Daniel Doubleday
daniel.double...@gmx.net wrote:
 But decomposing into columns will lead to more of that:

 - Total amount of serialized data is (in most cases a lot) larger than 
 protobuffed / compressed version

At least with sstable compression, I would expect the difference to
not be too big in practice.

 - If you do selective updates the document will be scattered over multiple 
 ssts plus if you do sliced reads you can't optimize reads as opposed to the 
 single column version that when updated is automatically superseding older 
 versions so most reads will hit only one sst

But if you need to do selective updates, then a blob just doesn't work
so that comparison is moot.

Now, I don't think anyone pretended that you should never use blobs
(whether that's protobuffed, jsoned, ...). If you don't need selective
updates and having something as compact as possible on disk makes an
important difference for you, sure, do use blobs. The only argument is
that you can already do that without any change to the core. What we
are saying is that for the case where you care more about schema
flexibility (being able to do selective updates, to index on some
subpart, etc...) then we think that something like the map and list
idea of CASSANDRA-3647 will probably be a more natural fit to the
current CQL API.

--
Sylvain


 All these reads make up the hot dataset. If it fits in the page cache you're fine. If
 it doesn't, you need to buy more iron.

 Really could not resist because your statement seems to be contrary to all 
 our tests / learnings.

 Cheers,
 Daniel

 From dev list:

 Re: Document storage
 On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com wrote:
 I think this is a much better approach because that gives you the
 ability to update or retrieve just parts of objects efficiently,
 rather than making column values just blobs with a bunch of special
 case logic to introspect them.  Which feels like a big step backwards
 to me.

 Unless your access pattern involves reading/writing the whole document each 
 time. In
 that case you're better off serializing the whole document and storing it in 
 a column as a
 byte[] without incurring the overhead of column indexes. Right?

 Hmm, not sure what you're thinking of there.

 If you mean the index that's part of the row header for random
 access within a row, then no, serializing to byte[] doesn't save you
 anything.

 If you mean secondary indexes, don't declare any if you don't want any. :)

 Just telling C* to store a byte[] *will* be slightly lighter-weight
 than giving it named columns, but we're talking negligible compared to
 the overhead of actually moving the data on or off disk in the first
 place.  Not even close to being worth giving up being able to deal
 with your data from standard tools like cqlsh, IMO.

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



Re: Document storage

2012-03-30 Thread Ben McCann

 If you don't need selected updates and having something as compact as
 possible on disk make a important difference for you, sure, do use blobs.
 The only argument is that you can already do that without any change to
 the core.


The thing that we can't do today without changes to the core is index on
subparts of some document format like Protobuf/JSON/etc.  If Cassandra were
to understand one of these formats, it could remove the need for manual
index management.


On Fri, Mar 30, 2012 at 10:23 AM, Sylvain Lebresne sylv...@datastax.comwrote:

 On Fri, Mar 30, 2012 at 6:01 PM, Daniel Doubleday
 daniel.double...@gmx.net wrote:
  But decomposing into columns will lead to more of that:
 
  - Total amount of serialized data is (in most cases a lot) larger than
 protobuffed / compressed version

 At least with sstable compression, I would expect the difference to
 not be too big in practice.

  - If you do selective updates the document will be scattered over
 multiple ssts plus if you do sliced reads you can't optimize reads as
 opposed to the single column version that when updated is automatically
 superseding older versions so most reads will hit only one sst

 But if you need to do selective updates, then a blob just doesn't work
 so that comparison is moot.

 Now I don't think anyone pretended that you should never use blobs
 (whether that's protobuffed, jsoned, ...). If you don't need selected
 updates and having something as compact as possible on disk make a
 important difference for you, sure, do use blobs. The only argument is
 that you can already do that without any change to the core. What we
 are saying is that for the case where you care more about schema
 flexibility (being able to do selective updates, to index on some
 subpart, etc...) then we think that something like the map and list
 idea of CASSANDRA-3647 will probably be a more natural fit to the
 current CQL API.

 --
 Sylvain

 
  All these reads make up the hot dataset. If it fits in the page cache you're
 fine. If it doesn't, you need to buy more iron.
 
  Really could not resist because your statement seems to be contrary to
 all our tests / learnings.
 
  Cheers,
  Daniel
 
  From dev list:
 
  Re: Document storage
  On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com
 wrote:
  I think this is a much better approach because that gives you the
  ability to update or retrieve just parts of objects efficiently,
  rather than making column values just blobs with a bunch of special
  case logic to introspect them.  Which feels like a big step backwards
  to me.
 
  Unless your access pattern involves reading/writing the whole document
 each time. In
  that case you're better off serializing the whole document and storing
 it in a column as a
  byte[] without incurring the overhead of column indexes. Right?
 
  Hmm, not sure what you're thinking of there.
 
  If you mean the index that's part of the row header for random
  access within a row, then no, serializing to byte[] doesn't save you
  anything.
 
  If you mean secondary indexes, don't declare any if you don't want any.
 :)
 
  Just telling C* to store a byte[] *will* be slightly lighter-weight
  than giving it named columns, but we're talking negligible compared to
  the overhead of actually moving the data on or off disk in the first
  place.  Not even close to being worth giving up being able to deal
  with your data from standard tools like cqlsh, IMO.
 
  --
  Jonathan Ellis
  Project Chair, Apache Cassandra
  co-founder of DataStax, the source for professional Cassandra support
  http://www.datastax.com
 



Re: Document storage

2012-03-29 Thread Drew Kutcharian
I'm actually doing something almost the same. I serialize my objects into 
byte[] using Jackson's SMILE format, then compress it using Snappy then store 
the byte[] in Cassandra. I actually created a simple Cassandra Type for this 
but I hit a wall with cassandra-cli:

https://issues.apache.org/jira/browse/CASSANDRA-4081

Please vote on the JIRA if you are interested.

Validation is pretty simple: you just need to read the value and parse it using
Jackson; if you don't get any exceptions, your JSON/SMILE is valid ;)

-- Drew
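Drew's pipeline (serialize with Jackson SMILE, compress with Snappy, store the byte[], validate by parsing) can be sketched as follows; json and zlib stand in for SMILE and Snappy here, since the shape of the pipeline is the point, not the codecs:

```python
import json
import zlib

def encode(obj) -> bytes:
    # Serialize (SMILE in Drew's setup; plain JSON here), then compress
    # (Snappy there; zlib here) before handing the bytes to Cassandra.
    return zlib.compress(json.dumps(obj).encode("utf-8"))

def is_valid(raw: bytes) -> bool:
    # Drew's validation rule: decompress and parse; if no exception is
    # raised, the stored value is well-formed.
    try:
        json.loads(zlib.decompress(raw))
        return True
    except (zlib.error, ValueError):
        return False

value = encode({"user": {"firstName": "ben"}})
assert is_valid(value)
assert not is_valid(b"not a compressed document")
```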



On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:

 I don't imagine sort is a meaningful operation on JSON data.  As long as
 the sorting is consistent I would think that should be sufficient.
 
 
 On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.comwrote:
 
 Some work I did stores JSON blobs in columns. The question on JSON
 type is how to sort it.
 
 On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
 jeremy.hanna1...@gmail.com wrote:
 I don't speak for the project, but you might give it a day or two for
 people to respond and/or perhaps create a jira ticket.  Seems like that's a
 reasonable data type that would get some traction - a json type.  However,
 what would validation look like?  That's one of the main reasons there are
 the data types and validators, in order to validate on insert.
 
 On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
 
 Any thoughts?  I'd like to submit a patch, but only if it will be
 accepted.
 
 Thanks,
 Ben
 
 
 On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann b...@benmccann.com wrote:
 
 Hi,
 
 I was wondering if it would be interesting to add some type of
 document-oriented data type.
 
 I've found it somewhat awkward to store document-oriented data in
 Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
 store it, but Cassandra cannot differentiate it from any other string
 or
 byte array.  However, if my column validation_class could be a JsonType
 that would allow tools to potentially do more interesting
 introspection on
 the column value.  E.g. bug 3647
https://issues.apache.org/jira/browse/CASSANDRA-3647 calls for supporting
 arbitrarily nested documents in CQL.  Running a
 query against the JSON column in Pig is possible as well, but again in
 this
 use case it would be helpful to be able to encode in column metadata
 that
 the column is stored as JSON.  For debugging, running nightly reports,
 etc.
 it would be quite useful compared to the opaque string and byte array
 types
 we have today.  JSON is appealing because it would be easy to
 implement.
 Something like Thrift or Protocol Buffers would actually be interesting
 since they would be more space efficient.  However, they would also be
 a
 bit more difficult to implement because of the extra typing information
 they provide.  I'm hoping with Cassandra 1.0's addition of compression
 that
 storing JSON is not too inefficient.
 
 Would there be interest in adding a JsonType?  I could look at putting
 a
 patch together.
 
 Thanks,
 Ben
 
 
 
 



Re: Document storage

2012-03-29 Thread Ben McCann
Sounds awesome Drew.  Mind sharing your custom type?  I just wrote a basic
JSON type and did the validation the same way you did, but I don't have any
SMILE support yet.  It seems that if your type were committed to the
Cassandra codebase then the issue you ran into of the CLI only supporting
built-in types would no longer be a problem for you (though fixing the
issue anyway would be good and I voted for it).  Btw, any reason you
compress it with Snappy yourself instead of just setting sstable_compression
to SnappyCompressor
(http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression) and
letting Cassandra do that part?

-Ben


On Wed, Mar 28, 2012 at 11:28 PM, Drew Kutcharian d...@venarc.com wrote:

 I'm actually doing something almost the same. I serialize my objects into
 byte[] using Jackson's SMILE format, then compress it using Snappy then
 store the byte[] in Cassandra. I actually created a simple Cassandra Type
 for this but I hit a wall with cassandra-cli:

 https://issues.apache.org/jira/browse/CASSANDRA-4081

 Please vote on the JIRA if you are interested.

 Validation is pretty simple, you just need to read the value and parse it
 using Jackson, if you don't get any exceptions you're JSON/Smile is valid ;)

 -- Drew



 On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:

  I don't imagine sort is a meaningful operation on JSON data.  As long as
  the sorting is consistent I would think that should be sufficient.
 
 
  On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
 
  Some work I did stores JSON blobs in columns. The question on JSON
  type is how to sort it.
 
  On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
  jeremy.hanna1...@gmail.com wrote:
  I don't speak for the project, but you might give it a day or two for
  people to respond and/or perhaps create a jira ticket.  Seems like
 that's a
  reasonable data type that would get some traction - a json type.
  However,
  what would validation look like?  That's one of the main reasons there
 are
  the data types and validators, in order to validate on insert.
 
  On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
 
  Any thoughts?  I'd like to submit a patch, but only if it will be
  accepted.
 
  Thanks,
  Ben
 
 
  On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann b...@benmccann.com
 wrote:
 
  Hi,
 
  I was wondering if it would be interesting to add some type of
  document-oriented data type.
 
  I've found it somewhat awkward to store document-oriented data in
  Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it,
 and
  store it, but Cassandra cannot differentiate it from any other string
  or
  byte array.  However, if my column validation_class could be a
 JsonType
  that would allow tools to potentially do more interesting
  introspection on
  the column value.  E.g. bug 3647
  https://issues.apache.org/jira/browse/CASSANDRA-3647 calls for
 supporting
  arbitrarily nested documents in CQL.  Running a
  query against the JSON column in Pig is possible as well, but again
 in
  this
  use case it would be helpful to be able to encode in column metadata
  that
  the column is stored as JSON.  For debugging, running nightly
 reports,
  etc.
  it would be quite useful compared to the opaque string and byte array
  types
  we have today.  JSON is appealing because it would be easy to
  implement.
  Something like Thrift or Protocol Buffers would actually be
 interesting
  since they would be more space efficient.  However, they would also
 be
  a
  bit more difficult to implement because of the extra typing
 information
  they provide.  I'm hoping with Cassandra 1.0's addition of
 compression
  that
  storing JSON is not too inefficient.
 
  Would there be interest in adding a JsonType?  I could look at
 putting
  a
  patch together.
 
  Thanks,
  Ben
 
 
 
 




Re: Document storage

2012-03-29 Thread Jake Luciani
Is there a reason you would prefer a JSONType over CASSANDRA-3647?  It
would seem the only thing a JSON type offers you is validation.  3647 takes
it much further by deconstructing a JSON document using composite columns
to flatten the document out, with the ability to access and update portions
of the document (as well as reconstruct it).

On Wed, Mar 28, 2012 at 11:58 AM, Ben McCann b...@benmccann.com wrote:

 Hi,

 I was wondering if it would be interesting to add some type of
 document-oriented data type.

 I've found it somewhat awkward to store document-oriented data in Cassandra
 today.  I can make a JSON/Protobuf/Thrift, serialize it, and store it, but
 Cassandra cannot differentiate it from any other string or byte array.
  However, if my column validation_class could be a JsonType that would
 allow tools to potentially do more interesting introspection on the column
 value.  E.g. bug 3647
 https://issues.apache.org/jira/browse/CASSANDRA-3647 calls for
 supporting arbitrarily nested documents in CQL.  Running a
 query against the JSON column in Pig is possible as well, but again in this
 use case it would be helpful to be able to encode in column metadata that
 the column is stored as JSON.  For debugging, running nightly reports, etc.
 it would be quite useful compared to the opaque string and byte array types
 we have today.  JSON is appealing because it would be easy to implement.
  Something like Thrift or Protocol Buffers would actually be interesting
 since they would be more space efficient.  However, they would also be a
 bit more difficult to implement because of the extra typing information
 they provide.  I'm hoping with Cassandra 1.0's addition of compression that
 storing JSON is not too inefficient.

 Would there be interest in adding a JsonType?  I could look at putting a
 patch together.

 Thanks,
 Ben




-- 
http://twitter.com/tjake


Re: Document storage

2012-03-29 Thread Ben McCann
Could you explain further how I would use CASSANDRA-3647?  There's still
very little documentation on composite columns and it was not clear to me
whether they could be used to store document oriented data.  Say for
example that I had a document like:

user: {
  firstName: 'ben',
  skills: ['java', 'javascript', 'html'],
  education: {
    school: 'cmu',
    major: 'computer science'
  }
}

How would I flatten this to be stored and then reconstruct the document?


On Thu, Mar 29, 2012 at 5:44 AM, Jake Luciani jak...@gmail.com wrote:

 Is there a reason you would prefer a JSONType over CASSANDRA-3647?  It
 would seem the only thing a JSON type offers you is validation.  3647 takes
 it much further by deconstructing a JSON document using composite columns
 to flatten the document out, with the ability to access and update portions
 of the document (as well as reconstruct it).

 On Wed, Mar 28, 2012 at 11:58 AM, Ben McCann b...@benmccann.com wrote:

  Hi,
 
  I was wondering if it would be interesting to add some type of
  document-oriented data type.
 
  I've found it somewhat awkward to store document-oriented data in
 Cassandra
  today.  I can make a JSON/Protobuf/Thrift, serialize it, and store it,
 but
  Cassandra cannot differentiate it from any other string or byte array.
   However, if my column validation_class could be a JsonType that would
  allow tools to potentially do more interesting introspection on the
 column
  value.  E.g. bug 3647
  https://issues.apache.org/jira/browse/CASSANDRA-3647 calls for
  supporting arbitrarily nested documents in CQL.  Running a
  query against the JSON column in Pig is possible as well, but again in
 this
  use case it would be helpful to be able to encode in column metadata that
  the column is stored as JSON.  For debugging, running nightly reports,
 etc.
  it would be quite useful compared to the opaque string and byte array
 types
  we have today.  JSON is appealing because it would be easy to implement.
   Something like Thrift or Protocol Buffers would actually be interesting
  since they would be more space efficient.  However, they would also be a
  bit more difficult to implement because of the extra typing information
  they provide.  I'm hoping with Cassandra 1.0's addition of compression
 that
  storing JSON is not too inefficient.
 
  Would there be interest in adding a JsonType?  I could look at putting a
  patch together.
 
  Thanks,
  Ben
 



 --
 http://twitter.com/tjake



Re: Document storage

2012-03-29 Thread Rick Branson
Ben,

You can create a materialized path for each field in the document:

{
[user, firstName]: ben,
[user, skills, TimeUUID]: java,
[user, skills, TimeUUID]: javascript,
[user, skills, TimeUUID]: html,
[user, education, school]: cmu,
[user, education, major]: computer science 
}

This way each field could be independently updated, and you can take
sub-document slices with queries such as "give me everything under
user/skills".

Rick
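Rick's flattening scheme can be sketched in a few lines; this is a minimal Python sketch, with list elements keyed by index instead of TimeUUID (an assumption made for readability):

```python
# Flatten a nested document into (materialized path, value) pairs and
# rebuild it, per Rick's scheme.
def flatten(value, prefix=()):
    rows = {}
    if isinstance(value, dict):
        for key, child in value.items():
            rows.update(flatten(child, prefix + (key,)))
    elif isinstance(value, list):
        for i, child in enumerate(value):  # TimeUUID column names in Rick's version
            rows.update(flatten(child, prefix + (str(i),)))
    else:
        rows[prefix] = value
    return rows

def unflatten(rows):
    # Rebuild nested dicts from the paths.  List fields come back as
    # index-keyed maps; a real reconstructor would restore them from
    # the TimeUUID-ordered columns.
    doc = {}
    for path, value in rows.items():
        node = doc
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return doc

doc = {"user": {"firstName": "ben",
                "skills": ["java", "javascript", "html"],
                "education": {"school": "cmu", "major": "computer science"}}}
rows = flatten(doc)

# Each field is independently addressable (and so independently updatable)...
assert rows[("user", "education", "school")] == "cmu"
# ...and a slice like "everything under user/skills" is a prefix match:
skills = {p: v for p, v in rows.items() if p[:2] == ("user", "skills")}
assert len(skills) == 3
```

In Cassandra terms each path tuple would be a composite column name and the prefix match a column slice; the sketch only shows the shape of the transformation.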


On Thursday, March 29, 2012 at 7:27 AM, Ben McCann wrote:

 Could you explain further how I would use CASSANDRA-3647? There's still
 very little documentation on composite columns and it was not clear to me
 whether they could be used to store document oriented data. Say for
 example that I had a document like:
 
 user: {
 firstName: 'ben',
 skills: ['java', 'javascript', 'html'],
 education {
 school: 'cmu',
 major: 'computer science'
 }
 }
 
 How would I flatten this to be stored and then reconstruct the document?
 
 
 On Thu, Mar 29, 2012 at 5:44 AM, Jake Luciani jak...@gmail.com 
 (mailto:jak...@gmail.com) wrote:
 
  Is there a reason you would prefer a JSONType over CASSANDRA-3647? It
  would seem the only thing a JSON type offers you is validation. 3647 takes
  it much further by deconstructing a JSON document using composite columns
  to flatten the document out, with the ability to access and update portions
  of the document (as well as reconstruct it).
  
  On Wed, Mar 28, 2012 at 11:58 AM, Ben McCann b...@benmccann.com 
  (mailto:b...@benmccann.com) wrote:
  
   Hi,
   
   I was wondering if it would be interesting to add some type of
   document-oriented data type.
   
   I've found it somewhat awkward to store document-oriented data in
  Cassandra
   today. I can make a JSON/Protobuf/Thrift, serialize it, and store it,
  
  
  but
   Cassandra cannot differentiate it from any other string or byte array.
   However, if my column validation_class could be a JsonType that would
   allow tools to potentially do more interesting introspection on the
  
  
  column
   value. E.g. bug 3647
   https://issues.apache.org/jira/browse/CASSANDRA-3647 calls for
   supporting arbitrarily nested documents in CQL. Running a
   query against the JSON column in Pig is possible as well, but again in
  
  
  this
   use case it would be helpful to be able to encode in column metadata that
   the column is stored as JSON. For debugging, running nightly reports,
  
  
  etc.
   it would be quite useful compared to the opaque string and byte array
  
  
  types
   we have today. JSON is appealing because it would be easy to implement.
   Something like Thrift or Protocol Buffers would actually be interesting
   since they would be more space efficient. However, they would also be a
   bit more difficult to implement because of the extra typing information
   they provide. I'm hoping with Cassandra 1.0's addition of compression
  
  
  that
   storing JSON is not too inefficient.
   
   Would there be interest in adding a JsonType? I could look at putting a
   patch together.
   
   Thanks,
   Ben
  
  
  
  
  
  --
  http://twitter.com/tjake
 





RE: Document storage

2012-03-29 Thread Jeremiah Jordan
It's not clear what 3647 actually is; there is no code attached, and no real
example in it.

Aside from that, the reason this would be useful to me (if we could get
indexing of attributes working) is that I already have my data in
JSON/Thrift/Protobuf; depending on how large the data is, it isn't trivial to
break it up into columns on insert and re-assemble it on read.
Also, until we get multiple slice range reads, I can't read two different
structures out of one row without getting all the other stuff between them,
unless there are only two columns and I read them using column names, not slices.

As it is right now, I have to maintain custom indexes on all my attributes to be
able to put Protobufs into columns and get some searching on them.  It would
be nice if I could drop all my custom indexing code and just tell Cassandra:
hey, index column.attr1.subattr2.

-Jeremiah
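The custom-index maintenance Jeremiah describes boils down to pulling one dotted-path attribute out of each decoded blob and feeding it to an index. A hedged sketch (the attribute names like `attr1.subattr2` are illustrative, not a real schema, and JSON stands in for Protobuf):

```python
import json

def extract(blob: bytes, dotted_path: str):
    # Decode the stored blob and walk a dotted path such as
    # "attr1.subattr2" to the value a secondary index would cover.
    node = json.loads(blob)
    for part in dotted_path.split("."):
        node = node[part]
    return node

blob = json.dumps({"attr1": {"subattr2": "value"}}).encode("utf-8")
assert extract(blob, "attr1.subattr2") == "value"
```

If Cassandra understood the blob's format, this walk (and the index row it feeds) is what "index column.attr1.subattr2" would delegate to the server.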

From: Jake Luciani [jak...@gmail.com]
Sent: Thursday, March 29, 2012 7:44 AM
To: dev@cassandra.apache.org
Subject: Re: Document storage

Is there a reason you would prefer a JSONType over CASSANDRA-3647?  It
would seem the only thing a JSON type offers you is validation.  3647 takes
it much further by deconstructing a JSON document using composite columns
to flatten the document out, with the ability to access and update portions
of the document (as well as reconstruct it).

On Wed, Mar 28, 2012 at 11:58 AM, Ben McCann b...@benmccann.com wrote:

 Hi,

 I was wondering if it would be interesting to add some type of
 document-oriented data type.

 I've found it somewhat awkward to store document-oriented data in Cassandra
 today.  I can make a JSON/Protobuf/Thrift, serialize it, and store it, but
 Cassandra cannot differentiate it from any other string or byte array.
  However, if my column validation_class could be a JsonType that would
 allow tools to potentially do more interesting introspection on the column
 value.  E.g. bug 3647
 https://issues.apache.org/jira/browse/CASSANDRA-3647calls for
 supporting arbitrarily nested documents in CQL.  Running a
 query against the JSON column in Pig is possible as well, but again in this
 use case it would be helpful to be able to encode in column metadata that
 the column is stored as JSON.  For debugging, running nightly reports, etc.
 it would be quite useful compared to the opaque string and byte array types
 we have today.  JSON is appealing because it would be easy to implement.
  Something like Thrift or Protocol Buffers would actually be interesting
 since they would be more space efficient.  However, they would also be a
 bit more difficult to implement because of the extra typing information
 they provide.  I'm hoping with Cassandra 1.0's addition of compression that
 storing JSON is not too inefficient.

 Would there be interest in adding a JsonType?  I could look at putting a
 patch together.

 Thanks,
 Ben




--
http://twitter.com/tjake




Re: Document storage

2012-03-29 Thread Tyler Patterson


 Would there be interest in adding a JsonType?


What about checking that data inserted into a JsonType is valid JSON? How
would you do it, and would the overhead be something we are concerned
about, especially if the JSON string is large?
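The validation being discussed amounts to attempting a parse and rejecting on failure, so the overhead is a full scan of the value. A minimal sketch (function name is illustrative):

```python
import json

def is_valid_json(value):
    """Validate by parsing: cost is O(len(value)), which is exactly the
    overhead concern for large documents on the write path."""
    try:
        json.loads(value)
        return True
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False

print(is_valid_json('{"a": 1}'))  # True
print(is_valid_json('{"a": 1'))   # False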


Re: Document storage

2012-03-29 Thread Ben McCann
Creating materialized paths may well be a possible solution.  If that were
the solution the community were to agree upon then I would like it to be a
standardized and well-documented best practice.  I asked how to store a
list of values on the user
listhttp://www.mail-archive.com/user@cassandra.apache.org/msg21274.html
and
no one suggested [fieldName, TimeUUID]: fieldValue.  It would be a
huge pain right now to create materialized paths like this for each of my
objects, so client library support would definitely be needed.  And the
client libraries should agree.  If Astyanax and lazyboy both add support
for materialized path and I write an object to Cassandra with Astyanax,
then I should be able to read it back with lazyboy.  The benefit of using
JSON/SMILE is that it's very clear that there's exactly one way to
serialize and deserialize the data and it's very easy.  It's not clear to
me that this is true using materialized paths.


On Thu, Mar 29, 2012 at 8:21 AM, Tyler Patterson tpatter...@datastax.comwrote:

 
 
  Would there be interest in adding a JsonType?


 What about checking that data inserted into a JsonType is valid JSON? How
 would you do it, and would the overhead be something we are concerned
 about, especially if the JSON string is large?



Re: Document storage

2012-03-29 Thread Edward Capriolo
The issue with these super complex types is to do anything useful with
them you would either need scanners or co processors. As its stands
right now complex data like json is fairly opaque to Cassandra.
Getting cassandra to natively speak protobuffs or whatever flavor of
the week serialization framework is hip right now we make the codebase
very large. How is that field sorted? How is it indexed? This is
starting to go very far against the schema-less nosql grain. Where
does this end up users wanting to store binary XML index it and feed
cassandra XPath queries?


On Thu, Mar 29, 2012 at 11:23 AM, Ben McCann b...@benmccann.com wrote:
 Creating materialized paths may well be a possible solution.  If that were
 the solution the community were to agree upon then I would like it to be a
 standardized and well-documented best practice.  I asked how to store a
 list of values on the user
 listhttp://www.mail-archive.com/user@cassandra.apache.org/msg21274.html
 and
 no one suggested [fieldName, TimeUUID]: fieldValue.  It would be a
 huge pain right now to create materialized paths like this for each of my
 objects, so client library support would definitely be needed.  And the
 client libraries should agree.  If Astyanax and lazyboy both add support
 for materialized path and I write an object to Cassandra with Astyanax,
 then I should be able to read it back with lazyboy.  The benefit of using
 JSON/SMILE is that it's very clear that there's exactly one way to
 serialize and deserialize the data and it's very easy.  It's not clear to
 me that this is true using materialized paths.


 On Thu, Mar 29, 2012 at 8:21 AM, Tyler Patterson 
 tpatter...@datastax.comwrote:

 
 
  Would there be interest in adding a JsonType?


 What about checking that data inserted into a JsonType is valid JSON? How
 would you do it, and would the overhead be something we are concerned
 about, especially if the JSON string is large?



Re: Document storage

2012-03-29 Thread Jonathan Ellis
On Thu, Mar 29, 2012 at 9:57 AM, Jeremiah Jordan
jeremiah.jor...@morningstar.com wrote:
 Its not clear what 3647 actually is, there is no code attached, and no real 
 example in it.

 Aside from that, the reason this would be useful to me (if we could get 
 indexing of attributes working), is that I already have my data in 
 JSON/Thrift/ProtoBuff, depending how large the data is, it isn't trivial to 
 break it up into columns to insert, and re-assemble into columns to read.

I don't understand the problem.  Assuming Cassandra support for maps
and lists, I could write a Python module that takes json (or thrift,
or protobuf) objects and splits them into Cassandra rows by fields in
a couple hours.  I'm pretty sure this is essentially what Brian's REST
api for Cassandra does now.

I think this is a much better approach because that gives you the
ability to update or retrieve just parts of objects efficiently,
rather than making column values just blobs with a bunch of special
case logic to introspect them.  Which feels like a big step backwards
to me.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


RE: Document storage

2012-03-29 Thread Jeremiah Jordan
But it isn't special case logic.  The current AbstractType and Indexing of 
Abstract types for the most part would already support this.  Someone just has 
to write the code for JSONType or ProtoBuffType.

The problem isn't writing the code to break objects up, the problem is 
encode/decode time.  Encode/decode to thrift is already a significant portion 
of the time line in writing data, adding an object to column encode/decode on 
top of that makes it even longer.  For a read heavy load that wants the 
JSON/Proto as the thing to be served to clients, an increase in the write time 
line to parse/index the blob is probably acceptable, so that you don't have to 
pay the re-assemble penalty every time you hit the database for that object.

But, once we get multi range slicing, for the average case I think the break it 
up into multiple columns approach will be best for most people.  That is the 
other problem I have with doing the break into columns thing right now.  I have 
to either use Super Columns and not be able to index, so why did I break them 
up?  Or I can't get multiple objects at once, with out pulling a huge slice 
from o1 start to o5 end and then throwing away the majority of the data I 
pulled back that doesn't belong to o1 and o5

-Jeremiah


From: Jonathan Ellis [jbel...@gmail.com]
Sent: Thursday, March 29, 2012 11:23 AM
To: dev@cassandra.apache.org
Subject: Re: Document storage

On Thu, Mar 29, 2012 at 9:57 AM, Jeremiah Jordan
jeremiah.jor...@morningstar.com wrote:
 Its not clear what 3647 actually is, there is no code attached, and no real 
 example in it.

 Aside from that, the reason this would be useful to me (if we could get 
 indexing of attributes working), is that I already have my data in 
 JSON/Thrift/ProtoBuff, depending how large the data is, it isn't trivial to 
 break it up into columns to insert, and re-assemble into columns to read.

I don't understand the problem.  Assuming Cassandra support for maps
and lists, I could write a Python module that takes json (or thrift,
or protobuf) objects and splits them into Cassandra rows by fields in
a couple hours.  I'm pretty sure this is essentially what Brian's REST
api for Cassandra does now.

I think this is a much better approach because that gives you the
ability to update or retrieve just parts of objects efficiently,
rather than making column values just blobs with a bunch of special
case logic to introspect them.  Which feels like a big step backwards
to me.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Document storage

2012-03-29 Thread Drew Kutcharian
Hi Ben,

Sure, there's nothing really to it, but I'll email it to you. As far as why I'm 
using Snappy on the type instead of sstable_compression is because when you set 
sstable_compression the compression happens on the Cassandra nodes and I see 
two advantages with my approach:

1. Saving extra CPU usage on the Cassandra nodes. Since 
compression/decompression can easily be done on the client nodes where there is 
plenty idle CPU time

2. Saving network bandwidth since you're sending over a compressed byte[]

One thing to note about my approach is that when I define the schema in 
Cassandra, I define the columns as byte[] and not my custom type and I do all 
the conversion on the client side.

-- Drew




On Mar 29, 2012, at 12:04 AM, Ben McCann wrote:

 Sounds awesome Drew.  Mind sharing your custom type?  I just wrote a basic
 JSON type and did the validation the same way you did, but I don't have any
 SMILE support yet.  It seems that if your type were committed to the
 Cassandra codebase then the issue you ran into of the CLI only supporting
 built-in types would no longer be a problem for you (though fixing the
 issue anyway would be good and I voted for it).  Btw, any reason you
 compress it with Snappy yourself instead of just setting sstable_compression
 to 
 SnappyCompressorhttp://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compressionand
 letting Cassandra do that part?
 
 -Ben
 
 
 On Wed, Mar 28, 2012 at 11:28 PM, Drew Kutcharian d...@venarc.com wrote:
 
 I'm actually doing something almost the same. I serialize my objects into
 byte[] using Jackson's SMILE format, then compress it using Snappy then
 store the byte[] in Cassandra. I actually created a simple Cassandra Type
 for this but I hit a wall with cassandra-cli:
 
 https://issues.apache.org/jira/browse/CASSANDRA-4081
 
 Please vote on the JIRA if you are interested.
 
 Validation is pretty simple, you just need to read the value and parse it
 using Jackson, if you don't get any exceptions you're JSON/Smile is valid ;)
 
 -- Drew
 
 
 
 On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:
 
 I don't imagine sort is a meaningful operation on JSON data.  As long as
 the sorting is consistent I would think that should be sufficient.
 
 
 On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
 
 Some work I did stores JSON blobs in columns. The question on JSON
 type is how to sort it.
 
 On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
 jeremy.hanna1...@gmail.com wrote:
 I don't speak for the project, but you might give it a day or two for
 people to respond and/or perhaps create a jira ticket.  Seems like
 that's a
 reasonable data type that would get some traction - a json type.
 However,
 what would validation look like?  That's one of the main reasons there
 are
 the data types and validators, in order to validate on insert.
 
 On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
 
 Any thoughts?  I'd like to submit a patch, but only if it will be
 accepted.
 
 Thanks,
 Ben
 
 
 On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann b...@benmccann.com
 wrote:
 
 Hi,
 
 I was wondering if it would be interesting to add some type of
 document-oriented data type.
 
 I've found it somewhat awkward to store document-oriented data in
 Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it,
 and
 store it, but Cassandra cannot differentiate it from any other string
 or
 byte array.  However, if my column validation_class could be a
 JsonType
 that would allow tools to potentially do more interesting
 introspection on
 the column value.  E.g. bug 3647
 https://issues.apache.org/jira/browse/CASSANDRA-3647calls for
 supporting
 arbitrarily nested documents in CQL.  Running a
 query against the JSON column in Pig is possible as well, but again
 in
 this
 use case it would be helpful to be able to encode in column metadata
 that
 the column is stored as JSON.  For debugging, running nightly
 reports,
 etc.
 it would be quite useful compared to the opaque string and byte array
 types
 we have today.  JSON is appealing because it would be easy to
 implement.
 Something like Thrift or Protocol Buffers would actually be
 interesting
 since they would be more space efficient.  However, they would also
 be
 a
 bit more difficult to implement because of the extra typing
 information
 they provide.  I'm hoping with Cassandra 1.0's addition of
 compression
 that
 storing JSON is not too inefficient.
 
 Would there be interest in adding a JsonType?  I could look at
 putting
 a
 patch together.
 
 Thanks,
 Ben
 
 
 
 
 
 



Re: Document storage

2012-03-29 Thread Drew Kutcharian
I agree with Edward here, the simpler we keep the core the better. I think all 
the ser/deser and conversions should happen on the client side.

-- Drew


On Mar 29, 2012, at 8:36 AM, Edward Capriolo wrote:

 The issue with these super complex types is to do anything useful with
 them you would either need scanners or co processors. As its stands
 right now complex data like json is fairly opaque to Cassandra.
 Getting cassandra to natively speak protobuffs or whatever flavor of
 the week serialization framework is hip right now we make the codebase
 very large. How is that field sorted? How is it indexed? This is
 starting to go very far against the schema-less nosql grain. Where
 does this end up users wanting to store binary XML index it and feed
 cassandra XPath queries?
 
 
 On Thu, Mar 29, 2012 at 11:23 AM, Ben McCann b...@benmccann.com wrote:
 Creating materialized paths may well be a possible solution.  If that were
 the solution the community were to agree upon then I would like it to be a
 standardized and well-documented best practice.  I asked how to store a
 list of values on the user
 listhttp://www.mail-archive.com/user@cassandra.apache.org/msg21274.html
 and
 no one suggested [fieldName, TimeUUID]: fieldValue.  It would be a
 huge pain right now to create materialized paths like this for each of my
 objects, so client library support would definitely be needed.  And the
 client libraries should agree.  If Astyanax and lazyboy both add support
 for materialized path and I write an object to Cassandra with Astyanax,
 then I should be able to read it back with lazyboy.  The benefit of using
 JSON/SMILE is that it's very clear that there's exactly one way to
 serialize and deserialize the data and it's very easy.  It's not clear to
 me that this is true using materialized paths.
 
 
 On Thu, Mar 29, 2012 at 8:21 AM, Tyler Patterson 
 tpatter...@datastax.comwrote:
 
 
 
 Would there be interest in adding a JsonType?
 
 
 What about checking that data inserted into a JsonType is valid JSON? How
 would you do it, and would the overhead be something we are concerned
 about, especially if the JSON string is large?
 



Re: Document storage

2012-03-29 Thread Drew Kutcharian
 I think this is a much better approach because that gives you the
 ability to update or retrieve just parts of objects efficiently,
 rather than making column values just blobs with a bunch of special
 case logic to introspect them.  Which feels like a big step backwards
 to me.

Unless your access pattern involves reading/writing the whole document each 
time. In that case you're better off serializing the whole document and storing 
it in a column as a byte[] without incurring the overhead of column indexes. 
Right?


On Mar 29, 2012, at 9:23 AM, Jonathan Ellis wrote:

 On Thu, Mar 29, 2012 at 9:57 AM, Jeremiah Jordan
 jeremiah.jor...@morningstar.com wrote:
 Its not clear what 3647 actually is, there is no code attached, and no real 
 example in it.
 
 Aside from that, the reason this would be useful to me (if we could get 
 indexing of attributes working), is that I already have my data in 
 JSON/Thrift/ProtoBuff, depending how large the data is, it isn't trivial to 
 break it up into columns to insert, and re-assemble into columns to read.
 
 I don't understand the problem.  Assuming Cassandra support for maps
 and lists, I could write a Python module that takes json (or thrift,
 or protobuf) objects and splits them into Cassandra rows by fields in
 a couple hours.  I'm pretty sure this is essentially what Brian's REST
 api for Cassandra does now.
 
 I think this is a much better approach because that gives you the
 ability to update or retrieve just parts of objects efficiently,
 rather than making column values just blobs with a bunch of special
 case logic to introspect them.  Which feels like a big step backwards
 to me.
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



Re: Document storage

2012-03-29 Thread Jonathan Ellis
On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com wrote:
 I think this is a much better approach because that gives you the
 ability to update or retrieve just parts of objects efficiently,
 rather than making column values just blobs with a bunch of special
 case logic to introspect them.  Which feels like a big step backwards
 to me.

 Unless your access pattern involves reading/writing the whole document each 
 time. In that case you're better off serializing the whole document and 
 storing it in a column as a byte[] without incurring the overhead of column 
 indexes. Right?

Hmm, not sure what you're thinking of there.

If you mean the index that's part of the row header for random
access within a row, then no, serializing to byte[] doesn't save you
anything.

If you mean secondary indexes, don't declare any if you don't want any. :)

Just telling C* to store a byte[] *will* be slightly lighter-weight
than giving it named columns, but we're talking negligible compared to
the overhead of actually moving the data on or off disk in the first
place.  Not even close to being worth giving up being able to deal
with your data from standard tools like cqlsh, IMO.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Document storage

2012-03-29 Thread Drew Kutcharian
Yes, I meant the row header index. What I have done is that I'm storing an 
object (i.e. UserProfile) where you read or write it as a whole (a user updates 
their user details in a single page in the UI). So I serialize that object into 
a binary JSON using SMILE format. I then compress it using Snappy on the client 
side. So as far as Cassandra cares it's storing a byte[].

Now on the client side, I'm using cassandra-cli with a custom type that knows 
how to turn a byte[] into a JSON text and back. The only issue was 
CASSANDRA-4081 where assume doesn't work with custom types. If CASSANDRA-4081 
gets fixed, I'll get the best of both worlds.

Also advantages of this vs. the thrift based Super Column families are:

1. Saving extra CPU usage on the Cassandra nodes. Since serialize/deserialize 
and compression/decompression happens on the client nodes where there is plenty 
idle CPU time

2. Saving network bandwidth since I'm sending over a compressed byte[]


-- Drew



On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:

 On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com wrote:
 I think this is a much better approach because that gives you the
 ability to update or retrieve just parts of objects efficiently,
 rather than making column values just blobs with a bunch of special
 case logic to introspect them.  Which feels like a big step backwards
 to me.
 
 Unless your access pattern involves reading/writing the whole document each 
 time. In that case you're better off serializing the whole document and 
 storing it in a column as a byte[] without incurring the overhead of column 
 indexes. Right?
 
 Hmm, not sure what you're thinking of there.
 
 If you mean the index that's part of the row header for random
 access within a row, then no, serializing to byte[] doesn't save you
 anything.
 
 If you mean secondary indexes, don't declare any if you don't want any. :)
 
 Just telling C* to store a byte[] *will* be slightly lighter-weight
 than giving it named columns, but we're talking negligible compared to
 the overhead of actually moving the data on or off disk in the first
 place.  Not even close to being worth giving up being able to deal
 with your data from standard tools like cqlsh, IMO.
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



Re: Document storage

2012-03-29 Thread Ben McCann
Jonathan, I asked Brian about his REST
APIhttps://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas9C8Usand
he said he does not take the json objects and split them because the
client libraries do not agree on implementations.  This was exactly my
concern as well with this solution.  I would be perfectly happy to do it
this way instead of using JSON if it were standardized.  The reason I
suggested JSON is that it is standardized.  As far as I can tell, Cassandra
doesn't support maps and lists in a standardized way today, which is the
root of my problem.

-Ben


On Thu, Mar 29, 2012 at 11:30 AM, Drew Kutcharian d...@venarc.com wrote:

 Yes, I meant the row header index. What I have done is that I'm storing
 an object (i.e. UserProfile) where you read or write it as a whole (a user
 updates their user details in a single page in the UI). So I serialize that
 object into a binary JSON using SMILE format. I then compress it using
 Snappy on the client side. So as far as Cassandra cares it's storing a
 byte[].

 Now on the client side, I'm using cassandra-cli with a custom type that
 knows how to turn a byte[] into a JSON text and back. The only issue was
 CASSANDRA-4081 where assume doesn't work with custom types. If
 CASSANDRA-4081 gets fixed, I'll get the best of both worlds.

 Also advantages of this vs. the thrift based Super Column families are:

 1. Saving extra CPU usage on the Cassandra nodes. Since
 serialize/deserialize and compression/decompression happens on the client
 nodes where there is plenty idle CPU time

 2. Saving network bandwidth since I'm sending over a compressed byte[]


 -- Drew



 On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:

  On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com
 wrote:
  I think this is a much better approach because that gives you the
  ability to update or retrieve just parts of objects efficiently,
  rather than making column values just blobs with a bunch of special
  case logic to introspect them.  Which feels like a big step backwards
  to me.
 
  Unless your access pattern involves reading/writing the whole document
 each time. In that case you're better off serializing the whole document
 and storing it in a column as a byte[] without incurring the overhead of
 column indexes. Right?
 
  Hmm, not sure what you're thinking of there.
 
  If you mean the index that's part of the row header for random
  access within a row, then no, serializing to byte[] doesn't save you
  anything.
 
  If you mean secondary indexes, don't declare any if you don't want any.
 :)
 
  Just telling C* to store a byte[] *will* be slightly lighter-weight
  than giving it named columns, but we're talking negligible compared to
  the overhead of actually moving the data on or off disk in the first
  place.  Not even close to being worth giving up being able to deal
  with your data from standard tools like cqlsh, IMO.
 
  --
  Jonathan Ellis
  Project Chair, Apache Cassandra
  co-founder of DataStax, the source for professional Cassandra support
  http://www.datastax.com




Re: Document storage

2012-03-29 Thread Jonathan Ellis
On Thu, Mar 29, 2012 at 2:06 PM, Ben McCann b...@benmccann.com wrote:
 As far as I can tell, Cassandra
 doesn't support maps and lists in a standardized way today, which is the
 root of my problem.

I'm pretty serious about adding those for 1.2, for what that's worth.
(If you want to jump in and help code that up, so much the better.)

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Document storage

2012-03-29 Thread Brian O'Neill
Jonathan, 

I was actually going to take this up with Nate McCall a few weeks back.  I
think it might make sense to get the client development community together
(Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.)

I agree whole-heartedly that it shouldn't go into the database for all the
reasons you point out.

If we can all decide on some standards for data storage (e.g. composite
types), indexing strategies, etc.  We can provide higher-level functions
through the client libraries and also provide interoperability between
them.  (without bloating Cassandra)

CCing Nate.  Nate, thoughts?
I wouldn't mind coordinating/facilitating the conversation.  If we know
who should be involved.

-brian

 
Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/







On 3/29/12 3:06 PM, Ben McCann b...@benmccann.com wrote:

Jonathan, I asked Brian about his REST
APIhttps://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas
9C8Usand
he said he does not take the json objects and split them because the
client libraries do not agree on implementations.  This was exactly my
concern as well with this solution.  I would be perfectly happy to do it
this way instead of using JSON if it were standardized.  The reason I
suggested JSON is that it is standardized.  As far as I can tell,
Cassandra
doesn't support maps and lists in a standardized way today, which is the
root of my problem.

-Ben


On Thu, Mar 29, 2012 at 11:30 AM, Drew Kutcharian d...@venarc.com wrote:

 Yes, I meant the row header index. What I have done is that I'm
storing
 an object (i.e. UserProfile) where you read or write it as a whole (a
user
 updates their user details in a single page in the UI). So I serialize
that
 object into a binary JSON using SMILE format. I then compress it using
 Snappy on the client side. So as far as Cassandra cares it's storing a
 byte[].

 Now on the client side, I'm using cassandra-cli with a custom type that
 knows how to turn a byte[] into a JSON text and back. The only issue was
 CASSANDRA-4081 where assume doesn't work with custom types. If
 CASSANDRA-4081 gets fixed, I'll get the best of both worlds.

 Also advantages of this vs. the thrift based Super Column families are:

 1. Saving extra CPU usage on the Cassandra nodes. Since
 serialize/deserialize and compression/decompression happens on the
client
 nodes where there is plenty idle CPU time

 2. Saving network bandwidth since I'm sending over a compressed byte[]


 -- Drew



 On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:

  On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com
 wrote:
  I think this is a much better approach because that gives you the
  ability to update or retrieve just parts of objects efficiently,
  rather than making column values just blobs with a bunch of special
  case logic to introspect them.  Which feels like a big step
backwards
  to me.
 
  Unless your access pattern involves reading/writing the whole
document
 each time. In that case you're better off serializing the whole document
 and storing it in a column as a byte[] without incurring the overhead of
 column indexes. Right?
 
  Hmm, not sure what you're thinking of there.
 
  If you mean the index that's part of the row header for random
  access within a row, then no, serializing to byte[] doesn't save you
  anything.
 
  If you mean secondary indexes, don't declare any if you don't want
any.
 :)
 
  Just telling C* to store a byte[] *will* be slightly lighter-weight
  than giving it named columns, but we're talking negligible compared to
  the overhead of actually moving the data on or off disk in the first
  place.  Not even close to being worth giving up being able to deal
  with your data from standard tools like cqlsh, IMO.
 
  --
  Jonathan Ellis
  Project Chair, Apache Cassandra
  co-founder of DataStax, the source for professional Cassandra support
  http://www.datastax.com






Re: Document storage

2012-03-29 Thread Ben McCann
Thanks Jonathan.  The only reason I suggested JSON was because it already
has support for lists.  Native support for lists in Cassandra would more
than satisfy me.  Are there any existing proposals or a bug I can follow?
 I'm not familiar with the Cassandra codebase, so I'm not entirely sure how
helpful I can be, but I'd certainly be interested in taking a look to see
what's required.

-Ben


On Thu, Mar 29, 2012 at 12:19 PM, Brian O'Neill b...@alumni.brown.eduwrote:

 Jonathan,

 I was actually going to take this up with Nate McCall a few weeks back.  I
 think it might make sense to get the client development community together
 (Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.)

 I agree whole-heartedly that it shouldn't go into the database for all the
 reasons you point out.

 If we can all decide on some standards for data storage (e.g. composite
 types), indexing strategies, etc.  We can provide higher-level functions
 through the client libraries and also provide interoperability between
 them.  (without bloating Cassandra)

 CCing Nate.  Nate, thoughts?
 I wouldn't mind coordinating/facilitating the conversation.  If we know
 who should be involved.

 -brian

 
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
 p: 215.588.6024blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/







 On 3/29/12 3:06 PM, Ben McCann b...@benmccann.com wrote:

 Jonathan, I asked Brian about his REST
 API
 https://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas
 9C8Usand
 he said he does not take the json objects and split them because the
 client libraries do not agree on implementations.  This was exactly my
 concern as well with this solution.  I would be perfectly happy to do it
 this way instead of using JSON if it were standardized.  The reason I
 suggested JSON is that it is standardized.  As far as I can tell,
 Cassandra
 doesn't support maps and lists in a standardized way today, which is the
 root of my problem.
 
 -Ben
 
 
 On Thu, Mar 29, 2012 at 11:30 AM, Drew Kutcharian d...@venarc.com
 wrote:
 
  Yes, I meant the row header index. What I have done is that I'm
 storing
  an object (i.e. UserProfile) where you read or write it as a whole (a
 user
  updates their user details in a single page in the UI). So I serialize
 that
  object into a binary JSON using SMILE format. I then compress it using
  Snappy on the client side. So as far as Cassandra cares it's storing a
  byte[].
 
  Now on the client side, I'm using cassandra-cli with a custom type that
  knows how to turn a byte[] into a JSON text and back. The only issue was
  CASSANDRA-4081 where assume doesn't work with custom types. If
  CASSANDRA-4081 gets fixed, I'll get the best of both worlds.
 
  Also advantages of this vs. the thrift based Super Column families are:
 
  1. Saving extra CPU usage on the Cassandra nodes. Since
  serialize/deserialize and compression/decompression happens on the
 client
  nodes where there is plenty idle CPU time
 
  2. Saving network bandwidth since I'm sending over a compressed byte[]
 
 
  -- Drew
 
 
 
  On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:
 
   On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com
  wrote:
   I think this is a much better approach because that gives you the
   ability to update or retrieve just parts of objects efficiently,
   rather than making column values just blobs with a bunch of special
   case logic to introspect them.  Which feels like a big step
 backwards
   to me.
  
   Unless your access pattern involves reading/writing the whole
 document
  each time. In that case you're better off serializing the whole document
  and storing it in a column as a byte[] without incurring the overhead of
  column indexes. Right?
  
   Hmm, not sure what you're thinking of there.
  
   If you mean the index that's part of the row header for random
   access within a row, then no, serializing to byte[] doesn't save you
   anything.
  
   If you mean secondary indexes, don't declare any if you don't want
 any.
  :)
  
   Just telling C* to store a byte[] *will* be slightly lighter-weight
   than giving it named columns, but we're talking negligible compared to
   the overhead of actually moving the data on or off disk in the first
   place.  Not even close to being worth giving up being able to deal
   with your data from standard tools like cqlsh, IMO.
  
   --
   Jonathan Ellis
   Project Chair, Apache Cassandra
   co-founder of DataStax, the source for professional Cassandra support
   http://www.datastax.com
 
 





Re: Document storage

2012-03-29 Thread Brian O'Neill

Jonathan,

We store JSON as our column values.  I'd love to see support for maps and
lists.  If I get some time this weekend, I'll take a look to see what is
required.  It doesn't seem like it would be that hard.

-brian

 
Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/







On 3/29/12 3:18 PM, Jonathan Ellis jbel...@gmail.com wrote:

On Thu, Mar 29, 2012 at 2:06 PM, Ben McCann b...@benmccann.com wrote:
 As far as I can tell, Cassandra
 doesn't support maps and lists in a standardized way today, which is the
 root of my problem.

I'm pretty serious about adding those for 1.2, for what that's worth.
(If you want to jump in and help code that up, so much the better.)

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com




Re: Document storage

2012-03-29 Thread Jonathan Ellis
I kind of hijacked
https://issues.apache.org/jira/browse/CASSANDRA-3647 (Sylvain
suggests we start with (non-nested) lists, maps, and sets. I agree
that this is a great 80/20 approach to the problem) but we could
split it out to another ticket.

On Thu, Mar 29, 2012 at 2:24 PM, Ben McCann b...@benmccann.com wrote:
 Thanks Jonathan.  The only reason I suggested JSON was because it already
 has support for lists.  Native support for lists in Cassandra would more
 than satisfy me.  Are there any existing proposals or a bug I can follow?
  I'm not familiar with the Cassandra codebase, so I'm not entirely sure how
 helpful I can be, but I'd certainly be interested in taking a look to see
 what's required.

 -Ben


 On Thu, Mar 29, 2012 at 12:19 PM, Brian O'Neill b...@alumni.brown.eduwrote:

 Jonathan,

 I was actually going to take this up with Nate McCall a few weeks back.  I
 think it might make sense to get the client development community together
 (Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.)

 I agree whole-heartedly that it shouldn't go into the database for all the
 reasons you point out.

 If we can all decide on some standards for data storage (e.g. composite
 types), indexing strategies, etc.  We can provide higher-level functions
 through the client libraries and also provide interoperability between
 them.  (without bloating Cassandra)

 CCing Nate.  Nate, thoughts?
 I wouldn't mind coordinating/facilitating the conversation.  If we know
 who should be involved.

 -brian

 
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
  p: 215.588.6024
  blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/








Re: Document storage

2012-03-29 Thread Ben McCann
Cool.  How were you thinking we should store the data?  As a standardized
composite column (e.g. potentially a list as [fieldName, TimeUUID]:
fieldValue and a set as [fieldName, fieldValue]: )?  Or as a new
column type?
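[Editor's note: one way to picture the composite-column encoding Ben is asking about. A hypothetical Python sketch that models composite column names as tuples; the helper names are invented, not Cassandra API.]

```python
import uuid

def list_to_columns(field, values):
    """List: composite name (field, time-ordered UUID) -> element value,
    so elements sort by insertion time and duplicates are allowed."""
    return {(field, uuid.uuid1()): v for v in values}

def set_to_columns(field, values):
    """Set: the member lives in the composite column *name*, the value is
    empty, so re-adding a member is an idempotent overwrite."""
    return {(field, v): b"" for v in values}

tags = list_to_columns("tags", ["a", "b", "a"])
assert sorted(tags.values()) == ["a", "a", "b"]   # duplicates preserved
roles = set_to_columns("roles", ["admin", "user"])
assert ("roles", "admin") in roles                # membership is a name lookup
```

This is why lists need the TimeUUID component (position/identity) while sets can put the value itself in the column name.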


On Thu, Mar 29, 2012 at 12:35 PM, Jonathan Ellis jbel...@gmail.com wrote:

 I kind of hijacked
 https://issues.apache.org/jira/browse/CASSANDRA-3647 (Sylvain
 suggests we start with (non-nested) lists, maps, and sets. I agree
 that this is a great 80/20 approach to the problem) but we could
 split it out to another ticket.


Document storage

2012-03-28 Thread Ben McCann
Hi,

I was wondering if it would be interesting to add some type of
document-oriented data type.

I've found it somewhat awkward to store document-oriented data in Cassandra
today.  I can make a JSON/Protobuf/Thrift, serialize it, and store it, but
Cassandra cannot differentiate it from any other string or byte array.
 However, if my column validation_class could be a JsonType that would
allow tools to potentially do more interesting introspection on the column
value.  E.g. bug 3647
(https://issues.apache.org/jira/browse/CASSANDRA-3647) calls for
supporting arbitrarily nested documents in CQL.  Running a
query against the JSON column in Pig is possible as well, but again in this
use case it would be helpful to be able to encode in column metadata that
the column is stored as JSON.  For debugging, running nightly reports, etc.
it would be quite useful compared to the opaque string and byte array types
we have today.  JSON is appealing because it would be easy to implement.
 Something like Thrift or Protocol Buffers would actually be interesting
since they would be more space efficient.  However, they would also be a
bit more difficult to implement because of the extra typing information
they provide.  I'm hoping with Cassandra 1.0's addition of compression that
storing JSON is not too inefficient.

Would there be interest in adding a JsonType?  I could look at putting a
patch together.

Thanks,
Ben
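[Editor's note: the validation question raised later in the thread could be answered minimally: a JsonType only has to check that the byte[] parses. A hypothetical Python sketch; Cassandra's real validators are Java AbstractType subclasses, so this is illustrative only.]

```python
import json

class JsonType:
    """Hypothetical validator in the spirit of a Cassandra column validator:
    accept a byte[] only if it decodes and parses as JSON."""

    @staticmethod
    def validate(value: bytes) -> None:
        try:
            json.loads(value.decode("utf-8"))
        except (UnicodeDecodeError, ValueError) as e:
            raise ValueError("not a valid JSON document") from e

JsonType.validate(b'{"a": [1, 2, 3]}')   # passes silently
try:
    JsonType.validate(b"\xff\xfenot json")
    rejected = False
except ValueError:
    rejected = True
assert rejected
```

Beyond parse-validation, the type carries metadata value: tools like Pig or nightly reports can know the column is JSON rather than an opaque blob.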


Re: Document storage

2012-03-28 Thread Ben McCann
Any thoughts?  I'd like to submit a patch, but only if it will be accepted.

Thanks,
Ben





Re: Document storage

2012-03-28 Thread Jeremy Hanna
I don't speak for the project, but you might give it a day or two for people to 
respond and/or perhaps create a jira ticket.  Seems like that's a reasonable 
data type that would get some traction - a json type.  However, what would 
validation look like?  That's one of the main reasons there are the data types 
and validators, in order to validate on insert.




Re: Document storage

2012-03-28 Thread Jeremiah Jordan
Sounds interesting to me.  I looked into adding protocol buffer support at one 
point, and it didn't look like it would be too much work.  The tricky part was 
I also wanted to add indexing support for attributes of the inserted protocol 
buffers.  That looked a little trickier, but still not impossible.  Though 
other stuff came up and I never got around to actually writing any code.
JSON support would be nice, especially if you figured out how to get built-in 
indexing of the attributes inside the JSON to work =).

-Jeremiah




Re: Document storage

2012-03-28 Thread Tatu Saloranta
On Wed, Mar 28, 2012 at 6:59 PM, Jeremiah Jordan
jeremiah.jor...@morningstar.com wrote:
 Sounds interesting to me.  I looked into adding protocol buffer support at 
 one point, and it didn't look like it would be too much work.  The tricky 
 part was I also wanted to add indexing support for attributes of the inserted 
 protocol buffers.  That looked a little trickier, but still not impossible.  
 Though other stuff came up and I never got around to actually writing any 
 code.
 JSON support would be nice, especially if you figured out how to get built in 
 indexing of the attributes inside the JSON to work =).

Also, for whatever it's worth, it should be trivial to add support for
Smile (binary JSON serialization):
http://wiki.fasterxml.com/SmileFormatSpec
since its logical data structure is pure JSON, no extensions or
subsetting. The main Java impl is by Jackson project, but there is
also a C codec (https://github.com/pierre/libsmile), and prototypes
for PHP and Ruby bindings as well.
For all data it's a bit faster and a bit more compact: about 30% for
individual items, and more (40-70%) for data sequences (due to
optional back-referencing).

JSON and Smile can be auto-detected from the first 4 bytes or so, reliably
and efficiently, so one should be able to add this either
transparently or explicitly.
One could even transcode things on the fly -- store as Smile, expose
filtered results as JSON (and accept JSON or both). This could reduce
storage cost while keep the benefits of flexible data format.

-+ Tatu +-
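[Editor's note: the auto-detection Tatu describes works because, per the Smile format spec, every Smile document opens with a fixed ":)\n" header, while JSON text starts with a structural character or whitespace. An illustrative sketch:]

```python
SMILE_HEADER = b":)\n"   # 0x3A 0x29 0x0A per the Smile format spec

def detect_format(payload: bytes) -> str:
    """Cheap sniff of the leading bytes: fixed header means Smile,
    a JSON structural character means JSON text."""
    if payload[:3] == SMILE_HEADER:
        return "smile"
    stripped = payload.lstrip()
    if stripped[:1] in (b"{", b"[", b'"'):
        return "json"
    return "unknown"

assert detect_format(b":)\n\x00") == "smile"
assert detect_format(b'  {"a": 1}') == "json"
```

This is what makes transparent transcoding plausible: store Smile, sniff on read, and expose JSON to clients that want it.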


Re: Document storage

2012-03-28 Thread Edward Capriolo
Some work I did stores JSON blobs in columns. The question on JSON
type is how to sort it.

On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
jeremy.hanna1...@gmail.com wrote:
 I don't speak for the project, but you might give it a day or two for people 
 to respond and/or perhaps create a jira ticket.  Seems like that's a 
 reasonable data type that would get some traction - a json type.  However, 
 what would validation look like?  That's one of the main reasons there are 
 the data types and validators, in order to validate on insert.

 On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:

 Any thoughts?  I'd like to submit a patch, but only if it will be accepted.

 Thanks,
 Ben


 On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann b...@benmccann.com wrote:

 Hi,

 I was wondering if it would be interesting to add some type of
 document-oriented data type.

 I've found it somewhat awkward to store document-oriented data in
 Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
 store it, but Cassandra cannot differentiate it from any other string or
 byte array.  However, if my column validation_class could be a JsonType
 that would allow tools to potentially do more interesting introspection on
 the column value.  E.g. bug 
 3647https://issues.apache.org/jira/browse/CASSANDRA-3647calls for 
 supporting arbitrarily nested documents in CQL.  Running a
 query against the JSON column in Pig is possible as well, but again in this
 use case it would be helpful to be able to encode in column metadata that
 the column is stored as JSON.  For debugging, running nightly reports, etc.
 it would be quite useful compared to the opaque string and byte array types
 we have today.  JSON is appealing because it would be easy to implement.
 Something like Thrift or Protocol Buffers would actually be interesting
 since they would be more space efficient.  However, they would also be a
 bit more difficult to implement because of the extra typing information
 they provide.  I'm hoping with Cassandra 1.0's addition of compression that
 storing JSON is not too inefficient.

 Would there be interest in adding a JsonType?  I could look at putting a
 patch together.

 Thanks,
 Ben





Re: Document storage

2012-03-28 Thread Ben McCann
I don't imagine sort is a meaningful operation on JSON data.  As long as
the sorting is consistent I would think that should be sufficient.
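[Editor's note: one way to get the "consistent sorting is sufficient" behavior Ben suggests is to compare a canonical re-serialization (sorted keys, fixed separators) rather than the raw stored bytes. An illustrative sketch, not a proposal for the actual comparator:]

```python
import json

def canonical_json_bytes(value: bytes) -> bytes:
    """Re-serialize with sorted keys so equal documents always compare equal,
    regardless of the key order the client happened to send."""
    return json.dumps(json.loads(value), sort_keys=True,
                      separators=(",", ":")).encode("utf-8")

def compare_json(a: bytes, b: bytes) -> int:
    """Total, consistent ordering over JSON blobs via canonical byte compare."""
    ca, cb = canonical_json_bytes(a), canonical_json_bytes(b)
    return (ca > cb) - (ca < cb)

# Key order no longer affects the comparison:
assert compare_json(b'{"a":1,"b":2}', b'{"b":2,"a":1}') == 0
```

The ordering is arbitrary but stable, which is all a comparator needs when sorting JSON has no natural meaning.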


On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 Some work I did stores JSON blobs in columns. The question on JSON
 type is how to sort it.
