Re: Document storage
> Just telling C* to store a byte[] *will* be slightly lighter-weight than giving it named columns, but we're talking negligible compared to the overhead of actually moving the data on or off disk in the first place.

Hm - but isn't this exactly the point? You don't want to move data off disk. But decomposing into columns will lead to more of that:

- The total amount of serialized data is (in most cases a lot) larger than the protobuffed / compressed version.
- If you do selective updates, the document will be scattered over multiple SSTables, and if you do sliced reads you can't optimize them. The single-column version, by contrast, automatically supersedes older versions when updated, so most reads will hit only one SSTable.

All these reads make up the hot dataset. If it fits the page cache, you're fine. If it doesn't, you need to buy more iron.

I really could not resist, because your statement seems to be contrary to all our tests / learnings.

Cheers,
Daniel

From the dev list:

Re: Document storage

On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com wrote:

>> I think this is a much better approach because that gives you the ability to update or retrieve just parts of objects efficiently, rather than making column values just blobs with a bunch of special-case logic to introspect them. Which feels like a big step backwards to me.
>
> Unless your access pattern involves reading/writing the whole document each time. In that case you're better off serializing the whole document and storing it in a column as a byte[] without incurring the overhead of column indexes. Right?

Hmm, not sure what you're thinking of there. If you mean the index that's part of the row header for random access within a row, then no, serializing to byte[] doesn't save you anything. If you mean secondary indexes, don't declare any if you don't want any. :)

Just telling C* to store a byte[] *will* be slightly lighter-weight than giving it named columns, but we're talking negligible compared to the overhead of actually moving the data on or off disk in the first place. Not even close to being worth giving up being able to deal with your data from standard tools like cqlsh, IMO.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
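Daniel's size argument can be made concrete with a small sketch (not from the thread; the document and helper names are illustrative, and zlib stands in for whatever compressor a real deployment would use). It compares a single compressed blob against the decomposed layout, where every leaf value carries its full column path as overhead:

```python
import json
import zlib

# A hypothetical document, stored either as one compressed blob
# or decomposed into one column per field.
doc = {
    "firstName": "ben",
    "skills": ["java", "javascript", "html"],
    "education": {"school": "cmu", "major": "computer science"},
}

# Single-column approach: one serialized, compressed payload.
blob = zlib.compress(json.dumps(doc).encode("utf-8"))

def column_size(path, value):
    # Each decomposed column stores its full path plus its value.
    return len("/".join(map(str, path)).encode("utf-8")) + len(str(value).encode("utf-8"))

def decomposed_size(node, path=()):
    # Sum the on-disk footprint of every leaf stored as its own column.
    if isinstance(node, dict):
        return sum(decomposed_size(v, path + (k,)) for k, v in node.items())
    if isinstance(node, list):
        return sum(decomposed_size(v, path + (i,)) for i, v in enumerate(node))
    return column_size(path, node)

print(len(blob), decomposed_size(doc))
```

The blob also round-trips losslessly (`json.loads(zlib.decompress(blob)) == doc`), which is the property the single-column approach relies on.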
Re: Document storage
Do we also need to consider the client API? If we don't adjust Thrift, the client just gets bytes back, right? The client is then on their own to marshal them back into a structure. In this case, it seems like we would want to choose a standard that is efficient and for which there are common libraries; Protobuf seems to fit the bill here. Or do we pass back some other structure? (Native lists/maps? JSON strings?)

Do we ignore sorting/comparators? (Similar to Solr, I'm not sure people have defined a good sort for multi-valued items.)

-brian

Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/

On 3/30/12 12:01 PM, Daniel Doubleday daniel.double...@gmx.net wrote:

> Hm - but isn't this exactly the point? You don't want to move data off disk. But decomposing into columns will lead to more of that.
Re: Document storage
On Fri, Mar 30, 2012 at 6:01 PM, Daniel Doubleday daniel.double...@gmx.net wrote:

> But decomposing into columns will lead to more of that:
> - Total amount of serialized data is (in most cases a lot) larger than protobuffed / compressed version

At least with sstable compression, I would expect the difference not to be too big in practice.

> - If you do selective updates the document will be scattered over multiple ssts plus if you do sliced reads you can't optimize reads as opposed to the single column version that when updated is automatically superseding older versions so most reads will hit only one sst

But if you need to do selective updates, then a blob just doesn't work, so that comparison is moot.

Now, I don't think anyone pretended that you should never use blobs (whether that's protobuffed, jsoned, ...). If you don't need selective updates and having something as compact as possible on disk makes an important difference for you, sure, use blobs. The only argument is that you can already do that without any change to the core.

What we are saying is that for the case where you care more about schema flexibility (being able to do selective updates, to index on some subpart, etc.), we think that something like the map and list idea of CASSANDRA-3647 will probably be a more natural fit for the current CQL API.

--
Sylvain
Re: Document storage
> If you don't need selective updates and having something as compact as possible on disk makes an important difference for you, sure, use blobs. The only argument is that you can already do that without any change to the core.

The thing that we can't do today without changes to the core is index on subparts of some document format like Protobuf/JSON/etc. If Cassandra were to understand one of these formats, it could remove the need for manual management of an index.

On Fri, Mar 30, 2012 at 10:23 AM, Sylvain Lebresne sylv...@datastax.com wrote:

> What we are saying is that for the case where you care more about schema flexibility (being able to do selective updates, to index on some subpart, etc.), we think that something like the map and list idea of CASSANDRA-3647 will probably be a more natural fit for the current CQL API.
Re: Document storage
I'm actually doing something almost the same. I serialize my objects into byte[] using Jackson's SMILE format, compress that with Snappy, then store the byte[] in Cassandra. I actually created a simple Cassandra type for this, but I hit a wall with cassandra-cli: https://issues.apache.org/jira/browse/CASSANDRA-4081 Please vote on the JIRA if you are interested.

Validation is pretty simple: you just read the value and parse it with Jackson; if you don't get any exceptions, your JSON/SMILE is valid ;)

-- Drew

On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:

> I don't imagine sort is a meaningful operation on JSON data. As long as the sorting is consistent, I would think that should be sufficient.

On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

> Some work I did stores JSON blobs in columns. The question on a JSON type is how to sort it.

On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote:

> I don't speak for the project, but you might give it a day or two for people to respond and/or perhaps create a JIRA ticket. Seems like that's a reasonable data type that would get some traction - a JSON type. However, what would validation look like? That's one of the main reasons there are the data types and validators: in order to validate on insert.

On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:

> Any thoughts? I'd like to submit a patch, but only if it will be accepted.
>
> Thanks,
> Ben

On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann b...@benmccann.com wrote:

> Hi, I was wondering if it would be interesting to add some type of document-oriented data type. I've found it somewhat awkward to store document-oriented data in Cassandra today. I can make a JSON/Protobuf/Thrift, serialize it, and store it, but Cassandra cannot differentiate it from any other string or byte array. However, if my column validation_class could be a JsonType, that would allow tools to potentially do more interesting introspection on the column value. E.g. bug 3647 (https://issues.apache.org/jira/browse/CASSANDRA-3647) calls for supporting arbitrarily nested documents in CQL. Running a query against the JSON column in Pig is possible as well, but again in this use case it would be helpful to be able to encode in column metadata that the column is stored as JSON. For debugging, running nightly reports, etc. it would be quite useful compared to the opaque string and byte array types we have today. JSON is appealing because it would be easy to implement. Something like Thrift or Protocol Buffers would actually be interesting since they would be more space efficient. However, they would also be a bit more difficult to implement because of the extra typing information they provide. I'm hoping that with Cassandra 1.0's addition of compression, storing JSON is not too inefficient. Would there be interest in adding a JsonType? I could look at putting a patch together.
>
> Thanks,
> Ben
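Drew's serialize-compress-validate pipeline can be sketched in a few lines. This is not his actual code: the standard library's json and zlib stand in for Jackson SMILE and Snappy, and the function names are made up for illustration. The validation step is exactly his suggestion: try to parse, and treat any exception as invalid.

```python
import json
import zlib

def pack(obj):
    # Serialize then compress, mirroring the SMILE + Snappy pipeline.
    return zlib.compress(json.dumps(obj).encode("utf-8"))

def unpack(blob):
    # Decompress then parse, reversing pack().
    return json.loads(zlib.decompress(blob).decode("utf-8"))

def is_valid(blob):
    # Drew's validation idea: parse the value; any exception means invalid.
    try:
        unpack(blob)
        return True
    except (zlib.error, ValueError, UnicodeDecodeError):
        return False

record = {"id": 42, "tags": ["a", "b"]}
blob = pack(record)
assert unpack(blob) == record
assert is_valid(blob)
assert not is_valid(b"not-a-compressed-payload")
```

The byte[] produced by `pack` is what would be stored in the column value.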
Re: Document storage
Sounds awesome, Drew. Mind sharing your custom type? I just wrote a basic JSON type and did the validation the same way you did, but I don't have any SMILE support yet. It seems that if your type were committed to the Cassandra codebase, then the issue you ran into of the CLI only supporting built-in types would no longer be a problem for you (though fixing the issue anyway would be good, and I voted for it).

Btw, any reason you compress with Snappy yourself instead of just setting sstable_compression to SnappyCompressor (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression) and letting Cassandra do that part?

-Ben

On Wed, Mar 28, 2012 at 11:28 PM, Drew Kutcharian d...@venarc.com wrote:

> I'm actually doing something almost the same. I serialize my objects into byte[] using Jackson's SMILE format, compress that with Snappy, then store the byte[] in Cassandra.
Re: Document storage
Is there a reason you would prefer a JsonType over CASSANDRA-3647? It would seem the only thing a JSON type offers you is validation. 3647 takes it much further by deconstructing a JSON document using composite columns to flatten the document out, with the ability to access and update portions of the document (as well as reconstruct it).

On Wed, Mar 28, 2012 at 11:58 AM, Ben McCann b...@benmccann.com wrote:

> Hi, I was wondering if it would be interesting to add some type of document-oriented data type. I've found it somewhat awkward to store document-oriented data in Cassandra today.

--
http://twitter.com/tjake
Re: Document storage
Could you explain further how I would use CASSANDRA-3647? There's still very little documentation on composite columns, and it was not clear to me whether they could be used to store document-oriented data. Say, for example, that I had a document like:

user: {
  firstName: 'ben',
  skills: ['java', 'javascript', 'html'],
  education: {
    school: 'cmu',
    major: 'computer science'
  }
}

How would I flatten this to be stored, and then reconstruct the document?

On Thu, Mar 29, 2012 at 5:44 AM, Jake Luciani jak...@gmail.com wrote:

> Is there a reason you would prefer a JsonType over CASSANDRA-3647? It would seem the only thing a JSON type offers you is validation. 3647 takes it much further by deconstructing a JSON document using composite columns to flatten the document out, with the ability to access and update portions of the document (as well as reconstruct it).
Re: Document storage
Ben,

You can create a materialized path for each field in the document:

{
  [user, firstName]: ben,
  [user, skills, TimeUUID]: java,
  [user, skills, TimeUUID]: javascript,
  [user, skills, TimeUUID]: html,
  [user, education, school]: cmu,
  [user, education, major]: computer science
}

This way each field can be updated independently, and you can take sub-document slices with queries such as "give me everything under user/skills".

Rick

On Thursday, March 29, 2012 at 7:27 AM, Ben McCann wrote:

> Could you explain further how I would use CASSANDRA-3647? There's still very little documentation on composite columns, and it was not clear to me whether they could be used to store document-oriented data.
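Rick's materialized-path layout can be sketched as a pair of pure functions, one flattening a nested document into (composite path, value) columns and one rebuilding it. This is an illustration, not code from the thread: a list index stands in for the TimeUUID component so the example stays deterministic, and the function names are made up.

```python
def flatten(doc, path=()):
    # Flatten a nested document into (composite_path, value) pairs.
    cols = []
    if isinstance(doc, dict):
        for key, value in doc.items():
            cols.extend(flatten(value, path + (key,)))
    elif isinstance(doc, list):
        # Rick uses a TimeUUID per list element; a list index stands in here.
        for i, value in enumerate(doc):
            cols.extend(flatten(value, path + (i,)))
    else:
        cols.append((path, doc))
    return cols

def unflatten(cols):
    # Rebuild the nested document from its composite columns.
    root = {}
    for path, value in cols:
        node = root
        for part, _ in zip(path, path[1:]):
            node = node.setdefault(part, {})
        node[path[-1]] = value

    def fix(node):
        # Convert integer-keyed dicts (former lists) back into lists.
        if isinstance(node, dict):
            if node and all(isinstance(k, int) for k in node):
                return [fix(node[k]) for k in sorted(node)]
            return {k: fix(v) for k, v in node.items()}
        return node

    return fix(root)

user = {
    "firstName": "ben",
    "skills": ["java", "javascript", "html"],
    "education": {"school": "cmu", "major": "computer science"},
}
cols = flatten(user)
assert unflatten(cols) == user
# Sub-document slice: everything under user/skills.
skills = [v for path, v in cols if path[0] == "skills"]
```

The slice at the end is the analogue of a column-range read over the composite prefix; each individual column could likewise be overwritten without touching the rest of the document.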
RE: Document storage
It's not clear what 3647 actually is; there is no code attached and no real example in it.

Aside from that, the reason this would be useful to me (if we could get indexing of attributes working) is that I already have my data in JSON/Thrift/Protobuf. Depending on how large the data is, it isn't trivial to break it up into columns to insert and re-assemble from columns to read. Also, until we get multiple slice range reads, I can't read two different structures out of one row without getting all the other stuff between them, unless there are only two columns and I read them using column names, not slices.

As it is right now, I have to maintain custom indexes on all my attributes to be able to put Protobufs into columns and get some searching on them. It would be nice if I could drop all my custom indexing code and just tell Cassandra, "hey, index column.attr1.subattr2".

-Jeremiah

From: Jake Luciani [jak...@gmail.com]
Sent: Thursday, March 29, 2012 7:44 AM
To: dev@cassandra.apache.org
Subject: Re: Document storage

> Is there a reason you would prefer a JsonType over CASSANDRA-3647? It would seem the only thing a JSON type offers you is validation.
Re: Document storage
> Would there be interest in adding a JsonType?

What about checking that data inserted into a JsonType is valid JSON? How would you do it, and would the overhead be something we are concerned about, especially if the JSON string is large?
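Tyler's two questions, how to validate and what it costs, can be sketched with the standard-library parser. This is illustrative only: `json.loads` stands in for whatever parser a real JsonType validator would use, and the payload is a made-up large-ish document to get a rough feel for the overhead.

```python
import json
import time

def validate_json(raw_bytes):
    # Accept the insert only if the payload parses as JSON.
    try:
        json.loads(raw_bytes.decode("utf-8"))
        return True
    except (ValueError, UnicodeDecodeError):
        return False

# A deliberately large-ish payload, to probe validation cost.
payload = json.dumps({"items": list(range(10000))}).encode("utf-8")

start = time.perf_counter()
ok = validate_json(payload)
elapsed = time.perf_counter() - start

assert ok
assert not validate_json(b"{truncated")
print(f"validated {len(payload)} bytes in {elapsed * 1000:.2f} ms")
```

The cost scales with payload size, since validation is a full parse; whether that overhead matters on the insert path is exactly the concern Tyler raises.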
Re: Document storage
Creating materialized paths may well be a possible solution. If that were the solution the community agreed upon, then I would like it to be a standardized and well-documented best practice. I asked how to store a list of values on the user list (http://www.mail-archive.com/user@cassandra.apache.org/msg21274.html) and no one suggested [fieldName, TimeUUID]: fieldValue. It would be a huge pain right now to create materialized paths like this for each of my objects, so client library support would definitely be needed. And the client libraries should agree: if Astyanax and lazyboy both add support for materialized paths and I write an object to Cassandra with Astyanax, then I should be able to read it back with lazyboy. The benefit of using JSON/SMILE is that it's very clear that there's exactly one way to serialize and deserialize the data, and it's very easy. It's not clear to me that this is true for materialized paths.

On Thu, Mar 29, 2012 at 8:21 AM, Tyler Patterson tpatter...@datastax.com wrote:

> What about checking that data inserted into a JsonType is valid JSON? How would you do it, and would the overhead be something we are concerned about, especially if the JSON string is large?
Re: Document storage
The issue with these super complex types is that to do anything useful with them, you would need either scanners or coprocessors. As it stands right now, complex data like JSON is fairly opaque to Cassandra. Getting Cassandra to natively speak protobufs or whatever flavor-of-the-week serialization framework is hip right now would make the codebase very large. How is that field sorted? How is it indexed? This is starting to go very far against the schema-less NoSQL grain. Where does this end? Users wanting to store binary XML, index it, and feed Cassandra XPath queries?
Re: Document storage
On Thu, Mar 29, 2012 at 9:57 AM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote: It's not clear what 3647 actually is; there is no code attached, and no real example in it. Aside from that, the reason this would be useful to me (if we could get indexing of attributes working) is that I already have my data in JSON/Thrift/ProtoBuf. Depending how large the data is, it isn't trivial to break it up into columns to insert, and re-assemble into columns to read.

I don't understand the problem. Assuming Cassandra support for maps and lists, I could write a Python module that takes json (or thrift, or protobuf) objects and splits them into Cassandra rows by fields in a couple hours. I'm pretty sure this is essentially what Brian's REST api for Cassandra does now. I think this is a much better approach because that gives you the ability to update or retrieve just parts of objects efficiently, rather than making column values just blobs with a bunch of special case logic to introspect them. Which feels like a big step backwards to me.

-- Jonathan Ellis, Project Chair, Apache Cassandra, co-founder of DataStax, the source for professional Cassandra support, http://www.datastax.com
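[Editor's sketch: the "couple hours" Python module Jonathan describes might look roughly like the following. The dotted-path column naming and the function name are illustrative assumptions, not an agreed convention.]

```python
import json

def flatten(doc, prefix=""):
    # Split a decoded JSON object into (column_name, value) pairs,
    # joining nested field names with "." -- one column per leaf field.
    cols = []
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            cols.extend(flatten(value, name + "."))
        else:
            cols.append((name, value))
    return cols

doc = json.loads('{"name": "ben", "address": {"city": "SF", "zip": "94103"}}')
print(flatten(doc))
# -> [('name', 'ben'), ('address.city', 'SF'), ('address.zip', '94103')]
```

Each pair would then become a named column in the object's row, which is what makes partial reads and updates possible.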
RE: Document storage
But it isn't special case logic. The current AbstractType and indexing of abstract types would, for the most part, already support this; someone just has to write the code for a JSONType or ProtoBufType. The problem isn't writing the code to break objects up, the problem is encode/decode time. Encode/decode to Thrift is already a significant portion of the write timeline, and adding object-to-column encode/decode on top of that makes it even longer. For a read-heavy load that wants the JSON/proto served to clients as-is, an increase in the write timeline to parse/index the blob is probably acceptable, so that you don't pay the re-assembly penalty every time you hit the database for that object. But once we get multi-range slicing, I think the break-it-up-into-multiple-columns approach will be best for most people.

That is the other problem I have with doing the break-into-columns thing right now: I either have to use Super Columns and lose indexing (so why did I break them up?), or I can't get multiple objects at once without pulling a huge slice from o1 start to o5 end and then throwing away the majority of the data I pulled back that doesn't belong to o1 and o5.

-Jeremiah
Re: Document storage
Hi Ben, Sure, there's nothing really to it, but I'll email it to you. As for why I'm using Snappy on the type instead of sstable_compression: when you set sstable_compression, the compression happens on the Cassandra nodes, and I see two advantages with my approach:
1. Saving extra CPU usage on the Cassandra nodes, since compression/decompression can easily be done on the client nodes, where there is plenty of idle CPU time.
2. Saving network bandwidth, since you're sending over a compressed byte[].
One thing to note about my approach is that when I define the schema in Cassandra, I define the columns as byte[] and not my custom type, and I do all the conversion on the client side. -- Drew

On Mar 29, 2012, at 12:04 AM, Ben McCann wrote: Sounds awesome Drew. Mind sharing your custom type? I just wrote a basic JSON type and did the validation the same way you did, but I don't have any SMILE support yet. It seems that if your type were committed to the Cassandra codebase, then the issue you ran into of the CLI only supporting built-in types would no longer be a problem for you (though fixing the issue anyway would be good, and I voted for it). Btw, any reason you compress it with Snappy yourself instead of just setting sstable_compression to SnappyCompressor (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression) and letting Cassandra do that part? -Ben

On Wed, Mar 28, 2012 at 11:28 PM, Drew Kutcharian d...@venarc.com wrote: I'm actually doing something almost the same. I serialize my objects into byte[] using Jackson's SMILE format, then compress it using Snappy, then store the byte[] in Cassandra. I actually created a simple Cassandra type for this, but I hit a wall with cassandra-cli: https://issues.apache.org/jira/browse/CASSANDRA-4081 Please vote on the JIRA if you are interested.
Validation is pretty simple: you just need to read the value and parse it using Jackson; if you don't get any exceptions, your JSON/SMILE is valid ;) -- Drew

On Mar 28, 2012, at 9:28 PM, Ben McCann wrote: I don't imagine sort is a meaningful operation on JSON data. As long as the sorting is consistent, I would think that should be sufficient.

On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Some work I did stores JSON blobs in columns. The question on a JSON type is how to sort it.

On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: I don't speak for the project, but you might give it a day or two for people to respond and/or perhaps create a jira ticket. Seems like that's a reasonable data type that would get some traction - a json type. However, what would validation look like? That's one of the main reasons there are the data types and validators, in order to validate on insert.
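[Editor's sketch: the parse-to-validate approach Drew describes (Jackson in his Java setup) is shown here with Python's stdlib json module standing in for Jackson; the function name is hypothetical.]

```python
import json

def validate_json(raw: bytes) -> bool:
    # Validate by parsing: if the parser raises, reject the value.
    # This is the whole trick -- no schema needed, just a well-formedness check.
    try:
        json.loads(raw)
        return True
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False

print(validate_json(b'{"a": 1}'))   # well-formed
print(validate_json(b'{"a": '))     # truncated, rejected
```

The overhead question from the thread remains: this is a full parse of the value on every insert, so cost grows linearly with document size.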
Re: Document storage
I agree with Edward here: the simpler we keep the core, the better. I think all the ser/deser and conversions should happen on the client side. -- Drew
Re: Document storage
I think this is a much better approach because that gives you the ability to update or retrieve just parts of objects efficiently, rather than making column values just blobs with a bunch of special case logic to introspect them. Which feels like a big step backwards to me.

Unless your access pattern involves reading/writing the whole document each time. In that case you're better off serializing the whole document and storing it in a column as a byte[] without incurring the overhead of column indexes. Right?
Re: Document storage
On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com wrote: Unless your access pattern involves reading/writing the whole document each time. In that case you're better off serializing the whole document and storing it in a column as a byte[] without incurring the overhead of column indexes. Right?

Hmm, not sure what you're thinking of there. If you mean the index that's part of the row header for random access within a row, then no, serializing to byte[] doesn't save you anything. If you mean secondary indexes, don't declare any if you don't want any. :)

Just telling C* to store a byte[] *will* be slightly lighter-weight than giving it named columns, but we're talking negligible compared to the overhead of actually moving the data on or off disk in the first place. Not even close to being worth giving up being able to deal with your data from standard tools like cqlsh, IMO.
Re: Document storage
Yes, I meant the row header index. What I have done is that I'm storing an object (i.e. a UserProfile) where you read or write it as a whole (a user updates their user details in a single page in the UI). So I serialize that object into binary JSON using the SMILE format, then compress it using Snappy on the client side. As far as Cassandra cares, it's storing a byte[]. On the client side, I'm using cassandra-cli with a custom type that knows how to turn a byte[] into JSON text and back. The only issue was CASSANDRA-4081, where assume doesn't work with custom types. If CASSANDRA-4081 gets fixed, I'll get the best of both worlds.

Also, the advantages of this vs. the Thrift-based Super Column families are:
1. Saving extra CPU usage on the Cassandra nodes, since serialize/deserialize and compression/decompression happen on the client nodes, where there is plenty of idle CPU time.
2. Saving network bandwidth, since I'm sending over a compressed byte[].

-- Drew
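[Editor's sketch: a stdlib-only version of the client-side pipeline Drew describes, with json and zlib standing in for the SMILE and Snappy he actually uses; function names are illustrative.]

```python
import json
import zlib

# Whole-document pipeline: serialize, then compress, on the client.
# Cassandra only ever sees the resulting opaque byte[] in one column.

def to_column_value(doc: dict) -> bytes:
    return zlib.compress(json.dumps(doc).encode("utf-8"))

def from_column_value(blob: bytes) -> dict:
    return json.loads(zlib.decompress(blob).decode("utf-8"))

profile = {"user": "drew", "email": "d...@venarc.com"}
blob = to_column_value(profile)          # what gets written to Cassandra
assert from_column_value(blob) == profile  # round-trips on read
```

The trade-off the thread debates is visible here: one column means one read/write for the whole document, but no way to fetch or update a single field server-side.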
Re: Document storage
Jonathan, I asked Brian about his REST API (https://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas9C8Us) and he said he does not take the json objects and split them, because the client libraries do not agree on implementations. This was exactly my concern as well with this solution. I would be perfectly happy to do it this way instead of using JSON if it were standardized. The reason I suggested JSON is that it is standardized. As far as I can tell, Cassandra doesn't support maps and lists in a standardized way today, which is the root of my problem. -Ben
Re: Document storage
On Thu, Mar 29, 2012 at 2:06 PM, Ben McCann b...@benmccann.com wrote: As far as I can tell, Cassandra doesn't support maps and lists in a standardized way today, which is the root of my problem.

I'm pretty serious about adding those for 1.2, for what that's worth. (If you want to jump in and help code that up, so much the better.)
Re: Document storage
Jonathan, I was actually going to take this up with Nate McCall a few weeks back. I think it might make sense to get the client development community together (Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.). I agree whole-heartedly that it shouldn't go into the database, for all the reasons you point out. If we can all decide on some standards for data storage (e.g. composite types), indexing strategies, etc., we can provide higher-level functions through the client libraries and also provide interoperability between them (without bloating Cassandra). CCing Nate. Nate, thoughts? I wouldn't mind coordinating/facilitating the conversation if we know who should be involved. -brian

Brian O'Neill, Lead Architect, Software Development, Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406, p: 215.588.6024, blog: http://weblogs.java.net/blog/boneill42/, blog: http://brianoneill.blogspot.com/
Re: Document storage
Thanks Jonathan. The only reason I suggested JSON was because it already has support for lists. Native support for lists in Cassandra would more than satisfy me. Are there any existing proposals or a bug I can follow? I'm not familiar with the Cassandra codebase, so I'm not entirely sure how helpful I can be, but I'd certainly be interested in taking a look to see what's required. -Ben
Re: Document storage
Jonathan, We store JSON as our column values. I'd love to see support for maps and lists. If I get some time this weekend, I'll take a look to see what is required. It doesn't seem like it would be that hard. -brian

On 3/29/12 3:18 PM, Jonathan Ellis jbel...@gmail.com wrote: I'm pretty serious about adding those for 1.2, for what that's worth. (If you want to jump in and help code that up, so much the better.)
Re: Document storage
I kind of hijacked https://issues.apache.org/jira/browse/CASSANDRA-3647 (Sylvain suggests we start with (non-nested) lists, maps, and sets; I agree that this is a great 80/20 approach to the problem), but we could split it out to another ticket.

On Thu, Mar 29, 2012 at 2:24 PM, Ben McCann b...@benmccann.com wrote:

Thanks Jonathan. The only reason I suggested JSON was that it already has support for lists. Native support for lists in Cassandra would more than satisfy me. Are there any existing proposals or a bug I can follow? I'm not familiar with the Cassandra codebase, so I'm not entirely sure how helpful I can be, but I'd certainly be interested in taking a look to see what's required.

-Ben

On Thu, Mar 29, 2012 at 12:19 PM, Brian O'Neill b...@alumni.brown.edu wrote:

Jonathan, I was actually going to take this up with Nate McCall a few weeks back. I think it might make sense to get the client development community together (Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.). I agree wholeheartedly that it shouldn't go into the database, for all the reasons you point out. If we can all decide on some standards for data storage (e.g. composite types), indexing strategies, etc., we can provide higher-level functions through the client libraries and also provide interoperability between them (without bloating Cassandra). CCing Nate. Nate, thoughts? I wouldn't mind coordinating/facilitating the conversation if we know who should be involved.

-brian

Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/

On 3/29/12 3:06 PM, Ben McCann b...@benmccann.com wrote:

Jonathan, I asked Brian about his REST API (https://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas9C8Us) and he said he does not take the JSON objects and split them, because the client libraries do not agree on implementations. This was exactly my concern with this solution as well. I would be perfectly happy to do it this way instead of using JSON if it were standardized; the reason I suggested JSON is that it is standardized. As far as I can tell, Cassandra doesn't support maps and lists in a standardized way today, which is the root of my problem.

-Ben

On Thu, Mar 29, 2012 at 11:30 AM, Drew Kutcharian d...@venarc.com wrote:

Yes, I meant the row header index. What I have done is that I'm storing an object (e.g. a UserProfile) which you read or write as a whole (a user updates their user details in a single page in the UI). So I serialize that object into binary JSON using the SMILE format, then compress it using Snappy on the client side, so as far as Cassandra cares it's storing a byte[]. On the client side, I'm using cassandra-cli with a custom type that knows how to turn a byte[] into JSON text and back. The only issue was CASSANDRA-4081, where "assume" doesn't work with custom types. If CASSANDRA-4081 gets fixed, I'll get the best of both worlds. The advantages of this vs. the Thrift-based super column families are: 1. Saving extra CPU usage on the Cassandra nodes, since serialization/deserialization and compression/decompression happen on the client nodes, where there is plenty of idle CPU time. 2. Saving network bandwidth, since I'm sending a compressed byte[] over the wire.

-- Drew
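Drew's client-side pipeline (serialize the whole document, compress it, and hand Cassandra an opaque byte[]) can be sketched as follows. This is an illustrative stand-in, not his actual code: stdlib json replaces SMILE and zlib replaces Snappy, since both substitutes ship with Python; the function names are ours.

```python
import json
import zlib


def serialize_profile(profile: dict) -> bytes:
    """Serialize then compress on the client, so Cassandra only ever
    stores an opaque byte[].  (json/zlib stand in for SMILE/Snappy.)"""
    return zlib.compress(json.dumps(profile, sort_keys=True).encode("utf-8"))


def deserialize_profile(blob: bytes) -> dict:
    """Reverse the pipeline: decompress, then parse back into an object."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

The trade-off is exactly the one debated above: CPU and bandwidth are saved, but the stored value is opaque to the database and to tools like cqlsh.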
Re: Document storage
Cool. How were you thinking we should store the data? As a standardized composite column (e.g., potentially a list as [fieldName, TimeUUID]: fieldValue, and a set as [fieldName, fieldValue]:)? Or as a new column type?

On Thu, Mar 29, 2012 at 12:35 PM, Jonathan Ellis jbel...@gmail.com wrote:

I kind of hijacked https://issues.apache.org/jira/browse/CASSANDRA-3647 (Sylvain suggests we start with (non-nested) lists, maps, and sets; I agree that this is a great 80/20 approach to the problem), but we could split it out to another ticket.
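Ben's bracket notation above (a list element stored as a [fieldName, TimeUUID]: fieldValue composite column) can be sketched as follows. This is one possible reading of his proposal, not an agreed standard; the column is modeled as a ((field, uuid), value) tuple, and the time-based UUID component gives list elements a stable insertion order.

```python
import uuid


def list_append_column(field_name: str, value: str):
    """Encode one list element as a composite column: the column *name*
    is the (field_name, TimeUUID) pair, the column *value* is the element."""
    return (field_name, uuid.uuid1()), value


def read_list(columns, field_name: str):
    """Slice one field's elements out of a row's composite columns,
    ordered by the TimeUUID component (i.e., insertion order)."""
    ordered = sorted(columns, key=lambda col: col[0][1].time)
    return [value for (name, ts), value in ordered if name == field_name]
```

Because the composite name is unique per element, appending never overwrites an earlier element, which is what makes this layout behave like a list rather than a single overwritten blob.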
Document storage
Hi, I was wondering if it would be interesting to add some type of document-oriented data type. I've found it somewhat awkward to store document-oriented data in Cassandra today. I can build a JSON/Protobuf/Thrift object, serialize it, and store it, but Cassandra cannot differentiate it from any other string or byte array. However, if my column's validation_class could be a JsonType, that would allow tools to potentially do more interesting introspection on the column value. E.g., bug 3647 (https://issues.apache.org/jira/browse/CASSANDRA-3647) calls for supporting arbitrarily nested documents in CQL. Running a query against the JSON column in Pig is possible as well, but again, in this use case it would be helpful to be able to encode in the column metadata that the column is stored as JSON. For debugging, running nightly reports, etc., it would be quite useful compared to the opaque string and byte array types we have today.

JSON is appealing because it would be easy to implement. Something like Thrift or Protocol Buffers would also be interesting, since they would be more space efficient; however, they would be a bit more difficult to implement because of the extra typing information they provide. I'm hoping that with Cassandra 1.0's addition of compression, storing JSON is not too inefficient.

Would there be interest in adding a JsonType? I could look at putting a patch together.

Thanks,
Ben
Re: Document storage
Any thoughts? I'd like to submit a patch, but only if it will be accepted.

Thanks,
Ben
Re: Document storage
I don't speak for the project, but you might give it a day or two for people to respond, and/or perhaps create a JIRA ticket. A JSON type seems like a reasonable data type that would get some traction. However, what would validation look like? That's one of the main reasons the data types and validators exist: to validate on insert.
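To Jeremy's question, the natural answer is that a JsonType's validator would simply reject any value that is not well-formed UTF-8 JSON. Cassandra's real validators are Java AbstractType subclasses; the sketch below only illustrates the validation logic itself, in stdlib Python, with a function name of our own choosing.

```python
import json


def validate_json_column(value: bytes) -> None:
    """What a JsonType's validate-on-insert could do: accept well-formed
    UTF-8 JSON, raise on anything else.  Returns None on success, mirroring
    the validate-or-throw convention of Cassandra's column validators."""
    try:
        json.loads(value.decode("utf-8"))
    except (UnicodeDecodeError, ValueError) as exc:
        raise ValueError("invalid JSON for JsonType column: %s" % exc)
```

Validation this cheap happens once per write, which is negligible next to the cost of the write itself, yet it is what lets tools like cqlsh and Pig trust the column's contents later.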
Re: Document storage
Sounds interesting to me. I looked into adding protocol buffer support at one point, and it didn't look like it would be too much work. The tricky part was that I also wanted to add indexing support for attributes of the inserted protocol buffers. That looked a little trickier, but still not impossible. Other stuff came up, though, and I never got around to actually writing any code. JSON support would be nice, especially if you figured out how to get built-in indexing of the attributes inside the JSON to work =).

-Jeremiah
Re: Document storage
On Wed, Mar 28, 2012 at 6:59 PM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote:

JSON support would be nice, especially if you figured out how to get built-in indexing of the attributes inside the JSON to work =).

Also, for whatever it's worth, it should be trivial to add support for Smile, the binary JSON serialization (http://wiki.fasterxml.com/SmileFormatSpec), since its logical data structure is pure JSON, with no extensions or subsetting. The main Java implementation is by the Jackson project, but there is also a C codec (https://github.com/pierre/libsmile) and prototypes for PHP and Ruby bindings as well. For all data it's a bit faster and a bit more compact: about 30% for individual items, and more (40-70%) for data sequences, due to optional back-referencing. JSON and Smile can be auto-detected from the first 4 bytes or so, reliably and efficiently, so one should be able to add this either transparently or explicitly. One could even transcode things on the fly: store as Smile, expose filtered results as JSON (and accept JSON, or both). This could reduce storage cost while keeping the benefits of a flexible data format.

-+ Tatu +-
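The auto-detection Tatu describes works because Smile payloads start with a fixed header: the bytes ':' ')' '\n' followed by a version/flags byte, whereas JSON text starts (after optional whitespace) with a small set of characters. A minimal sketch of that heuristic, with a function name of our own:

```python
def detect_format(data: bytes) -> str:
    """Guess whether a payload is Smile or JSON from its leading bytes."""
    # Smile's 4-byte header begins with ":)\n" (0x3A 0x29 0x0A), then a
    # version/flags byte -- a sequence that cannot begin a JSON document.
    if len(data) >= 4 and data[:3] == b":)\n":
        return "smile"
    # JSON text starts, after optional whitespace, with one of a few
    # characters: an object, array, string, or number.
    first = data.lstrip(b" \t\r\n")[:1]
    if first in (b"{", b"[", b'"', b"-") or first.isdigit():
        return "json"
    return "unknown"
```

This is what makes the transparent transcoding idea practical: a server (or client library) can accept either encoding on the same column without any out-of-band flag.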
Re: Document storage
Some work I did stores JSON blobs in columns. The question for a JSON type is how to sort it.
Re: Document storage
I don't imagine sort is a meaningful operation on JSON data. As long as the sorting is consistent, I would think that should be sufficient.

On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

Some work I did stores JSON blobs in columns. The question for a JSON type is how to sort it.
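Ben's "consistent, even if not meaningful" ordering can be obtained by comparing canonical serializations: re-serialize each document with sorted keys and fixed separators, then compare the resulting bytes. A sketch under that assumption (the helper names are ours, and key order is the only JSON equivalence this canonicalization normalizes):

```python
import json


def canonical_bytes(doc: str) -> bytes:
    """Deterministic re-serialization: sorted keys, fixed separators.
    Documents that differ only in key order or whitespace map to the
    same byte string."""
    return json.dumps(json.loads(doc),
                      sort_keys=True, separators=(",", ":")).encode("utf-8")


def compare_json(a: str, b: str) -> int:
    """Total, consistent ordering over JSON documents: -1, 0, or 1."""
    ca, cb = canonical_bytes(a), canonical_bytes(b)
    return (ca > cb) - (ca < cb)
```

The resulting order is lexicographic over canonical bytes, which is arbitrary as a semantic ordering, but it is total and stable, which is all a comparator-driven store needs.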