Re: OoM querying very wide-row in CLI
Sorry, I didn't realize we weren't hip to pulls yet. I created a JIRA and attached the patch.
https://issues.apache.org/jira/browse/CASSANDRA-4098

-brian

On Tue, Mar 27, 2012 at 10:42 PM, Brian O'Neill b...@alumni.brown.edu wrote:

Here she is: https://github.com/apache/cassandra/pull/8
Verified functionally with the attached data script.

-brian

On Tue, Mar 27, 2012 at 9:49 PM, Brian O'Neill b...@alumni.brown.edu wrote:

10-4. I'll see if I can track it down and submit a pull request that specifies a default if one does not exist.

-brian

Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/

On 3/27/12 9:45 PM, Jonathan Ellis jbel...@gmail.com wrote:

I believe we added support for specifying a column range to the cli recently. I don't know if there is a default limit.

On Tue, Mar 27, 2012 at 8:40 PM, Brian O'Neill b...@alumni.brown.edu wrote:

Today, running 1.0.7, we saw a node crash with an OutOfMemory. We have a single row with ~10 million columns in it (we use it as an index). We accidentally attempted to list the CF that had the wide row in the CLI. This caused the CLI to hang and eventually crashed Cassandra with an OoM.

I know this is a case of "If it hurts when you do that, don't do that," but we may want to better protect against it in the CLI and/or the DB. I know we limit row counts on lists in the CLI. Do we also limit column counts? If not, I don't mind submitting a patch for this.

let me know,
brian

--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
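The idea discussed in this thread, falling back to a default column limit when the caller does not specify one, can be sketched roughly as below. This is only an illustration in Python, not the code from the attached patch: `DEFAULT_COLUMN_LIMIT`, `list_columns`, and `wide_row` are hypothetical names, and the value 100 is an assumption, not the CLI's actual default.

```python
from itertools import islice

# Hypothetical default; the real CLI default (if any) may differ.
DEFAULT_COLUMN_LIMIT = 100


def list_columns(columns, limit=None):
    """Return at most `limit` columns, using a default when none is given."""
    effective = DEFAULT_COLUMN_LIMIT if limit is None else limit
    if effective < 0:
        raise ValueError("limit must be non-negative")
    # islice stops pulling from the iterator once the limit is reached,
    # so a very wide row is never materialized in full.
    return list(islice(columns, effective))


def wide_row(n):
    """Lazily generate n (name, value) column pairs, standing in for a wide row."""
    for i in range(n):
        yield ("col%08d" % i, b"")
```

With this shape, `list_columns(wide_row(10_000_000))` only ever reads the first 100 columns, which is the protection the original report asks about.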
Document storage
Hi,

I was wondering if it would be interesting to add some type of document-oriented data type.

I've found it somewhat awkward to store document-oriented data in Cassandra today. I can make a JSON/Protobuf/Thrift, serialize it, and store it, but Cassandra cannot differentiate it from any other string or byte array. However, if my column validation_class could be a JsonType that would allow tools to potentially do more interesting introspection on the column value. E.g. bug 3647 https://issues.apache.org/jira/browse/CASSANDRA-3647 calls for supporting arbitrarily nested documents in CQL. Running a query against the JSON column in Pig is possible as well, but again in this use case it would be helpful to be able to encode in column metadata that the column is stored as JSON. For debugging, running nightly reports, etc. it would be quite useful compared to the opaque string and byte array types we have today.

JSON is appealing because it would be easy to implement. Something like Thrift or Protocol Buffers would actually be interesting since they would be more space efficient. However, they would also be a bit more difficult to implement because of the extra typing information they provide. I'm hoping with Cassandra 1.0's addition of compression that storing JSON is not too inefficient.

Would there be interest in adding a JsonType? I could look at putting a patch together.

Thanks,
Ben
Re: Document storage
Any thoughts? I'd like to submit a patch, but only if it will be accepted.

Thanks,
Ben

On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann b...@benmccann.com wrote:

Hi, I was wondering if it would be interesting to add some type of document-oriented data type. [...]
Re: Document storage
I don't speak for the project, but you might give it a day or two for people to respond and/or perhaps create a jira ticket. Seems like that's a reasonable data type that would get some traction - a json type. However, what would validation look like? That's one of the main reasons there are the data types and validators, in order to validate on insert.

On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:

Any thoughts? I'd like to submit a patch, but only if it will be accepted.

Thanks,
Ben
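On the validation question raised above, one plausible answer is simply "the bytes must parse as a JSON document". A minimal sketch of that check in Python follows; `validate_json` and `InvalidJsonError` are hypothetical names for illustration and do not correspond to any existing Cassandra validator.

```python
import json


class InvalidJsonError(ValueError):
    """Raised when a column value is not a well-formed JSON document."""


def _reject_constant(name):
    # The json module accepts NaN/Infinity by default; strict JSON does not.
    raise ValueError("non-standard JSON constant: " + name)


def validate_json(value: bytes) -> None:
    """Validate on insert: succeed silently or raise InvalidJsonError."""
    try:
        json.loads(value.decode("utf-8"), parse_constant=_reject_constant)
    except ValueError as exc:
        # UnicodeDecodeError and json.JSONDecodeError are both ValueErrors.
        raise InvalidJsonError(str(exc)) from exc
```

This only checks well-formedness. Whether a type should also enforce a schema is a separate design question that the thread leaves open.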
Re: Document storage
Sounds interesting to me. I looked into adding protocol buffer support at one point, and it didn't look like it would be too much work. The tricky part was I also wanted to add indexing support for attributes of the inserted protocol buffers. That looked a little trickier, but still not impossible. Though other stuff came up and I never got around to actually writing any code. JSON support would be nice, especially if you figured out how to get built in indexing of the attributes inside the JSON to work =).

-Jeremiah

On Mar 28, 2012, at 10:58 AM, Ben McCann wrote:

Hi, I was wondering if it would be interesting to add some type of document-oriented data type. [...]
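To make the "indexing of the attributes inside the JSON" idea concrete, here is a rough in-memory sketch in Python. It is an assumption about one possible shape, not how Cassandra's secondary indexes work: `JsonAttributeIndex`, `on_insert`, and `lookup` are hypothetical names, only top-level scalar attributes are indexed, and updates and deletes are ignored.

```python
import json
from collections import defaultdict


class JsonAttributeIndex:
    """Maps the value of one top-level JSON attribute to the row keys holding it."""

    def __init__(self, attribute):
        self.attribute = attribute
        self.entries = defaultdict(set)  # attribute value -> set of row keys

    def on_insert(self, row_key, document: bytes):
        """Parse the stored document and record an index entry if possible."""
        doc = json.loads(document)
        if not isinstance(doc, dict) or self.attribute not in doc:
            return
        value = doc[self.attribute]
        # Only scalar values are indexed; nested objects and arrays are skipped.
        if isinstance(value, (str, int, float, bool)):
            self.entries[value].add(row_key)

    def lookup(self, value):
        """Return the row keys whose document has attribute == value."""
        return sorted(self.entries.get(value, ()))
```

The point of the sketch is that the store has to know the value is JSON before it can extract attributes at write time, which is what a dedicated type would make possible.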
Re: Document storage
On Wed, Mar 28, 2012 at 6:59 PM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote:

Sounds interesting to me. I looked into adding protocol buffer support at one point, and it didn't look like it would be too much work. The tricky part was I also wanted to add indexing support for attributes of the inserted protocol buffers. That looked a little trickier, but still not impossible. Though other stuff came up and I never got around to actually writing any code. JSON support would be nice, especially if you figured out how to get built in indexing of the attributes inside the JSON to work =).

Also, for whatever it's worth, it should be trivial to add support for Smile (binary JSON serialization):

http://wiki.fasterxml.com/SmileFormatSpec

since its logical data structure is pure JSON, with no extensions or subsetting. The main Java impl is by the Jackson project, but there is also a C codec (https://github.com/pierre/libsmile), and prototypes for PHP and Ruby bindings as well.

For all data it's a bit faster and a bit more compact: about 30% for individual items, but more (40 - 70%) for data sequences (due to optional back-referencing).

JSON and Smile can be auto-detected from the first 4 bytes or so, reliably and efficiently, so one should be able to add this either transparently or explicitly. One could even transcode things on the fly -- store as Smile, expose filtered results as JSON (and accept JSON or both). This could reduce storage cost while keeping the benefits of a flexible data format.

-+ Tatu +-
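The auto-detection mentioned above can be sketched as follows. This assumes the writer emits the Smile header, which per the Smile format spec is the three bytes `:)\n` followed by a fourth version/flags byte; since `:` can never start a JSON value, the two formats do not collide. `detect_format` and its return values are hypothetical names, and the sketch is in Python rather than Jackson's Java API.

```python
# First three bytes of the 4-byte Smile header; the 4th byte carries version/flags.
SMILE_MAGIC = b":)\n"
# Bytes that can begin a JSON value: object, array, string, number, true/false/null.
JSON_START = frozenset(b'{["-0123456789tfn')
JSON_WHITESPACE = b" \t\r\n"


def detect_format(payload: bytes) -> str:
    """Guess whether a stored value is Smile or textual JSON from its first bytes."""
    if len(payload) >= 4 and payload.startswith(SMILE_MAGIC):
        return "smile"
    stripped = payload.lstrip(JSON_WHITESPACE)
    if stripped and stripped[0] in JSON_START:
        return "json"
    return "unknown"
```

A "json" result here only means the value looks like JSON; it still needs a real parse before being trusted.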
Re: Document storage
Some work I did stores JSON blobs in columns. The question on JSON type is how to sort it.

On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote:

I don't speak for the project, but you might give it a day or two for people to respond and/or perhaps create a jira ticket. Seems like that's a reasonable data type that would get some traction - a json type. However, what would validation look like? That's one of the main reasons there are the data types and validators, in order to validate on insert.
Re: Document storage
I don't imagine sort is a meaningful operation on JSON data. As long as the sorting is consistent I would think that should be sufficient.

On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

Some work I did stores JSON blobs in columns. The question on JSON type is how to sort it.
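One way to get the "consistent" ordering described above is to compare a canonical serialization of each document rather than the raw bytes, so that key order and whitespace do not affect the result. The Python sketch below is an assumption about how that might look; `canonical_bytes` and `compare` are hypothetical names, and the resulting order is arbitrary but stable, not semantically meaningful.

```python
import json


def canonical_bytes(document: bytes) -> bytes:
    """Re-serialize JSON with sorted keys and no insignificant whitespace."""
    obj = json.loads(document)
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def compare(a: bytes, b: bytes) -> int:
    """Return -1, 0, or 1 by comparing the canonical byte forms of two documents."""
    ca, cb = canonical_bytes(a), canonical_bytes(b)
    return (ca > cb) - (ca < cb)
```

Two documents that differ only in formatting compare as equal under this scheme, which a plain byte comparison would not guarantee.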