Re: OoM querying very wide-row in CLI

2012-03-28 Thread Brian O'Neill
Sorry, I didn't realize we weren't hip to pulls yet.

I created a JIRA and attached the patch.
https://issues.apache.org/jira/browse/CASSANDRA-4098

-brian
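
As an illustration of the fix discussed in this thread (fall back to a default column limit when the user specifies none), here is a minimal Python sketch. It is not the patch attached to CASSANDRA-4098; `DEFAULT_COLUMN_LIMIT`, `list_columns`, and `wide_row` are hypothetical names, and the row is an in-memory stand-in rather than a real column family.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, Tuple

DEFAULT_COLUMN_LIMIT = 100  # hypothetical default, not the value used in the patch


def list_columns(
    columns: Iterable[Tuple[str, bytes]],
    limit: Optional[int] = None,
) -> Iterator[Tuple[str, bytes]]:
    """Return at most `limit` columns, using a default when none is given."""
    effective = DEFAULT_COLUMN_LIMIT if limit is None else limit
    if effective < 0:
        raise ValueError("limit must be non-negative")
    # islice stops pulling from the source after `effective` items, so a
    # very wide row is never fully materialized in memory.
    return islice(columns, effective)


def wide_row(n: int) -> Iterator[Tuple[str, bytes]]:
    """Lazily generate a stand-in wide row of n columns."""
    return ((f"col{i:08d}", b"x") for i in range(n))
```

Listing `wide_row(10_000_000)` through `list_columns` without a limit reads only the first 100 columns, which is the kind of protection the thread asks for.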

On Tue, Mar 27, 2012 at 10:42 PM, Brian O'Neill b...@alumni.brown.edu wrote:

 Here she is:
 https://github.com/apache/cassandra/pull/8

 Verified functionally with the attached data script.

 -brian



 On Tue, Mar 27, 2012 at 9:49 PM, Brian O'Neill b...@alumni.brown.edu wrote:

 10-4.  I'll see if I can track it down and submit a pull request that
 specifies a default if one does not exist.

 -brian

 
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
 p: 215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/







 On 3/27/12 9:45 PM, Jonathan Ellis jbel...@gmail.com wrote:

 I believe we added support for specifying a column range to the cli
 recently.  I don't know if there is a default limit.
 
 On Tue, Mar 27, 2012 at 8:40 PM, Brian O'Neill b...@alumni.brown.edu wrote:
  Today, running 1.0.7, we saw a node crash with an OutOfMemory error.
  We have a single row with ~10 million columns in it (we use it as an
  index). We accidentally attempted to list the CF that had the wide row
  in the CLI. This caused the CLI to hang and eventually crashed
  Cassandra with an OoM.
 
  I know this is a case of "If it hurts when you do that, don't do that",
  but we may want to better protect against it in the CLI and/or the DB.
  I know we limit row counts on lists in the CLI.  Do we also limit
  column counts?  If not, I don't mind submitting a patch for this.
 
  let me know,
  brian
 
  --
  Brian ONeill
  Lead Architect, Health Market Science (http://healthmarketscience.com)
  mobile:215.588.6024
  blog: http://weblogs.java.net/blog/boneill42/
  blog: http://brianoneill.blogspot.com/
 
 
 
 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com





-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Document storage

2012-03-28 Thread Ben McCann
Hi,

I was wondering if it would be interesting to add some type of
document-oriented data type.

I've found it somewhat awkward to store document-oriented data in Cassandra
today.  I can build a JSON, Protobuf, or Thrift object, serialize it, and
store it, but Cassandra cannot differentiate it from any other string or
byte array.  However, if my column validation_class could be a JsonType,
tools could potentially do more interesting introspection on the column
value.  E.g. bug 3647 (https://issues.apache.org/jira/browse/CASSANDRA-3647)
calls for supporting arbitrarily nested documents in CQL.  Running a
query against the JSON column in Pig is possible as well, but again in this
use case it would be helpful to be able to encode in column metadata that
the column is stored as JSON.  For debugging, running nightly reports, etc.,
this would be quite useful compared to the opaque string and byte array types
we have today.  JSON is appealing because it would be easy to implement.
Something like Thrift or Protocol Buffers would also be interesting
since they would be more space efficient.  However, they would also be a
bit more difficult to implement because of the extra typing information
they provide.  I'm hoping that with Cassandra 1.0's addition of compression,
storing JSON is not too inefficient.

Would there be interest in adding a JsonType?  I could look at putting a
patch together.

Thanks,
Ben
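
To make the introspection argument concrete, here is a small Python sketch of a tool that uses column metadata to decide how to display a value. It assumes a hypothetical metadata mapping; `JsonType` is the type proposed in this message, not an existing Cassandra class, and `COLUMN_METADATA` and `render` are illustrative names only.

```python
import json
from typing import Any, Dict

# Hypothetical column metadata: column name -> validator name.
# "JsonType" is the proposed type, not something Cassandra ships today.
COLUMN_METADATA: Dict[str, str] = {
    "profile": "JsonType",
    "avatar": "BytesType",
}


def render(column: str, value: bytes) -> Any:
    """Decode a column value for display, guided by its declared type."""
    validator = COLUMN_METADATA.get(column, "BytesType")
    if validator == "JsonType":
        # With the type recorded, the tool can parse the document
        # and expose its fields.
        return json.loads(value.decode("utf-8"))
    # Without type information the value stays an opaque blob.
    return value.hex()
```

A column with no declared type is shown as hex even if it happens to contain JSON, which is the situation the message describes for today's opaque string and byte array types.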


Re: Document storage

2012-03-28 Thread Ben McCann
Any thoughts?  I'd like to submit a patch, but only if it will be accepted.

Thanks,
Ben






Re: Document storage

2012-03-28 Thread Jeremy Hanna
I don't speak for the project, but you might give it a day or two for people to 
respond and/or perhaps create a jira ticket.  Seems like that's a reasonable 
data type that would get some traction - a json type.  However, what would 
validation look like?  That's one of the main reasons there are the data types 
and validators, in order to validate on insert.
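
One possible answer to the validation question, sketched in Python under the assumption that "valid" simply means "parses as a UTF-8 JSON document". The function name `validate_json` is hypothetical and this is not an existing Cassandra validator.

```python
import json


def _reject_constant(name: str) -> None:
    # Python's parser accepts NaN and Infinity, which are not standard JSON.
    raise ValueError(f"non-standard constant {name}")


def validate_json(value: bytes) -> None:
    """Raise ValueError if `value` is not a well-formed UTF-8 JSON document."""
    try:
        text = value.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"not valid UTF-8: {exc}") from exc
    try:
        json.loads(text, parse_constant=_reject_constant)
    except ValueError as exc:
        # json.JSONDecodeError is a subclass of ValueError.
        raise ValueError(f"not valid JSON: {exc}") from exc
```

A validator like this would reject a write at insert time; the cost is a full parse of every value written.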

 



Re: Document storage

2012-03-28 Thread Jeremiah Jordan
Sounds interesting to me.  I looked into adding protocol buffer support at one 
point, and it didn't look like it would be too much work.  The tricky part was 
I also wanted to add indexing support for attributes of the inserted protocol 
buffers.  That looked a little trickier, but still not impossible.  Though 
other stuff came up and I never got around to actually writing any code.
JSON support would be nice, especially if you figured out how to get built in 
indexing of the attributes inside the JSON to work =).

-Jeremiah
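
A rough Python sketch of what indexing attributes inside a JSON value could involve: flatten each document into (path, value) pairs and map each pair back to the row keys that contain it. The names `flatten` and `build_index` and the dotted-path convention are assumptions made for illustration, not a description of how Cassandra's secondary indexes work.

```python
import json
from collections import defaultdict
from typing import Any, Dict, Iterator, List, Tuple


def flatten(doc: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, scalar) pairs for every leaf of a parsed JSON value."""
    if isinstance(doc, dict):
        for key, child in doc.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(doc, list):
        for child in doc:
            # List positions are dropped so every element is indexed
            # under the same path.
            yield from flatten(child, prefix)
    else:
        yield prefix, doc


def build_index(rows: Dict[str, bytes]) -> Dict[Tuple[str, Any], List[str]]:
    """Map (path, value) to the sorted row keys whose document contains it."""
    index = defaultdict(set)
    for row_key, raw in rows.items():
        for path, value in flatten(json.loads(raw)):
            index[(path, value)].add(row_key)
    return {entry: sorted(keys) for entry, keys in index.items()}
```

Looking up `("addr.city", "Oslo")` in the result returns the keys of the rows whose document has that nested attribute.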




Re: Document storage

2012-03-28 Thread Tatu Saloranta
On Wed, Mar 28, 2012 at 6:59 PM, Jeremiah Jordan
jeremiah.jor...@morningstar.com wrote:
 Sounds interesting to me.  I looked into adding protocol buffer support at 
 one point, and it didn't look like it would be too much work.  The tricky 
 part was I also wanted to add indexing support for attributes of the inserted 
 protocol buffers.  That looked a little trickier, but still not impossible.  
 Though other stuff came up and I never got around to actually writing any 
 code.
 JSON support would be nice, especially if you figured out how to get built in 
 indexing of the attributes inside the JSON to work =).

Also, for whatever it's worth, it should be trivial to add support for
Smile (binary JSON serialization):
http://wiki.fasterxml.com/SmileFormatSpec
since its logical data structure is pure JSON, with no extensions or
subsetting. The main Java implementation is by the Jackson project, but
there is also a C codec (https://github.com/pierre/libsmile), and
prototypes for PHP and Ruby bindings as well.
For all data it is a bit faster and a bit more compact: about 30% for
individual items, but more (40 - 70%) for data sequences (due to
optional back-referencing).

JSON and Smile can be auto-detected from the first 4 bytes or so, reliably
and efficiently, so one should be able to add this either
transparently or explicitly.
One could even transcode things on the fly -- store as Smile, expose
filtered results as JSON (and accept JSON or both). This could reduce
storage cost while keeping the benefits of a flexible data format.
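
The auto-detection mentioned above can be sketched in Python roughly as follows. It assumes the writer emits the Smile header (the bytes `:)` and a newline, followed by a version/flags byte) and that JSON values are objects or arrays; `detect_format` is a hypothetical helper, not part of Jackson or Cassandra.

```python
SMILE_MAGIC = b":)\n"  # first three header bytes; a version/flags byte follows


def detect_format(value: bytes) -> str:
    """Return "smile", "json", or "unknown" based on the leading bytes."""
    if len(value) >= 4 and value.startswith(SMILE_MAGIC):
        return "smile"
    stripped = value.lstrip(b" \t\r\n")
    # Only objects and arrays are treated as JSON here; bare scalars are
    # left out to keep false positives down.
    if stripped[:1] in (b"{", b"["):
        return "json"
    return "unknown"
```

This only inspects a prefix, so it says nothing about whether the rest of the value is well formed.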

-+ Tatu +-


Re: Document storage

2012-03-28 Thread Edward Capriolo
Some work I did stores JSON blobs in columns. The question on JSON
type is how to sort it.






Re: Document storage

2012-03-28 Thread Ben McCann
I don't imagine sort is a meaningful operation on JSON data.  As long as
the sorting is consistent I would think that should be sufficient.
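
One way to get a consistent (if not meaningful) order, sketched in Python: compare documents by the bytes of a canonical re-serialization, so that two documents differing only in key order or whitespace sort identically. `canonical` and `sort_documents` are hypothetical names; comparing the raw stored bytes directly would also be consistent, just sensitive to formatting.

```python
import json
from typing import List


def canonical(raw: bytes) -> bytes:
    """Re-serialize a JSON document with sorted keys and no extra whitespace."""
    doc = json.loads(raw)
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sort_documents(values: List[bytes]) -> List[bytes]:
    """Order documents by the bytes of their canonical form."""
    return sorted(values, key=canonical)
```

The order produced carries no semantic meaning; it only guarantees that the same set of documents always comes back in the same sequence.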

