Re: 2.0

2012-12-02 Thread Drew Kutcharian
I agree with Edward here. We use Thrift too and we haven't really found a good 
enough reason to move to CQL3.

-- Drew

On Dec 1, 2012, at 10:24 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 I do not understand why everyone wants to force this issue of removing
 Thrift. If CQL, CQL sparse tables, and the new transport are better, people
 will naturally begin to use them. But as it stands now, I see it this way:
 
 Thrift still has more clients for more languages, and more high-level
 clients for more languages.
 Thrift has Hadoop, Hive, and Pig support in the wild.
 Thrift has third-party tools like ORM tools, and support for tools like Flume.
 
 Most CQL3 features, like collections, do not work with compact tables, and
 compact tables are much more space efficient than their CQL3 sparse
 counterparts (composite rows with UTF-8 column names, blank rows, etc.).
 The CQL3 binary client is only available, in beta, for a few languages.
 
 So the project can easily remove Thrift today, but until a majority of the
 community's tooling adopts the transport and, for the most part, CQL's
 sparse tables, it is not going to mean anything. Many people already have
 code live in production working fine with the old toolset and will be
 unwilling to convert something "just because."
 
 Think about it like this: take a company like mine that already has something
 in production. Even if you could convince us that the CQL native transport
 was better (and by the way, no one has shown me a vast performance reason to
 this point), we still may not want to invest the resources to convert
 the app. Many companies endured the painful transition from Cassandra 0.6
 to Cassandra 0.7, and they are not going to eagerly entertain
 another change which is mostly cosmetic.
 
 Also, I find issues like this extremely frustrating:
 https://issues.apache.org/jira/browse/CASSANDRA-4924
 
 It seems like the project is drawing a hard line in the sand, dividing
 people. Is it the case that CQL3's sparse tables can't be accessed
 by Thrift, or is it the case that no one wants to make this happen? Is it
 technically impossible? It seems not, to me. In Cassandra, the row key,
 column, and value are all still byte arrays, right? So I do not see
 why Thrift users need to be locked out of them. Just like with composites,
 we will figure out how to pack the bytes.
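 
 To make the packing point concrete, here is a rough sketch of a hypothetical
 packer. It assumes the length-prefixed layout I understand CompositeType to
 use (each component written as a 2-byte length, the raw bytes, then an
 end-of-component byte); treat the format and the class name as assumptions,
 not a reference implementation.
 
     import java.nio.ByteBuffer;
 
     public final class CompositePacker {
         // Assumed CompositeType layout per component:
         // <2-byte big-endian length><component bytes><end-of-component byte>
         public static ByteBuffer pack(byte[]... components) {
             int size = 0;
             for (byte[] c : components) size += 2 + c.length + 1;
             ByteBuffer out = ByteBuffer.allocate(size);
             for (byte[] c : components) {
                 out.putShort((short) c.length); // length prefix
                 out.put(c);                     // raw component bytes
                 out.put((byte) 0);              // end-of-component marker
             }
             out.flip();
             return out;
         }
     }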
 
 I hope that we can stop talking about removing Thrift until there is some
 consensus among active users that it is not in use anymore.
 This consensus is not as simple as n committers saying that something is
 technically not needed anymore. It has to look at the users, the number of
 clients, the number of languages, and the number of high-level tools available.
 In the meantime, when issues like 4924 pop up, it would be better if people
 tried to find solutions for maximum forward and backward compatibility
 instead of drawing a line and trying to shut Thrift users out of things.
 
 Avro was much the same way. I had a spirited debate on IRC and got
 basically insulted because I believed Thrift was not dead. The glory of
 Avro never came true because it really did not work for clients outside a
 few languages. CQL and the binary transport have to pass this same litmus
 test. Let them gain momentum, get rock-solid clients for five languages, and
 have higher-level tools written on top; then it's easy to say Thrift is
 not needed anymore.
 
 
 On Saturday, December 1, 2012, Sylvain Lebresne wrote:
 
 I agree on 2.0.
 
 For the Thrift part, we've said clearly that we wouldn't remove it any time
 soon, so let's stick to that. Besides, I would agree it's too soon anyway.
 What we can do in the relatively short term on that front, however, is to
 pull Thrift into its own jar (we've almost removed all internal dependencies
 on Thrift, and the few remaining ones will be easy to kill) and make that
 jar optional if you don't want to use it.
 
 --
 Sylvain
 
 
 On Sat, Dec 1, 2012 at 2:52 AM, Ray Slakinski ray.slakin...@gmail.com
 wrote:
 
 I agree. I don't think it's a great idea to drop Thrift until the back-end
 tools are 100% compatible and we have some level of agreement from the
 major users of Cassandra.
 
 Paying off technical debt, though, I'm all for, and I think it's key to the
 long-term success of the application. Right now, someone new coming to the
 system might look at supercolumns and think "Hey, these things look great.
 Let's use them," and in a few months' time hate all things Cassandra.
 
 Ray Slakinski
 
 On 12/01, Jonathan Ellis wrote:
 As attractive as it would be to clean house, I think we owe it to our
 users to keep Thrift around for the foreseeable future rather than
 orphan all Thrift-using applications (which is virtually everyone) on
 1.2.
 
 On Sat, Dec 1, 2012 at 7:33 AM, Jason Brown jasedbr...@gmail.com
 wrote:
 Hi Jonathan,
 
 I'm in favor of paying off the technical debt, as well, and I wonder
 if
 there is value in 

Re: Document storage

2012-03-29 Thread Drew Kutcharian
I'm actually doing something almost the same. I serialize my objects into 
byte[] using Jackson's SMILE format, then compress it using Snappy, then store 
the byte[] in Cassandra. I actually created a simple Cassandra type for this, 
but I hit a wall with cassandra-cli:

https://issues.apache.org/jira/browse/CASSANDRA-4081

Please vote on the JIRA if you are interested.

Validation is pretty simple: you just read the value and parse it using 
Jackson; if you don't get any exceptions, your JSON/SMILE is valid ;)
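
For anyone curious, here is a minimal sketch of that round trip, assuming 
Jackson's SMILE module and snappy-java; the class and method names are 
illustrative, not from my production code:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.smile.SmileFactory;
    import org.xerial.snappy.Snappy;

    public final class SmileBlobCodec {
        private static final ObjectMapper SMILE = new ObjectMapper(new SmileFactory());

        // Write path: POJO -> SMILE bytes -> Snappy-compressed byte[] for Cassandra.
        public static byte[] encode(Object value) throws Exception {
            return Snappy.compress(SMILE.writeValueAsBytes(value));
        }

        // Read path: decompress, then bind the SMILE bytes back to a POJO.
        public static <T> T decode(byte[] blob, Class<T> type) throws Exception {
            return SMILE.readValue(Snappy.uncompress(blob), type);
        }

        // Validation as described above: parse and see whether Jackson throws.
        public static boolean isValid(byte[] smileBytes) {
            try {
                SMILE.readTree(smileBytes);
                return true;
            } catch (Exception e) {
                return false;
            }
        }
    }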

-- Drew



On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:

 I don't imagine sort is a meaningful operation on JSON data.  As long as
 the sorting is consistent I would think that should be sufficient.
 
 
 On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
 
 Some work I did stores JSON blobs in columns. The question on JSON
 type is how to sort it.
 
 On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
 jeremy.hanna1...@gmail.com wrote:
 I don't speak for the project, but you might give it a day or two for
 people to respond and/or perhaps create a JIRA ticket.  Seems like that's a
 reasonable data type that would get some traction - a JSON type.  However,
 what would validation look like?  That's one of the main reasons there are
 the data types and validators: to validate on insert.
 
 On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
 
 Any thoughts?  I'd like to submit a patch, but only if it will be
 accepted.
 
 Thanks,
 Ben
 
 
 On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann b...@benmccann.com wrote:
 
 Hi,
 
 I was wondering if it would be interesting to add some type of
 document-oriented data type.
 
 I've found it somewhat awkward to store document-oriented data in
 Cassandra today.  I can make a JSON/Protobuf/Thrift object, serialize it,
 and store it, but Cassandra cannot differentiate it from any other string
 or byte array.  However, if my column validation_class could be a JsonType,
 that would allow tools to potentially do more interesting introspection on
 the column value.  E.g. bug 3647
 https://issues.apache.org/jira/browse/CASSANDRA-3647 calls for supporting
 arbitrarily nested documents in CQL.  Running a query against the JSON
 column in Pig is possible as well, but again in this use case it would be
 helpful to be able to encode in column metadata that the column is stored
 as JSON.  For debugging, running nightly reports, etc. it would be quite
 useful compared to the opaque string and byte array types we have today.
 JSON is appealing because it would be easy to implement.  Something like
 Thrift or Protocol Buffers would actually be interesting since they would
 be more space efficient.  However, they would also be a bit more difficult
 to implement because of the extra typing information they provide.  I'm
 hoping that with Cassandra 1.0's addition of compression, storing JSON is
 not too inefficient.
 
 Would there be interest in adding a JsonType?  I could look at putting a
 patch together.
 
 Thanks,
 Ben
 
 
 
 



Re: Document storage

2012-03-29 Thread Drew Kutcharian
Hi Ben,

Sure, there's nothing really to it, but I'll email it to you. As for why I'm 
using Snappy on the type instead of sstable_compression: when you set 
sstable_compression, the compression happens on the Cassandra nodes, and I see 
two advantages with my approach:

1. Saving extra CPU usage on the Cassandra nodes, since 
compression/decompression can easily be done on the client nodes, where there 
is plenty of idle CPU time.

2. Saving network bandwidth, since you're sending over a compressed byte[].

One thing to note about my approach: when I define the schema in 
Cassandra, I define the columns as byte[] rather than my custom type, and I do 
all the conversion on the client side.
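
If it helps, here is what the write side might look like under that setup, 
sketched with the Hector client (the column family and column names are made 
up; any Thrift-based client works the same way, since the server only ever 
sees bytes):

    import me.prettyprint.cassandra.serializers.BytesArraySerializer;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class ProfileDao {
        private final Keyspace keyspace;

        public ProfileDao(Keyspace keyspace) {
            this.keyspace = keyspace;
        }

        // The column family is declared with BytesType values; all typing,
        // serialization, and compression stay on the client.
        public void save(String userId, Object profile) throws Exception {
            byte[] blob = SmileBlobCodec.encode(profile); // SMILE + Snappy, as sketched earlier
            Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
            mutator.insert(userId, "UserProfile",
                    HFactory.createColumn("doc", blob,
                            StringSerializer.get(), BytesArraySerializer.get()));
        }
    }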

-- Drew




On Mar 29, 2012, at 12:04 AM, Ben McCann wrote:

 Sounds awesome, Drew.  Mind sharing your custom type?  I just wrote a basic
 JSON type and did the validation the same way you did, but I don't have any
 SMILE support yet.  It seems that if your type were committed to the
 Cassandra codebase, then the issue you ran into of the CLI only supporting
 built-in types would no longer be a problem for you (though fixing the
 issue anyway would be good, and I voted for it).  Btw, any reason you
 compress it with Snappy yourself instead of just setting sstable_compression
 to SnappyCompressor
 http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression and
 letting Cassandra do that part?
 
 -Ben
 
 
 On Wed, Mar 28, 2012 at 11:28 PM, Drew Kutcharian d...@venarc.com wrote:
 
 I'm actually doing something almost the same. I serialize my objects into
 byte[] using Jackson's SMILE format, then compress it using Snappy, then
 store the byte[] in Cassandra. I actually created a simple Cassandra type
 for this, but I hit a wall with cassandra-cli:
 
 https://issues.apache.org/jira/browse/CASSANDRA-4081
 
 Please vote on the JIRA if you are interested.
 
 Validation is pretty simple: you just read the value and parse it
 using Jackson; if you don't get any exceptions, your JSON/SMILE is valid ;)
 
 -- Drew
 
 
 
 On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:
 
 I don't imagine sort is a meaningful operation on JSON data.  As long as
 the sorting is consistent I would think that should be sufficient.
 
 
 On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
 
 Some work I did stores JSON blobs in columns. The question on JSON
 type is how to sort it.
 
 On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
 jeremy.hanna1...@gmail.com wrote:
 I don't speak for the project, but you might give it a day or two for
 people to respond and/or perhaps create a JIRA ticket.  Seems like that's
 a reasonable data type that would get some traction - a JSON type.
 However, what would validation look like?  That's one of the main reasons
 there are the data types and validators: to validate on insert.
 
 On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
 
 Any thoughts?  I'd like to submit a patch, but only if it will be
 accepted.
 
 Thanks,
 Ben
 
 
 On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann b...@benmccann.com
 wrote:
 
 Hi,
 
 I was wondering if it would be interesting to add some type of
 document-oriented data type.
 
 I've found it somewhat awkward to store document-oriented data in
 Cassandra today.  I can make a JSON/Protobuf/Thrift object, serialize it,
 and store it, but Cassandra cannot differentiate it from any other string
 or byte array.  However, if my column validation_class could be a JsonType,
 that would allow tools to potentially do more interesting introspection on
 the column value.  E.g. bug 3647
 https://issues.apache.org/jira/browse/CASSANDRA-3647 calls for supporting
 arbitrarily nested documents in CQL.  Running a query against the JSON
 column in Pig is possible as well, but again in this use case it would be
 helpful to be able to encode in column metadata that the column is stored
 as JSON.  For debugging, running nightly reports, etc. it would be quite
 useful compared to the opaque string and byte array types we have today.
 JSON is appealing because it would be easy to implement.  Something like
 Thrift or Protocol Buffers would actually be interesting since they would
 be more space efficient.  However, they would also be a bit more difficult
 to implement because of the extra typing information they provide.  I'm
 hoping that with Cassandra 1.0's addition of compression, storing JSON is
 not too inefficient.
 
 Would there be interest in adding a JsonType?  I could look at putting a
 patch together.
 
 Thanks,
 Ben
 
 
 
 
 
 



Re: Document storage

2012-03-29 Thread Drew Kutcharian
I agree with Edward here: the simpler we keep the core, the better. I think all 
the ser/deser and conversions should happen on the client side.

-- Drew


On Mar 29, 2012, at 8:36 AM, Edward Capriolo wrote:

 The issue with these super-complex types is that to do anything useful with
 them you would either need scanners or coprocessors. As it stands right now,
 complex data like JSON is fairly opaque to Cassandra. Getting Cassandra to
 natively speak protobufs, or whatever flavor-of-the-week serialization
 framework is hip right now, would make the codebase very large. How is that
 field sorted? How is it indexed? This is starting to go very far against the
 schema-less NoSQL grain. Where does this end: users wanting to store binary
 XML, index it, and feed Cassandra XPath queries?
 
 
 On Thu, Mar 29, 2012 at 11:23 AM, Ben McCann b...@benmccann.com wrote:
 Creating materialized paths may well be a possible solution.  If that were
 the solution the community were to agree upon then I would like it to be a
 standardized and well-documented best practice.  I asked how to store a
 list of values on the user
 listhttp://www.mail-archive.com/user@cassandra.apache.org/msg21274.html
 and
 no one suggested [fieldName, TimeUUID]: fieldValue.  It would be a
 huge pain right now to create materialized paths like this for each of my
 objects, so client library support would definitely be needed.  And the
 client libraries should agree.  If Astyanax and lazyboy both add support
 for materialized path and I write an object to Cassandra with Astyanax,
 then I should be able to read it back with lazyboy.  The benefit of using
 JSON/SMILE is that it's very clear that there's exactly one way to
 serialize and deserialize the data and it's very easy.  It's not clear to
 me that this is true using materialized paths.
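 
 To make "materialized paths" concrete, here is a toy sketch of the kind of
 decomposition a client library would have to standardize: flatten a document
 into path -> leaf-value pairs, where each pair becomes one column. The
 dot-separated path scheme is just one arbitrary choice, not a proposal for
 any particular library:
 
     import com.fasterxml.jackson.databind.JsonNode;
     import com.fasterxml.jackson.databind.ObjectMapper;
     import java.util.LinkedHashMap;
     import java.util.Map;
 
     public final class MaterializedPaths {
         // Flattens {"name":"ann","tags":["a","b"]} into:
         //   name   -> "ann"
         //   tags.0 -> "a"
         //   tags.1 -> "b"
         public static Map<String, String> flatten(JsonNode node) {
             Map<String, String> columns = new LinkedHashMap<>();
             walk("", node, columns);
             return columns;
         }
 
         private static void walk(String path, JsonNode node, Map<String, String> out) {
             if (node.isObject()) {
                 node.fields().forEachRemaining(e ->
                         walk(join(path, e.getKey()), e.getValue(), out));
             } else if (node.isArray()) {
                 for (int i = 0; i < node.size(); i++) {
                     walk(join(path, String.valueOf(i)), node.get(i), out);
                 }
             } else {
                 out.put(path, node.asText()); // leaf value: one column per path
             }
         }
 
         private static String join(String path, String segment) {
             return path.isEmpty() ? segment : path + "." + segment;
         }
 
         public static void main(String[] args) throws Exception {
             JsonNode doc = new ObjectMapper()
                     .readTree("{\"name\":\"ann\",\"tags\":[\"a\",\"b\"]}");
             flatten(doc).forEach((k, v) -> System.out.println(k + " -> " + v));
         }
     }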
 
 
 On Thu, Mar 29, 2012 at 8:21 AM, Tyler Patterson tpatter...@datastax.com
 wrote:
 
 
 
 Would there be interest in adding a JsonType?
 
 
 What about checking that data inserted into a JsonType is valid JSON? How
 would you do it, and would the overhead be something we are concerned
 about, especially if the JSON string is large?
 



Re: Document storage

2012-03-29 Thread Drew Kutcharian
 I think this is a much better approach because that gives you the
 ability to update or retrieve just parts of objects efficiently,
 rather than making column values just blobs with a bunch of special
 case logic to introspect them.  Which feels like a big step backwards
 to me.

Unless your access pattern involves reading/writing the whole document each 
time. In that case you're better off serializing the whole document and storing 
it in a column as a byte[] without incurring the overhead of column indexes. 
Right?


On Mar 29, 2012, at 9:23 AM, Jonathan Ellis wrote:

 On Thu, Mar 29, 2012 at 9:57 AM, Jeremiah Jordan
 jeremiah.jor...@morningstar.com wrote:
 It's not clear what 3647 actually is; there is no code attached, and no real
 example in it.
 
 Aside from that, the reason this would be useful to me (if we could get
 indexing of attributes working) is that I already have my data in
 JSON/Thrift/ProtoBuff. Depending on how large the data is, it isn't trivial
 to break it up into columns on insert, and re-assemble it on read.
 
 I don't understand the problem.  Assuming Cassandra support for maps
 and lists, I could write a Python module that takes json (or thrift,
 or protobuf) objects and splits them into Cassandra rows by fields in
 a couple hours.  I'm pretty sure this is essentially what Brian's REST
 api for Cassandra does now.
 
 I think this is a much better approach because that gives you the
 ability to update or retrieve just parts of objects efficiently,
 rather than making column values just blobs with a bunch of special
 case logic to introspect them.  Which feels like a big step backwards
 to me.
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



Re: Document storage

2012-03-29 Thread Drew Kutcharian
Yes, I meant the row header index. What I have done is store an 
object (i.e. a UserProfile) that you read or write as a whole (a user updates 
their user details in a single page in the UI). I serialize that object into 
binary JSON using the SMILE format, then compress it using Snappy on the client 
side. So as far as Cassandra cares, it's storing a byte[].

Now on the client side, I'm using cassandra-cli with a custom type that knows 
how to turn a byte[] into JSON text and back. The only issue was 
CASSANDRA-4081, where "assume" doesn't work with custom types. If CASSANDRA-4081 
gets fixed, I'll get the best of both worlds.
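
For the curious, the conversion that custom type wraps boils down to something 
like this sketch (Cassandra's AbstractType plumbing omitted; class and method 
names are illustrative):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.smile.SmileFactory;
    import org.xerial.snappy.Snappy;

    public final class SmileJsonBridge {
        private static final ObjectMapper SMILE = new ObjectMapper(new SmileFactory());
        private static final ObjectMapper JSON = new ObjectMapper();

        // What the CLI displays: stored byte[] -> decompress -> SMILE tree -> JSON text.
        public static String toDisplayString(byte[] stored) throws Exception {
            return JSON.writeValueAsString(SMILE.readTree(Snappy.uncompress(stored)));
        }

        // What a 'set' would accept: JSON text -> SMILE bytes -> compressed byte[].
        public static byte[] fromDisplayString(String json) throws Exception {
            return Snappy.compress(SMILE.writeValueAsBytes(JSON.readTree(json)));
        }
    }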

Also, the advantages of this vs. the Thrift-based super column families are:

1. Saving extra CPU usage on the Cassandra nodes, since serialization/deserialization 
and compression/decompression happen on the client nodes, where there is plenty 
of idle CPU time.

2. Saving network bandwidth, since I'm sending over a compressed byte[].


-- Drew



On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:

 On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com wrote:
 I think this is a much better approach because that gives you the
 ability to update or retrieve just parts of objects efficiently,
 rather than making column values just blobs with a bunch of special
 case logic to introspect them.  Which feels like a big step backwards
 to me.
 
 Unless your access pattern involves reading/writing the whole document each 
 time. In that case you're better off serializing the whole document and 
 storing it in a column as a byte[] without incurring the overhead of column 
 indexes. Right?
 
 Hmm, not sure what you're thinking of there.
 
 If you mean the index that's part of the row header for random
 access within a row, then no, serializing to byte[] doesn't save you
 anything.
 
 If you mean secondary indexes, don't declare any if you don't want any. :)
 
 Just telling C* to store a byte[] *will* be slightly lighter-weight
 than giving it named columns, but we're talking negligible compared to
 the overhead of actually moving the data on or off disk in the first
 place.  Not even close to being worth giving up being able to deal
 with your data from standard tools like cqlsh, IMO.
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



Re: major version release schedule

2011-12-24 Thread Drew Kutcharian
I think there are a couple of different ideas at play here:

1) Time to release

2) Quality of the release

IMO, the issue that affects most people is the quality of the release. So when 
someone says that we should slow down the release cycles, I think what they 
mean is that we should spend more time improving the quality of the releases. 
Now, if there were a process that could ensure the quality of a release, 
especially the newly added features, I don't think anyone would complain about 
having quick releases.

-- Drew


On Dec 20, 2011, at 4:15 PM, Peter Schuller wrote:

 Until recently we were working hard to reach a set of goals that
 culminated in a 1.0 release.  I'm not sure we've had a formal
 discussion on it, but just talking to people, there seems to be
 consensus around the idea that we're now shifting our goals and
 priorities around some (usability, stability, etc).  If that's the
 case, I think we should at least be open to reevaluating our release
 process and schedule accordingly (whether that means lengthening,
 shortening, and/or simply shifting the barrier-to-entry for stable
 updates).
 
 Personally I am all for added stability, quality, and testing. But I
 don't see how a decreased release frequency will cause more stability.
 It may be that decreased release frequency is the necessary *result*
 of more stability, but I don't think the causality points in the other
 direction unless developers ship things early to get it into the
 release.
 
 But also keep in mind: If we reach a point where major users of
 Cassandra need to run on significantly divergent versions of Cassandra
 because the release is just too old, the normal mainstream release
 will end up getting even less testing.
 
 -- 
 / Peter Schuller (@scode, http://worldmodscode.wordpress.com)