Re: 2.0
I agree with Edward here. We use Thrift too and we haven't really found a good enough reason to move to CQL3.

-- Drew

On Dec 1, 2012, at 10:24 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

I do not understand why everyone wants to force the issue of removing Thrift. If CQL, CQL sparse tables, and the new transport are better, people will naturally begin to use them, but as it stands now I see it this way: Thrift still has more clients for more languages, and more high-level clients for more languages. Thrift has Hadoop, Hive, and Pig support in the wild. Thrift has third-party tools like ORM tools and support for tools like Flume. Most CQL3 features like collections do not work with compact tables, and compact tables are much more space efficient than their CQL3 sparse counterparts (composite rows with UTF-8 column names, blank rows, etc.). The CQL3 binary client is only available, in beta, for a few languages. So the project could easily remove Thrift today, but until a majority of the community's tooling adopts the transport and, for the most part, CQL's sparse tables, it is not going to mean anything.

Many people already have code live in production working fine with the old toolset and will be unwilling to convert it just because. Think about it like this: take a company like mine that already has something in production. Even if you could convince us that the CQL native transport is better -- and by the way, no one has shown me a compelling performance reason to this point -- we still may not want to invest the resources to convert our app. Many companies endured the painful transition from Cassandra 0.6 to Cassandra 0.7 and are not eager to entertain another change which is mostly cosmetic.

Also, I find issues like this extremely frustrating: https://issues.apache.org/jira/browse/CASSANDRA-4924 It seems like the project is drawing a hard line in the sand, dividing people. Is it the case that CQL3's sparse tables can't be accessed by Thrift, or is it the case that no one wants to make this happen? Is it technically impossible? It seems not to me: in Cassandra the row key, column, and value are all still byte arrays, right? So I do not see why Thrift users need to be locked out of them. Just like with composites, we will figure out how to pack the bytes (a sketch of that packing follows this thread).

I hope that we can stop talking about removing Thrift until there is some consensus among active users that it is no longer in use. This consensus is not as simple as n committers saying that something is technically not needed anymore. It has to look at the users, the number of clients, the number of languages, and the number of high-level tools available. In the meantime, when issues like 4924 pop up, it would be better if people tried to find solutions for maximum forward and backward compatibility instead of drawing a line and trying to shut Thrift users out of things. Avro was much the same way. I had a spirited debate on IRC and got basically insulted because I believed Thrift was not dead. The glory of Avro never came true because it really did not work for clients outside a few languages. CQL and the binary transport have to pass the same litmus test. Let them gain momentum, get rock-solid clients for five languages, and have higher-level tools written on top; then it's easy to say Thrift is not needed anymore.

On Saturday, December 1, 2012, Sylvain Lebresne wrote:

I agree on 2.0. For the thrift part, we've said clearly that we wouldn't remove it any time soon, so let's stick to that.
Besides, I would agree it's too soon anyway. What we can do in the relatively short term on that front, however, is pull Thrift into its own jar (we've almost removed all internal dependencies on Thrift, and the few remaining ones will be easy to kill) and make that jar optional if you don't want to use it.

-- Sylvain

On Sat, Dec 1, 2012 at 2:52 AM, Ray Slakinski ray.slakin...@gmail.com wrote:

I agree. I don't think it's a great idea to drop Thrift until the back-end tools are 100% compatible and there is some level of agreement from the major users of Cassandra. Paying off technical debt, though, I'm all for, and I think it's key to the long-term success of the application. Right now, someone new coming to the system might look at super columns and think, "Hey, these things look great, let's use them," and in a few months' time hate all things Cassandra.

Ray Slakinski

On 12/01, Jonathan Ellis wrote:

As attractive as it would be to clean house, I think we owe it to our users to keep Thrift around for the foreseeable future rather than orphan all Thrift-using applications (which is virtually everyone) on 1.2.

On Sat, Dec 1, 2012 at 7:33 AM, Jason Brown jasedbr...@gmail.com wrote:

Hi Jonathan, I'm in favor of paying off the technical debt, as well, and I wonder if there is value in
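[For reference, the kind of byte-packing Edward alludes to: Cassandra's CompositeType encodes each component as a two-byte big-endian length, the raw bytes, and a trailing end-of-component byte. A rough illustrative sketch in Java, not the project's actual implementation:]

    import java.io.ByteArrayOutputStream;

    public class CompositePacker {
        // Pack components the way CompositeType lays them out:
        // <2-byte length><component bytes><end-of-component byte (0)> per component.
        // Each component must fit in an unsigned short, i.e. at most 65535 bytes.
        public static byte[] pack(byte[]... components) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (byte[] c : components) {
                out.write((c.length >> 8) & 0xFF); // length, high byte
                out.write(c.length & 0xFF);        // length, low byte
                out.write(c, 0, c.length);         // the component itself
                out.write(0);                      // end-of-component marker
            }
            return out.toByteArray();
        }
    }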
Re: Document storage
I'm actually doing something almost the same. I serialize my objects into byte[] using Jackson's SMILE format, compress that using Snappy, and then store the byte[] in Cassandra. I actually created a simple Cassandra type for this, but I hit a wall with cassandra-cli: https://issues.apache.org/jira/browse/CASSANDRA-4081 Please vote on the JIRA if you are interested. Validation is pretty simple: you just read the value and parse it using Jackson; if you don't get any exceptions, your JSON/SMILE is valid ;)

-- Drew

On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:

I don't imagine sort is a meaningful operation on JSON data. As long as the sorting is consistent, I would think that should be sufficient.

On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

Some work I did stores JSON blobs in columns. The question for a JSON type is how to sort it.

On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote:

I don't speak for the project, but you might give it a day or two for people to respond and/or perhaps create a JIRA ticket. Seems like that's a reasonable data type that would get some traction - a JSON type. However, what would validation look like? That's one of the main reasons the data types and validators exist: to validate on insert.

On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:

Any thoughts? I'd like to submit a patch, but only if it will be accepted. Thanks, Ben

On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann b...@benmccann.com wrote:

Hi, I was wondering if it would be interesting to add some type of document-oriented data type. I've found it somewhat awkward to store document-oriented data in Cassandra today. I can create a JSON/Protobuf/Thrift document, serialize it, and store it, but Cassandra cannot differentiate it from any other string or byte array. However, if my column's validation_class could be a JsonType, that would allow tools to potentially do more interesting introspection on the column value. E.g., bug 3647 https://issues.apache.org/jira/browse/CASSANDRA-3647 calls for supporting arbitrarily nested documents in CQL. Running a query against the JSON column in Pig is possible as well, but again, in this use case it would be helpful to be able to encode in the column metadata that the column is stored as JSON. For debugging, running nightly reports, etc., it would be quite useful compared to the opaque string and byte array types we have today. JSON is appealing because it would be easy to implement. Something like Thrift or Protocol Buffers would also be interesting, since they would be more space efficient. However, they would be a bit more difficult to implement because of the extra typing information they provide. I'm hoping that with Cassandra 1.0's addition of compression, storing JSON is not too inefficient. Would there be interest in adding a JsonType? I could look at putting a patch together. Thanks, Ben
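[A minimal sketch of the write path Drew describes -- SMILE via Jackson, then Snappy, then store the byte[] -- assuming the Jackson 2 SMILE module and the xerial snappy-java library; class and package names here are illustrative:]

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.smile.SmileFactory;
    import org.xerial.snappy.Snappy;

    public class SmileSnappyCodec {
        // One mapper configured for SMILE (binary JSON) instead of text JSON.
        private static final ObjectMapper MAPPER = new ObjectMapper(new SmileFactory());

        // Serialize any POJO to SMILE, then Snappy-compress the result.
        // The returned byte[] is what gets written to Cassandra as an opaque blob.
        public static byte[] encode(Object value) throws java.io.IOException {
            byte[] smile = MAPPER.writeValueAsBytes(value);
            return Snappy.compress(smile);
        }
    }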
Re: Document storage
Hi Ben, Sure, there's nothing really to it, but I'll email it to you. As for why I'm using Snappy on the type instead of sstable_compression: with sstable_compression, the compression happens on the Cassandra nodes, and I see two advantages with my approach:

1. Saving extra CPU on the Cassandra nodes, since compression/decompression can easily be done on the client nodes, where there is plenty of idle CPU time.
2. Saving network bandwidth, since you're sending over a compressed byte[].

One thing to note about my approach is that when I define the schema in Cassandra, I define the columns as byte[] rather than my custom type, and I do all the conversion on the client side.

-- Drew

On Mar 29, 2012, at 12:04 AM, Ben McCann wrote:

Sounds awesome Drew. Mind sharing your custom type? I just wrote a basic JSON type and did the validation the same way you did, but I don't have any SMILE support yet. It seems that if your type were committed to the Cassandra codebase, then the issue you ran into of the CLI only supporting built-in types would no longer be a problem for you (though fixing the issue anyway would be good, and I voted for it). Btw, any reason you compress it with Snappy yourself instead of just setting sstable_compression to SnappyCompressor http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression and letting Cassandra do that part? -Ben
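[The corresponding client-side read path, plus the parse-to-validate check Drew mentions; a sketch under the same assumed libraries, with the target class supplied by the caller:]

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.smile.SmileFactory;
    import org.xerial.snappy.Snappy;

    public class SmileSnappyReader {
        private static final ObjectMapper MAPPER = new ObjectMapper(new SmileFactory());

        // Decompress the blob read from Cassandra and bind it back to a POJO.
        public static <T> T decode(byte[] blob, Class<T> type) throws java.io.IOException {
            byte[] smile = Snappy.uncompress(blob);
            return MAPPER.readValue(smile, type);
        }

        // Validation in the style Drew describes: just try to parse.
        // Jackson throws on malformed SMILE/JSON, so no exception means valid.
        public static boolean isValid(byte[] blob) {
            try {
                MAPPER.readTree(Snappy.uncompress(blob));
                return true;
            } catch (Exception e) {
                return false;
            }
        }
    }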
Re: Document storage
I agree with Edward here; the simpler we keep the core, the better. I think all the ser/deser and conversions should happen on the client side.

-- Drew

On Mar 29, 2012, at 8:36 AM, Edward Capriolo wrote:

The issue with these super-complex types is that to do anything useful with them you would need either scanners or coprocessors. As it stands right now, complex data like JSON is fairly opaque to Cassandra. Getting Cassandra to natively speak protobufs, or whatever flavor-of-the-week serialization framework is hip right now, would make the codebase very large. How is that field sorted? How is it indexed? This is starting to go very far against the schema-less NoSQL grain. Where does this end: users wanting to store binary XML, index it, and feed Cassandra XPath queries?

On Thu, Mar 29, 2012 at 11:23 AM, Ben McCann b...@benmccann.com wrote:

Creating materialized paths may well be a possible solution. If that were the solution the community agreed upon, then I would like it to be a standardized and well-documented best practice. I asked how to store a list of values on the user list http://www.mail-archive.com/user@cassandra.apache.org/msg21274.html and no one suggested [fieldName, TimeUUID]: fieldValue. It would be a huge pain right now to create materialized paths like this for each of my objects, so client library support would definitely be needed. And the client libraries should agree: if Astyanax and lazyboy both add support for materialized paths and I write an object to Cassandra with Astyanax, then I should be able to read it back with lazyboy. The benefit of using JSON/SMILE is that it's very clear there's exactly one way to serialize and deserialize the data, and it's very easy. It's not clear to me that this is true with materialized paths.

On Thu, Mar 29, 2012 at 8:21 AM, Tyler Patterson tpatter...@datastax.com wrote:

Would there be interest in adding a JsonType? What about checking that data inserted into a JsonType is valid JSON? How would you do it, and would the overhead be something we are concerned about, especially if the JSON string is large?
Re: Document storage
I think this is a much better approach because that gives you the ability to update or retrieve just parts of objects efficiently, rather than making column values just blobs with a bunch of special-case logic to introspect them. Which feels like a big step backwards to me.

Unless your access pattern involves reading/writing the whole document each time. In that case you're better off serializing the whole document and storing it in a column as a byte[], without incurring the overhead of column indexes. Right?

On Mar 29, 2012, at 9:23 AM, Jonathan Ellis wrote:

On Thu, Mar 29, 2012 at 9:57 AM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote:

It's not clear what 3647 actually is; there is no code attached and no real example in it. Aside from that, the reason this would be useful to me (if we could get indexing of attributes working) is that I already have my data in JSON/Thrift/Protobuf. Depending on how large the data is, it isn't trivial to break it up into columns on insert and reassemble it on read.

I don't understand the problem. Assuming Cassandra support for maps and lists, I could write a Python module that takes JSON (or Thrift, or protobuf) objects and splits them into Cassandra rows by fields in a couple of hours. I'm pretty sure this is essentially what Brian's REST API for Cassandra does now. I think this is a much better approach because it gives you the ability to update or retrieve just parts of objects efficiently, rather than making column values just blobs with a bunch of special-case logic to introspect them. Which feels like a big step backwards to me.

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
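[Jonathan mentions doing this in Python; a toy Java/Jackson version of the same idea -- splitting a JSON document into per-field columns -- might look like the following. The dotted-path column naming is an assumption for illustration, not what Brian's REST API actually does:]

    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonFlattener {
        // Flatten {"name":{"first":"Ada"},"tags":["x","y"]} into
        // name.first -> Ada, tags.0 -> x, tags.1 -> y
        public static Map<String, String> flatten(String json) throws java.io.IOException {
            Map<String, String> columns = new LinkedHashMap<String, String>();
            walk(new ObjectMapper().readTree(json), "", columns);
            return columns;
        }

        private static void walk(JsonNode node, String path, Map<String, String> out) {
            if (node.isObject()) {
                Iterator<Map.Entry<String, JsonNode>> it = node.fields();
                while (it.hasNext()) {
                    Map.Entry<String, JsonNode> e = it.next();
                    walk(e.getValue(), path.isEmpty() ? e.getKey() : path + "." + e.getKey(), out);
                }
            } else if (node.isArray()) {
                for (int i = 0; i < node.size(); i++)
                    walk(node.get(i), path + "." + i, out); // one column per array element
            } else {
                out.put(path, node.asText()); // leaf: one Cassandra column per field
            }
        }
    }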
Re: Document storage
Yes, I meant the row header index. What I've done is store an object (i.e. a UserProfile) that is read or written as a whole (a user updates their details on a single page in the UI). I serialize that object into binary JSON using the SMILE format, then compress it using Snappy on the client side, so as far as Cassandra cares, it's storing a byte[]. On the client side, I'm using cassandra-cli with a custom type that knows how to turn a byte[] into JSON text and back. The only issue was CASSANDRA-4081, where assume doesn't work with custom types. If CASSANDRA-4081 gets fixed, I'll have the best of both worlds.

Also, the advantages of this vs. the Thrift-based super column families are:

1. Saving extra CPU on the Cassandra nodes, since serialization/deserialization and compression/decompression happen on the client nodes, where there is plenty of idle CPU time.
2. Saving network bandwidth, since I'm sending over a compressed byte[].

-- Drew

On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:

On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian d...@venarc.com wrote:

I think this is a much better approach because that gives you the ability to update or retrieve just parts of objects efficiently, rather than making column values just blobs with a bunch of special-case logic to introspect them. Which feels like a big step backwards to me.

Unless your access pattern involves reading/writing the whole document each time. In that case you're better off serializing the whole document and storing it in a column as a byte[] without incurring the overhead of column indexes. Right?

Hmm, not sure what you're thinking of there. If you mean the index that's part of the row header for random access within a row, then no, serializing to byte[] doesn't save you anything. If you mean secondary indexes, don't declare any if you don't want any. :) Just telling C* to store a byte[] *will* be slightly lighter-weight than giving it named columns, but we're talking negligible compared to the overhead of actually moving the data on or off disk in the first place. Not even close to being worth giving up the ability to deal with your data from standard tools like cqlsh, IMO.

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: major version release schedule
I think there are a couple of different ideas at play here:

1) Time to release
2) Quality of the release

IMO, the issue that affects most people is the quality of the release. So when someone says we should slow down the release cycle, I think what they mean is that we should spend more time improving the quality of the releases. If there were a process that could ensure the quality of a release, especially the newly added features, I don't think anyone would complain about having quick releases.

-- Drew

On Dec 20, 2011, at 4:15 PM, Peter Schuller wrote:

Until recently we were working hard to reach a set of goals that culminated in a 1.0 release. I'm not sure we've had a formal discussion on it, but just talking to people, there seems to be consensus around the idea that we're now shifting our goals and priorities around some (usability, stability, etc.). If that's the case, I think we should at least be open to reevaluating our release process and schedule accordingly (whether that means lengthening, shortening, and/or simply shifting the barrier to entry for stable updates).

Personally, I am all for added stability, quality, and testing. But I don't see how a decreased release frequency will cause more stability. It may be that decreased release frequency is the necessary *result* of more stability, but I don't think the causality points in the other direction, unless developers ship things early just to get them into the release. But also keep in mind: if we reach a point where major users of Cassandra need to run on significantly divergent versions of Cassandra because the release is just too old, the normal mainstream release will end up getting even less testing.

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)