Re: Proposal: freeze Thrift starting with 2.1.0
Hi Ed, I agree Solr is deeply integrated into DSE. I've looked at Solandra in the past and studied the code. My understanding is DSE uses Cassandra for storage and the user has both APIs available. I do think it can be integrated further to make moderate-to-complex queries easier and probably faster. That's why we built our own JPA-like object query API. I would love to see Cassandra get to the point where users can define complex queries with subqueries, LIKE, GROUP BY and joins. Clearly lots of people want these features, and even Google built their own tools to do these types of queries. I see lots of people trying to improve this with Presto, Impala, Drill, etc. To me, it's a natural progression as NoSQL databases mature. For most people, at some point you want to be able to report on and analyze the data. Today some people use MapReduce to summarize the data and ETL it into a relational or OLAP database for reporting. Even though I don't need CAS or atomic batch for what I do in Cassandra today, I'm sure in the future they will be handy. From my experience in the financial and insurance sectors, features like CAS and SELECT FOR UPDATE are important for the kinds of transactions they handle. I'm biased: these kinds of features are useful and a good addition to Cassandra. These are interesting times in database land!

On Tue, Mar 11, 2014 at 10:57 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Peter, Solr is deeply integrated into DSE. Seemingly this cannot efficiently be done client-side (CQL/Thrift, whatever), but the Solandra approach was to embed Solr in Cassandra. I think that is actually the future of client dev: allowing users to embed custom server-side logic into their own API. Things like this take a while. Back in the day no one wanted Cassandra to be heavy-weight, and ideas like read-before-write operations were rejected. The common advice was to do them client-side. Now, in the case of collections, they sometimes do read-before-write, and it is the stuff users want.
On Tue, Mar 11, 2014 at 10:07 PM, Peter Lin wool...@gmail.com wrote: I'll give you a concrete example. One of the things we often need to do is a keyword search on unstructured text. What we did in our tooling is combine Solr with Cassandra, but we put an object API in front of it. The API is inspired by JPA, but designed specifically to fit our needs. The user can do queries with LIKE %blah%, and behind the scenes we issue a query to Solr to find the keys and then query Cassandra for the records. With plain Cassandra, the developer has to do all of this manually and integrate Solr. Then they have to know which system to query and in what order. Our tooling lets the user define the schema in a modeler. Once the model is done, it compiles the classes, configuration files, data access objects and unit tests. When the application makes a call, our query classes handle the details behind the scenes. I know lots of people would like to see Solr integrated more deeply into Cassandra and CQL. I hope it happens in the future. If DataStax accepts my talk, we will be showing our temporal database and modeler in September.

On Tue, Mar 11, 2014 at 9:54 PM, Steven A Robenalt srobe...@stanford.edu wrote: I should add that I'm not trying to ignite a flame war, just trying to understand your intentions.

On Tue, Mar 11, 2014 at 6:50 PM, Steven A Robenalt srobe...@stanford.edu wrote: Okay, I'm officially lost on this thread. If you plan on forking Cassandra to preserve and continue to enhance the Thrift interface, would you also want to add a bunch of relational features to CQL as part of that same fork?

On Tue, Mar 11, 2014 at 6:20 PM, Edward Capriolo edlinuxg...@gmail.com wrote: One of the things I'd like to see happen is for Cassandra to support queries with disjunction, EXISTS, subqueries, joins and LIKE. In theory CQL could support these features in the future. Cassandra would need a new query compiler and query planner.
I don't see how the current design could do these things without a significant redesign/enhancement. In a past life, I implemented an inference rule engine, so I've spent over a decade studying and implementing query optimizers. All of these things can be done; it's just a matter of people finding the time to do it. I see what you're saying. CQL started as a way to make slices easier, but it is not even a full query language; retrofitting these things is going to be very hard.

On Tue, Mar 11, 2014 at 7:45 PM, Peter Lin wool...@gmail.com wrote: I have no problem maintaining my own fork :) or joining others forking Cassandra. I'd be happy to work with you or anyone else to add features to Thrift. That's the great thing about open source. Each person can scratch a technical itch and do what they love. I see lots of potential for Cassandra, and much of it involves improving Thrift to make it happen. Some of the features in theory could be done in
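The two-step lookup Peter describes earlier in the thread (ask Solr for matching keys, then fetch the rows from Cassandra) can be sketched like this. This is a toy, in-memory illustration of the pattern only, not his actual tooling: the `search_index` dict stands in for a Solr query and `row_store` for a Cassandra read, and all names here are invented for the example.

```python
import re

# Toy stand-ins for the two systems: a keyword index (Solr's role)
# and a key-value row store (Cassandra's role).
row_store = {
    "doc1": {"title": "Q1 claims report", "body": "flood damage claims rose"},
    "doc2": {"title": "Q2 claims report", "body": "auto claims were flat"},
    "doc3": {"title": "Ops notes",        "body": "cluster rebalance done"},
}
search_index = {}  # keyword -> set of row keys
for key, row in row_store.items():
    for word in re.findall(r"\w+", (row["title"] + " " + row["body"]).lower()):
        search_index.setdefault(word, set()).add(key)

def like_query(term):
    """LIKE %term% against the index first, then fetch full rows by key."""
    # Step 1: ask the index for matching keys (the Solr query).
    keys = {k for word, ks in search_index.items() if term in word for k in ks}
    # Step 2: fetch the records for those keys (the Cassandra reads).
    return [row_store[k] for k in sorted(keys)]

rows = like_query("claim")
```

The point of the object API in front is that the caller never sees this two-step dance; it just issues the LIKE query.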
Re: Proposal: freeze Thrift starting with 2.1.0
"I would love to see Cassandra get to the point where users can define complex queries with subqueries, LIKE, GROUP BY and joins" -- did you have a look at Intravert? I think it does union and intersection on the server side for you. Not sure about joins, though.
Re: Proposal: freeze Thrift starting with 2.1.0
Yes, I was looking at Intravert last night. For the kinds of reports my customers ask us to do, joins and subqueries are important. Having tried to do a simple join in Pig, the level of pain is high. I'm a masochist, so I don't mind breaking a simple join into multiple MR tasks, though I do find myself asking why the hell it needs to be so painful in Pig. Many of my friends say "what is this crap!" or "this is better than writing SQL queries to run reports?" Plus, using ETL techniques to extract summaries only works for cases where the data is small enough. Once it gets beyond a certain size, it's not practical, which means we're back to crappy reporting languages that make life painful. Lots of big healthcare companies have thousands of MOLAP cubes on dozens of mainframes. The old OLTP -> DW/OLAP pipeline creates its own set of management headaches. Being able to report directly on the raw data avoids many of the issues, but that's my biased perspective.
Re: Proposal: freeze Thrift starting with 2.1.0
just when you thought the thread died... First, let me say we are *WAY* off topic. But that is a good thing. I love this community because there are a ton of passionate, smart people (often with differing perspectives ;)

RE: Reporting against C* (@Peter Lin) We've had the same experience. Pig + Hadoop is painful. We are experimenting with Spark/Shark, operating directly against the data. http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html The Shark layer gives you SQL and caching capabilities that make it easy to use and fast (for smaller data sets). In front of this, we are going to add dimensional aggregations so we can operate at larger scales (then the Hive reports will run against the aggregations).

RE: REST Server (@Russel Bradbury) We had moderate success with Virgil, which was a REST server built directly on Thrift. We built it directly on top of Thrift so that one day it could be easily embedded in the C* server itself. It could be deployed separately, or run an embedded C*. More often than not, we ended up running it separately to separate the layers (just like Titan and Rexster). I've started on a rewrite of Virgil called Memnon that rides on top of CQL. (I'd love some help) https://github.com/boneill42/memnon

RE: CQL vs. Thrift We've hitched our wagons to CQL. CQL != Relational. We've had success translating our "native" schemas into CQL, including all the NoSQL goodness of wide rows, etc. You just need a good understanding of how things translate into storage and the underlying CFs. If anything, I think we could add some DESCRIBE information, which would help users with this, along the lines of https://issues.apache.org/jira/browse/CASSANDRA-6676. CQL does open up the *opportunity* for users to articulate more complex queries using more familiar syntax (including future things such as joins, grouping, etc.). To me, that is exciting, and again one of the reasons we are leaning on it.
my two cents, brian
---
Brian O'Neill
Chief Technology Officer
Health Market Science
The Science of Better Results
2700 Horizon Drive, King of Prussia, PA 19406
M: 215.588.6024 | @boneill42 http://www.twitter.com/boneill42 | healthmarketscience.com

This information transmitted in this email message is for the intended recipient only and may contain confidential and/or privileged material. If you received this email in error and are not the intended recipient, or the person responsible to deliver it to the intended recipient, please contact the sender at the email above and delete this email and any attachments and destroy any copies thereof. Any review, retransmission, dissemination, copying or other use of, or taking any action in reliance upon, this information by persons or entities other than the intended recipient is strictly prohibited.
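Brian's "dimensional aggregations in front of the raw data" boils down to maintaining pre-computed group-by rollups so that reports read a small summary instead of scanning the full data set. A minimal sketch of the idea in plain Python, with hypothetical event fields (in practice the events would live in Cassandra and the rollups would be maintained by the Spark/Hive layer):

```python
from collections import defaultdict

# Raw events (in practice, rows streaming into Cassandra).
events = [
    {"state": "PA", "product": "auto", "amount": 100},
    {"state": "PA", "product": "home", "amount": 250},
    {"state": "NY", "product": "auto", "amount": 300},
    {"state": "PA", "product": "auto", "amount": 50},
]

# Pre-aggregate along the dimensions reports will group by, so report
# queries read the small rollup rather than scanning raw events.
rollup = defaultdict(lambda: {"count": 0, "total": 0})
for e in events:
    for dims in [(e["state"],), (e["state"], e["product"])]:
        cell = rollup[dims]
        cell["count"] += 1
        cell["total"] += e["amount"]

# A "report query" is now a dictionary lookup, not a scan.
pa_auto = rollup[("PA", "auto")]
```

The trade-off, as the thread notes, is that rollups only answer the group-bys you anticipated; ad-hoc joins still need the raw data.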
Re: Proposal: freeze Thrift starting with 2.1.0
I would love to help with the REST interface; however, my point was not to add REST into Cassandra. My point was that if we had an abstract interface that even CQL used to access data, and this interface were made available to other drop-in modules, then the project becomes extensible as a whole. You get CQL out of the box, but it allows others to create interface projects of their own and keep them up without putting the burden of that maintenance on the core developers. It could also mean that down the line, say if CQL stops working out, like Avro and Thrift before it, then pulling it out would be less of a problem. We can even get all cowboy up in here and put CQL in its own project that can grow by itself, as long as an interface in the Cassandra project is made available.
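The abstract-interface idea above (CQL itself consuming the same data-access contract that drop-in modules such as a REST server would) could look roughly like this. All names here are invented for illustration; nothing called `StorageInterface` exists in the Cassandra codebase, and a real version would expose slices, batches, and consistency levels, not a bare get/put.

```python
from abc import ABC, abstractmethod

class StorageInterface(ABC):
    """Hypothetical contract both CQL and drop-in modules would code against."""
    @abstractmethod
    def get(self, table, key): ...
    @abstractmethod
    def put(self, table, key, row): ...

class InMemoryStorage(StorageInterface):
    """Stand-in engine; the real one would be Cassandra's storage layer."""
    def __init__(self):
        self.tables = {}
    def get(self, table, key):
        return self.tables.get(table, {}).get(key)
    def put(self, table, key, row):
        self.tables.setdefault(table, {})[key] = row

class RestModule:
    """A drop-in API layer: knows nothing about the engine behind the interface."""
    def __init__(self, storage: StorageInterface):
        self.storage = storage
    def handle_get(self, path):
        _, table, key = path.split("/")  # e.g. "/users/42"
        return self.storage.get(table, key)

store = InMemoryStorage()
store.put("users", "42", {"name": "ada"})
api = RestModule(store)
result = api.handle_get("/users/42")
```

The appeal is exactly what the message argues: a module like this (or CQL itself) could be maintained, swapped, or removed without touching the engine behind the interface.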
Re: Proposal: freeze Thrift starting with 2.1.0
I'm enjoying the discussion also. @Brian: I've been looking at Spark/Shark along with other recent developments over the last few years. Berkeley has been doing some interesting stuff. One reason I like Thrift is type safety and the benefits it brings for query validation and query optimization. One could do similar things with CQL, but it's just more work, especially with dynamic columns. I know others are mixing static with dynamic columns, so I'm not alone. I have no clue how long it will take to get there, but having tools like query explanation is a big time saver. Writing business reports is hard enough, so every bit of help the tool can provide makes it less painful.
Re: Proposal: freeze Thrift starting with 2.1.0
Speaking as a CQL driver maintainer (Ruby), I'm +1 for end-of-lining Thrift. I agree with Edward that it's unfortunate that there are no official drivers being maintained by the Cassandra maintainers -- even though the current state with the DataStax drivers is in practice very close (it is not the same thing, though). However, I don't agree that not having drivers in the same repo/project is a problem. Whether or not there's a Java driver in the Cassandra source doesn't matter at all to us non-Java developers, and I don't see any difference between the situation where there's no driver in the source and one where there's just a Java driver. I might have misunderstood Edward's point about this, though. The CQL protocol is the key, as others have mentioned. As long as that is maintained and respected, I think it's absolutely fine not having any drivers shipped as part of Cassandra. However, I feel this has not been the case lately. I'm thinking particularly about the UDT feature of 2.1, which is not part of the CQL spec. There is no documentation on how drivers should handle UDTs or what a user should be able to expect from a driver; they're completely implemented as custom types. I hope this will be fixed before 2.1 is released (and there have been good discussions on the mailing lists about how a driver should handle UDTs), but it shows a problem with the the-spec-is-the-truth argument. I think we'll be fine as long as the spec is the truth, but that requires the spec to actually be the truth and new features not to be bolted on outside of the spec. T#
Re: NetworkTopologyStrategy ring distribution across 2 DC
Thanks. The error is gone if I specify the keyspace name. However, the replica count in the ring output is not correct. Shouldn't it say 3, because I have DC1:3, DC2:3 in my schema? thanks Ramesh

Datacenter: DC1
==========
Replicas: 2

Address        Rack  Status  State   Load     Owns    Token
                                                      -9223372036854775808
192.168.1.107  RAC1  Up      Normal  4.72 MB  42.86%  6588122883467697004
192.168.1.106  RAC1  Up      Normal  4.73 MB  42.86%  3952873730080618202
192.168.1.105  RAC1  Up      Normal  4.8 MB   42.86%  1317624576693539400
192.168.1.104  RAC1  Up      Normal  4.77 MB  42.86%  -1317624576693539402
192.168.1.103  RAC1  Up      Normal  4.83 MB  42.86%  -3952873730080618204
192.168.1.102  RAC1  Up      Normal  4.69 MB  42.86%  -6588122883467697006
192.168.1.101  RAC1  Up      Normal  4.8 MB   42.86%  -9223372036854775808

Datacenter: DC2
==========
Replicas: 2

Address        Rack  Status  State   Load     Owns    Token
                                                      3952873730080618203
192.168.1.111  RAC1  Up      Normal  4.73 MB  42.86%  -1317624576693539401
192.168.1.110  RAC1  Up      Normal  4.79 MB  42.86%  -3952873730080618203
192.168.1.109  RAC1  Up      Normal  3.16 MB  42.86%  -6588122883467697005
192.168.1.108  RAC1  Up      Normal  3.22 MB  42.86%  -9223372036854775807
192.168.1.114  RAC1  Up      Normal  4.69 MB  42.86%  6588122883467697005
192.168.1.112  RAC1  Up      Normal  4.76 MB  42.86%  1317624576693539401
192.168.1.113  RAC1  Up      Normal  3.19 MB  42.86%  3952873730080618203

On Tue, Mar 11, 2014 at 7:24 PM, Tyler Hobbs ty...@datastax.com wrote: On Tue, Mar 11, 2014 at 1:37 PM, Ramesh Natarajan rames...@gmail.com wrote: Note: Ownership information does not include topology; for complete information, specify a keyspace Also the owns column is 0% for the second DC. Is this normal? Yes. Without a keyspace specified, the Owns column is showing the equivalent of SimpleStrategy with replication_factor=1. If you specify a keyspace, it will take the replication strategy and options into account. -- Tyler Hobbs DataStax http://datastax.com/
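For reference, the replication settings being discussed and the per-keyspace ring check would look something like this (keyspace name hypothetical):

```sql
-- Hypothetical keyspace name; replication matches the schema described
-- above: three replicas in each datacenter.
CREATE KEYSPACE myks WITH replication =
  {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};

-- With a keyspace argument, nodetool takes that strategy into account:
--   nodetool ring myks
```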
Re: Proposal: freeze Thrift starting with 2.1.0
@Theo I totally understand that. Spending time to maintain support for 2 different protocols is a significant overhead. From my own experience contributing to open source projects, time is the biggest limiting factor. My biased perspective: CQL can be extended with additional features so that query validation and optimization is easier. If we look at the history of RDBMS and the development of query planners/optimizers, having the type metadata is important. RDBMS don't have to deal with dynamic columns, since the schema is static. Even then, there are dozens of papers from researchers and implementers on how to optimize a query plan. If we look at data grid products, we see a similar thing. Coherence gives users the ability to query their key/value data and get a query plan. I hope projects like Presto, Impala, etc. will provide these features eventually. I favor thrift for a simple reason. My modeling tool and framework retains the type information, so that makes it easier to build query optimizers. I realize not everyone cares about this kind of stuff or has to write complex reports. I'm not suggesting others spend their valuable time improving thrift. At the same time, if I'm willing to work on thrift and the enhancements are acceptable to others, then Cassandra should include them. If not, I'm happy to fork Cassandra and do my own thing. I can't be the only person that needs to do complex reports. peter On Wed, Mar 12, 2014 at 11:20 AM, Theo Hultberg t...@iconara.net wrote: Speaking as a CQL driver maintainer (Ruby) I'm +1 for end-of-lining Thrift. 
Re: Proposal: freeze Thrift starting with 2.1.0
IME/O one of the best things about Cassandra was the separation of (and I'm over-simplifying a bit, but still):
- The transport/API layer
- The Datacenter layer
- The Storage layer
I don't think we're well-served by the construction kit approach. It's difficult enough to evaluate NoSQL without deciding if you should run CQLSandra or Hectorsandra or Intravertandra etc. In tree, or even documented, I agree completely. I've never argued CQL3 is not the best approach for new users. But I've been around long enough that I know precisely what I want to do sometimes, and any general purpose API will get in the way of that. I would like the transport/API layer to at least remain pluggable (hackable if you will) in its current form. I really just want to be able to create my own *Daemon - as I can now - and go on my merry way without having to modify any internals. Much like with compaction strategies and SSTable components. Do you intend to change this current behavior of allowing a custom transport without code modification? (as opposed to changing the daemon class in a script?).
Re: Proposal: freeze Thrift starting with 2.1.0
Great points about the CQL driver and the supposed spec. It shows how a driver living outside the project poses a problem to open source development. How could custom types have been implemented without a spec? In the Apache world the saying is "If it did not happen on the list, it did not happen." Did that happen here? I still do not understand how an open source Apache Java database can rely on third party client software to connect to said database. However, the committers seem comfortable with this arrangement, to the point that they are willing to remove support for the other way to connect to the database. Again, I am glad that the project has officially ended support for thrift with this clear decree. For years the project kept saying "Thrift is not going anywhere." It was obviously meant literally, as in the project would do the absolute minimum to support it until they could make the case to remove it completely. On Wed, Mar 12, 2014 at 11:20 AM, Theo Hultberg t...@iconara.net wrote: Speaking as a CQL driver maintainer (Ruby) I'm +1 for end-of-lining Thrift. 
Re: Proposal: freeze Thrift starting with 2.1.0
I agree that we are way off the initial topic, but I think we are spot on the most important topic. As seen in various tickets, including #6704 (wide row scanners), #6167 (end-slice termination predicate), the existence of intravert-ug (Cassandra interface to intravert), and a number of others, there is an increasing desire to do more complicated processing, server-side, on a Cassandra cluster. I very much share those goals, and would like to propose the following only partially hand-wavey path forward. Instead of creating a pluggable interface for Thrift, I'd like to create a pluggable interface for arbitrary app-server deep integration. Inspired by both the existence of intravert-ug, as well as there being a long history of various parties embedding tomcat or jetty servlet engines inside Cassandra, I'd like to propose the creation of an internal, somewhat stable (versioned?) interface that could allow any app server to achieve deep integration with Cassandra. As a result, these servers could 1) host their own APIs (REST, for example) and 2) extend core functionality by having limited (see triggers and wide row scanners) access to the internals of Cassandra. The hand-wavey part comes in because, while I have been mulling this over for a while, I have not spent any significant time looking at the actual surface area of intravert-ug's integration. But, using it as a model, and also keeping in mind the general needs of your more traditional servlet/j2ee containers, I believe we could come up with a reasonable interface to allow any JVM app server to be integrated and maintained in or out of the Cassandra tree. This would satisfy the need that many of us (both Ed and I, for example) have for a much greater degree of control over server-side execution, and let us start building much more interesting (and simpler) tiered applications. Anybody interested in working on a coherent proposal with me? 
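As a purely illustrative sketch of the kind of versioned embedding contract described above (every name below is invented for illustration; none of this is an existing Cassandra API):

```java
// Hypothetical sketch only -- all identifiers here are made up.

// Minimal view of the internals a host daemon might expose to an
// embedded app server (real access would be wider but still limited,
// as with triggers and wide row scanners).
interface DaemonContext {
    String clusterName();
}

// The versioned contract an embedded app server would implement.
interface EmbeddedAppServer {
    int interfaceVersion();           // bumped when the contract changes
    void start(DaemonContext ctx);    // called once internals are ready
    void stop();
}

public class EmbeddingSketch {
    // A toy "REST server" implementation of the contract.
    static class RestServer implements EmbeddedAppServer {
        boolean running = false;
        String cluster = null;
        public int interfaceVersion() { return 1; }
        public void start(DaemonContext ctx) {
            cluster = ctx.clusterName();
            running = true;
        }
        public void stop() { running = false; }
    }

    public static void main(String[] args) {
        RestServer rest = new RestServer();
        // DaemonContext is a single-method interface, so a lambda works.
        rest.start(() -> "Test Cluster");
        if (!rest.running || !"Test Cluster".equals(rest.cluster))
            throw new AssertionError("embedding contract not honored");
        rest.stop();
        System.out.println("sketch ok: v" + rest.interfaceVersion());
    }
}
```

An app server implementing a contract like this could be loaded in or out of the Cassandra tree, and the version number gives both sides a way to refuse incompatible combinations.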
-Tupshin On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill b...@alumni.brown.edu wrote: just when you thought the thread died... 
Re: Proposal: freeze Thrift starting with 2.1.0
@Nate I don't want to change the separation of components in cassandra. My ultimate goal is to make writing complex queries less painful and more efficient. How that becomes reality is anyone's guess. There's different ways to get there. I also like having a pluggable transport layer, which is why I feel sad every time I hear people say "thrift is dead", "thrift is frozen beyond 2.1", or "don't use thrift". When people ask me what to learn with Cassandra, I say both thrift and CQL. Not everyone has time to read the native protocol spec or dive into cassandra code, but clearly some people do and enjoy it. I understand some people don't want the burden of maintaining Thrift, and that's totally valid. It's up to those that want to keep thrift to make sure patches and enhancements are well tested and solid. On Wed, Mar 12, 2014 at 11:52 AM, Nate McCall n...@thelastpickle.com wrote: IME/O one of the best things about Cassandra was the separation of (and I'm over-simplifying a bit, but still): - The transport/API layer - The Datacenter layer - The Storage layer 
Re: Proposal: freeze Thrift starting with 2.1.0
@Nate, @Tupshin, this is pretty close to what I had in mind. I would be open to helping out with a formal proposal. On March 12, 2014 at 12:11:41 PM, Tupshin Harper (tups...@tupshin.com) wrote: I agree that we are way off the initial topic, but I think we are spot on the most important topic. 
Re: Proposal: freeze Thrift starting with 2.1.0
Peter, I didn't specifically call it out, but the interface I just proposed in my last email would be very much with the goal of "make writing complex queries less painful and more efficient" by providing a deep integration mechanism to host that code. It's very much an "enough rope to hang ourselves" approach, but badly needed, IMO -Tupshin On Mar 12, 2014 12:12 PM, Peter Lin wool...@gmail.com wrote: @Nate I don't want to change the separation of components in cassandra. 
Re: Proposal: freeze Thrift starting with 2.1.0
@Nate Tupshin, glad to help where I can On Wed, Mar 12, 2014 at 12:14 PM, Russell Bradberry rbradbe...@gmail.comwrote: @Nate, @Tupshin, this is pretty close to what I had in mind. I would be open to helping out with a formal proposal. On March 12, 2014 at 12:11:41 PM, Tupshin Harper (tups...@tupshin.com) wrote: I agree that we are way off the initial topic, but I think we are spot on the most important topic. As seen in various tickets, including #6704 (wide row scanners), #6167 (end-slice termination predicate), the existence of intravert-ug (Cassandra interface to intravert), and a number of others, there is an increasing desire to do more complicated processing, server-side, on a Cassandra cluster. I very much share those goals, and would like to propose the following only partially hand-wavey path forward. Instead of creating a pluggable interface for Thrift, I'd like to create a pluggable interface for arbitrary app-server deep integration. Inspired by both the existence of intravert-ug, as well as there being a long history of various parties embedding tomcat or jetty servlet engines inside Cassandra, I'd like to propose the creation an internal somewhat stable (versioned?) interface that could allow any app server to achieve deep integration with Cassandra, and as a result, these servers could 1) host their own apis (REST, for example 2) extend core functionality by having limited (see triggers and wide row scanners) access to the internals of cassandra The hand wavey part comes because while I have been mulling this about for a while, I have not spent any significant time into looking at the actual surface area of intravert-ug's integration. But, using it as a model, and also keeping in minds the general needs of your more traditional servlet/j2ee containers, I believe we could come up with a reasonable interface to allow any jvm app server to be integrated and maintained in or out of the Cassandra tree. 
This would satisfy the needs that many of us (Both Ed and I, for example) to have a much greater degree of control over server side execution, and to be able to start building much more interestingly (and simply) tiered applications. Anybody interested in working on a coherent proposal with me? -Tupshin On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill b...@alumni.brown.eduwrote: just when you thought the thread died... First, let me say we are *WAY* off topic. But that is a good thing. I love this community because there are a ton of passionate, smart people. (often with differing perspectives ;) RE: Reporting against C* (@Peter Lin) We've had the same experience. Pig + Hadoop is painful. We are experimenting with Spark/Shark, operating directly against the data. http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html The Shark layer gives you SQL and caching capabilities that make it easy to use and fast (for smaller data sets). In front of this, we are going to add dimensional aggregations so we can operate at larger scales. (then the Hive reports will run against the aggregations) RE: REST Server (@Russel Bradbury) We had moderate success with Virgil, which was a REST server built directly on Thrift. We built it directly on top of Thrift, so one day it could be easily embedded in the C* server itself. It could be deployed separately, or run an embedded C*. More often than not, we ended up running it separately to separate the layers. (just like Titan and Rexster) I've started on a rewrite of Virgil called Memnon that rides on top of CQL. (I'd love some help) https://github.com/boneill42/memnon RE: CQL vs. Thrift We've hitched our wagons to CQL. CQL != Relational. We've had success translating our native schemas into CQL, including all the NoSQL goodness of wide-rows, etc. You just need a good understanding of how things translate into storage and underlying CFs. 
If anything, I think we could add some DESCRIBE information, which would help users with this, along the lines of: (https://issues.apache.org/jira/browse/CASSANDRA-6676) CQL does open up the *opportunity* for users to articulate more complex queries using more familiar syntax. (including future things such as joins, grouping, etc.) To me, that is exciting, and again -- one of the reasons we are leaning on it. my two cents, brian

---
Brian O'Neill
Chief Technology Officer
*Health Market Science*
*The Science of Better Results*
2700 Horizon Drive * King of Prussia, PA * 19406
M: 215.588.6024 * @boneill42 http://www.twitter.com/boneill42 * healthmarketscience.com

This information transmitted in this email message is for the intended recipient only and may contain confidential and/or privileged material. If you received this email in error and are not the intended recipient, or the person responsible to deliver it to the intended recipient, please contact the sender at the email
Re: Proposal: freeze Thrift starting with 2.1.0
@Tupshin LOL, there's always enough rope to hang oneself. I agree it's badly needed for folks that really do need more messy queries. I was just discussing a similar concept with a co-worker and going over the pros/cons of various approaches to realizing the goal. I'm still digging into Presto. I saw some people are working on support for Cassandra in Presto.

On Wed, Mar 12, 2014 at 12:15 PM, Tupshin Harper tups...@tupshin.com wrote: Peter, I didn't specifically call it out, but the interface I just proposed in my last email would be very much with the goal of making writing complex queries less painful and more efficient, by providing a deep integration mechanism to host that code. It's very much an enough-rope-to-hang-ourselves approach, but badly needed, IMO -Tupshin

On Mar 12, 2014 12:12 PM, Peter Lin wool...@gmail.com wrote: @Nate I don't want to change the separation of components in cassandra. My ultimate goal is to make writing complex queries less painful and more efficient. How that becomes reality is anyone's guess. There are different ways to get there. I also like having a pluggable transport layer, which is why I feel sad every time I hear people say "thrift is dead" or "thrift is frozen beyond 2.1" or "don't use thrift". When people ask me what to learn with Cassandra, I say both Thrift and CQL. Not everyone has time to read the native protocol spec or dive into cassandra code, but clearly some people do and enjoy it. I understand some people don't want the burden of maintaining Thrift, and that's totally valid. It's up to those that want to keep Thrift to make sure patches and enhancements are well tested and solid.

On Wed, Mar 12, 2014 at 11:52 AM, Nate McCall n...@thelastpickle.com wrote: IME/O one of the best things about Cassandra was the separation of (and I'm over-simplifying a bit, but still):
- The transport/API layer
- The Datacenter layer
- The Storage layer
I don't think we're well-served by the construction kit approach.
It's difficult enough to evaluate NoSQL without deciding if you should run CQLSandra or Hectorsandra or Intravertandra etc. In tree, or even documented, I agree completely. I've never argued CQL3 is not the best approach for new users. But I've been around long enough that I know precisely what I want to do sometimes, and any general-purpose API will get in the way of that. I would like the transport/API layer to at least remain pluggable (hackable, if you will) in its current form. I really just want to be able to create my own *Daemon - as I can now - and go on my merry way without having to modify any internals. Much like with compaction strategies and SSTable components. Do you intend to change this current behavior of allowing a custom transport without code modification? (as opposed to changing the daemon class in a script?).
Re: Proposal: freeze Thrift starting with 2.1.0
OK, so I'm greatly encouraged by the level of interest in this. I went ahead and created https://issues.apache.org/jira/browse/CASSANDRA-6846, and will be starting to look into what the interface would have to look like. Anybody feel free to continue the discussion here, email me privately, or comment on the ticket with your thoughts. -Tupshin

On Wed, Mar 12, 2014 at 12:21 PM, Peter Lin wool...@gmail.com wrote: @Tupshin LOL, there's always enough rope to hang oneself. I agree it's badly needed for folks that really do need more messy queries. I was just discussing a similar concept with a co-worker and going over the pros/cons of various approaches to realizing the goal. I'm still digging into Presto. I saw some people are working on support for Cassandra in Presto.

On Wed, Mar 12, 2014 at 12:15 PM, Tupshin Harper tups...@tupshin.com wrote: Peter, I didn't specifically call it out, but the interface I just proposed in my last email would be very much with the goal of making writing complex queries less painful and more efficient, by providing a deep integration mechanism to host that code. It's very much an enough-rope-to-hang-ourselves approach, but badly needed, IMO -Tupshin

On Mar 12, 2014 12:12 PM, Peter Lin wool...@gmail.com wrote: @Nate I don't want to change the separation of components in cassandra. My ultimate goal is to make writing complex queries less painful and more efficient. How that becomes reality is anyone's guess. There are different ways to get there. I also like having a pluggable transport layer, which is why I feel sad every time I hear people say "thrift is dead" or "thrift is frozen beyond 2.1" or "don't use thrift". When people ask me what to learn with Cassandra, I say both Thrift and CQL. Not everyone has time to read the native protocol spec or dive into cassandra code, but clearly some people do and enjoy it. I understand some people don't want the burden of maintaining Thrift, and that's totally valid.
It's up to those that want to keep Thrift to make sure patches and enhancements are well tested and solid.

On Wed, Mar 12, 2014 at 11:52 AM, Nate McCall n...@thelastpickle.com wrote: IME/O one of the best things about Cassandra was the separation of (and I'm over-simplifying a bit, but still):
- The transport/API layer
- The Datacenter layer
- The Storage layer
I don't think we're well-served by the construction kit approach. It's difficult enough to evaluate NoSQL without deciding if you should run CQLSandra or Hectorsandra or Intravertandra etc. In tree, or even documented, I agree completely. I've never argued CQL3 is not the best approach for new users. But I've been around long enough that I know precisely what I want to do sometimes, and any general-purpose API will get in the way of that. I would like the transport/API layer to at least remain pluggable (hackable, if you will) in its current form. I really just want to be able to create my own *Daemon - as I can now - and go on my merry way without having to modify any internals. Much like with compaction strategies and SSTable components. Do you intend to change this current behavior of allowing a custom transport without code modification? (as opposed to changing the daemon class in a script?).
Re: Proposal: freeze Thrift starting with 2.1.0
Awesome! Thanks Tupshin (and everyone else). I'll put some of my thoughts up there shortly. On Wed, Mar 12, 2014 at 11:26 AM, Tupshin Harper tups...@tupshin.comwrote: OK, so I'm greatly encouraged by the level of interest in this. I went ahead and created https://issues.apache.org/jira/browse/CASSANDRA-6846, and will be starting to look into what the interface would have to look like. Anybody feel free to continue the discussion here, email me privately, or comment on ticket with your thoughts. -Tupshin
Re: Proposal: freeze Thrift starting with 2.1.0
@Tupshin I like that approach; right now I think of that piece as the StorageProxy. I agree, over the years people have taken that approach. Solandra is a good example, and I am guessing DSE Solr works this way. This says something about the entire Thrift vs CQL thing, as there are clearly power users writing applications that use neither. I do feel this vote was called to shoot down any attempt to add a feature that was non-CQL. However, if you think you can drive something like this forward, more power to you; I will help out.

On Wed, Mar 12, 2014 at 12:11 PM, Tupshin Harper tups...@tupshin.com wrote: I agree that we are way off the initial topic, but I think we are spot on the most important topic. As seen in various tickets, including #6704 (wide row scanners), #6167 (end-slice termination predicate), the existence of intravert-ug (Cassandra interface to intravert), and a number of others, there is an increasing desire to do more complicated processing, server-side, on a Cassandra cluster. I very much share those goals, and would like to propose the following only partially hand-wavey path forward. Instead of creating a pluggable interface for Thrift, I'd like to create a pluggable interface for arbitrary app-server deep integration. Inspired by both the existence of intravert-ug, as well as the long history of various parties embedding Tomcat or Jetty servlet engines inside Cassandra, I'd like to propose the creation of an internal, somewhat stable (versioned?) interface that could allow any app server to achieve deep integration with Cassandra. As a result, these servers could 1) host their own APIs (REST, for example), and 2) extend core functionality by having limited (see triggers and wide row scanners) access to the internals of Cassandra. The hand-wavey part comes in because, while I have been mulling this over for a while, I have not spent any significant time looking at the actual surface area of intravert-ug's integration.
But, using it as a model, and also keeping in mind the general needs of your more traditional servlet/J2EE containers, I believe we could come up with a reasonable interface to allow any JVM app server to be integrated and maintained in or out of the Cassandra tree. This would satisfy the need that many of us (both Ed and I, for example) have for a much greater degree of control over server-side execution, and let us start building much more interesting (and simpler) tiered applications. Anybody interested in working on a coherent proposal with me? -Tupshin

On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill b...@alumni.brown.edu wrote: just when you thought the thread died... First, let me say we are *WAY* off topic. But that is a good thing. I love this community because there are a ton of passionate, smart people. (often with differing perspectives ;)

RE: Reporting against C* (@Peter Lin) We've had the same experience. Pig + Hadoop is painful. We are experimenting with Spark/Shark, operating directly against the data. http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html The Shark layer gives you SQL and caching capabilities that make it easy to use and fast (for smaller data sets). In front of this, we are going to add dimensional aggregations so we can operate at larger scales. (then the Hive reports will run against the aggregations)

RE: REST Server (@Russell Bradberry) We had moderate success with Virgil, a REST server built directly on top of Thrift so that one day it could be easily embedded in the C* server itself. It could be deployed separately, or run an embedded C*. More often than not, we ended up running it separately to separate the layers. (just like Titan and Rexster) I've started on a rewrite of Virgil called Memnon that rides on top of CQL. (I'd love some help) https://github.com/boneill42/memnon

RE: CQL vs. Thrift We've hitched our wagons to CQL. CQL != Relational.
We've had success translating our native schemas into CQL, including all the NoSQL goodness of wide rows, etc. You just need a good understanding of how things translate into storage and underlying CFs. If anything, I think we could add some DESCRIBE information, which would help users with this, along the lines of: (https://issues.apache.org/jira/browse/CASSANDRA-6676) CQL does open up the *opportunity* for users to articulate more complex queries using more familiar syntax. (including future things such as joins, grouping, etc.) To me, that is exciting, and again -- one of the reasons we are leaning on it. my two cents, brian

---
Brian O'Neill
Chief Technology Officer
*Health Market Science*
*The Science of Better Results*
2700 Horizon Drive * King of Prussia, PA * 19406
M: 215.588.6024 * @boneill42 http://www.twitter.com/boneill42 * healthmarketscience.com
Re: Proposal: freeze Thrift starting with 2.1.0
On Wed, Mar 12, 2014 at 9:10 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Again, I am glad that the project has officially ended support for thrift with this clear decree. For years the project kept saying "Thrift is not going anywhere." It was obviously meant literally, like the project would do the absolute minimum to support it until they could make the case to remove it completely.

Yes, I didn't realize at the time, but both meanings of "not going anywhere" were apparently intended. "Not going anywhere" as in not likely to be removed (for another few major versions at least), but also "not going anywhere" as in being the (un/semi/barely-)maintained second-class-citizen API. For the record, I have always presumed that Thrift will eventually be removed from the codebase, so for me this new announcement does not generate new surprise or outrage. Separate cannot be equal, and eventually the pain of keeping it in there will outweigh the pain of deprecating it. Even though I do not use CQL3 or the binary protocol, and the removal of Thrift would force me to do so, having two APIs is so bizarro that I'm left hoping that it *is* eventually deprecated... =Rob
Opscenter help?
I am having a hard time installing the DataStax OpsCenter agents on EL6 and EL5 hosts. Where is an appropriate place to ask for help? DataStax has moved their forums to Stack Exchange, which seems to be a waste of time, as I don't have enough reputation points to properly tag my questions. The agent installation seems to be broken:
[] agent rpm conflicts with sudo
[] install from opscenter does not work, even if manually installing the rpm (requires --force, conflicts with sudo)
[] error message re: log4j #noconf
[] Could not find the main class: opsagent.opsagent. Program will exit.
[] No other (helpful/more in-depth) documentation exists

-- Drew from Zhrodague post-apocalyptic ad-hoc industrialist d...@zhrodague.net
Java heap size does not change on Windows
I am running Windows Server 2008 R2 Enterprise on a 2-core Intel Xeon with 16GB of RAM and I want to change the max heap size. I set MAX_HEAP_SIZE in cassandra-env.sh, but when I start Cassandra, it's still reporting:

INFO 12:37:36,221 Global memtable threshold is enabled at 247MB
INFO 12:37:36,377 using multi-threaded compaction
INFO 12:37:36,705 JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.7.0_51
INFO 12:37:36,705 Heap size: 1037959168/1037959168

My question is: how do I change the heap size? Lukas Steiblys
Re: Java heap size does not change on Windows
cassandra-env.sh is only used on *nix systems. You'll need to change bin/cassandra.bat. Interestingly, that's hardcoded to use a 1G heap, which seems like a bug.

On Wed, Mar 12, 2014 at 2:40 PM, Lukas Steiblys lu...@doubledutch.me wrote: I am running Windows Server 2008 R2 Enterprise on a 2-core Intel Xeon with 16GB of RAM and I want to change the max heap size. I set MAX_HEAP_SIZE in cassandra-env.sh, but when I start Cassandra, it's still reporting:

INFO 12:37:36,221 Global memtable threshold is enabled at 247MB
INFO 12:37:36,377 using multi-threaded compaction
INFO 12:37:36,705 JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.7.0_51
INFO 12:37:36,705 Heap size: 1037959168/1037959168

My question is: how do I change the heap size? Lukas Steiblys

-- Tyler Hobbs DataStax http://datastax.com/
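For reference, the change being suggested amounts to editing the hardcoded JVM options in bin/cassandra.bat. A sketch of what to look for (the exact option layout varies by Cassandra release, and the 8G figure is purely illustrative, not a sizing recommendation):

```bat
REM bin/cassandra.bat -- the heap is pinned inside the JAVA_OPTS block.
REM Find the hardcoded -Xms/-Xmx values (1G in the stock file) and change them:
set JAVA_OPTS=-ea^
 -Xms8G^
 -Xmx8G^
 %JAVA_OPTS%
REM (keep the rest of the existing JAVA_OPTS lines unchanged)
```

After restarting the service, the "Heap size:" line in the startup log should reflect the new value.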
Re: Proposal: freeze Thrift starting with 2.1.0
This brainstorming idea has already been -1'd in JIRA. ROFL.

On Wed, Mar 12, 2014 at 12:26 PM, Tupshin Harper tups...@tupshin.com wrote: OK, so I'm greatly encouraged by the level of interest in this. I went ahead and created https://issues.apache.org/jira/browse/CASSANDRA-6846, and will be starting to look into what the interface would have to look like. Anybody feel free to continue the discussion here, email me privately, or comment on the ticket with your thoughts. -Tupshin

On Wed, Mar 12, 2014 at 12:21 PM, Peter Lin wool...@gmail.com wrote: @Tupshin LOL, there's always enough rope to hang oneself. I agree it's badly needed for folks that really do need more messy queries. I was just discussing a similar concept with a co-worker and going over the pros/cons of various approaches to realizing the goal. I'm still digging into Presto. I saw some people are working on support for Cassandra in Presto.

On Wed, Mar 12, 2014 at 12:15 PM, Tupshin Harper tups...@tupshin.com wrote: Peter, I didn't specifically call it out, but the interface I just proposed in my last email would be very much with the goal of making writing complex queries less painful and more efficient, by providing a deep integration mechanism to host that code. It's very much an enough-rope-to-hang-ourselves approach, but badly needed, IMO -Tupshin

On Mar 12, 2014 12:12 PM, Peter Lin wool...@gmail.com wrote: @Nate I don't want to change the separation of components in cassandra. My ultimate goal is to make writing complex queries less painful and more efficient. How that becomes reality is anyone's guess. There are different ways to get there. I also like having a pluggable transport layer, which is why I feel sad every time I hear people say "thrift is dead" or "thrift is frozen beyond 2.1" or "don't use thrift". When people ask me what to learn with Cassandra, I say both Thrift and CQL. Not everyone has time to read the native protocol spec or dive into cassandra code, but clearly some people do and enjoy it.
I understand some people don't want the burden of maintaining Thrift, and that's totally valid. It's up to those that want to keep Thrift to make sure patches and enhancements are well tested and solid.

On Wed, Mar 12, 2014 at 11:52 AM, Nate McCall n...@thelastpickle.com wrote: IME/O one of the best things about Cassandra was the separation of (and I'm over-simplifying a bit, but still):
- The transport/API layer
- The Datacenter layer
- The Storage layer
I don't think we're well-served by the construction kit approach. It's difficult enough to evaluate NoSQL without deciding if you should run CQLSandra or Hectorsandra or Intravertandra etc. In tree, or even documented, I agree completely. I've never argued CQL3 is not the best approach for new users. But I've been around long enough that I know precisely what I want to do sometimes, and any general-purpose API will get in the way of that. I would like the transport/API layer to at least remain pluggable (hackable, if you will) in its current form. I really just want to be able to create my own *Daemon - as I can now - and go on my merry way without having to modify any internals. Much like with compaction strategies and SSTable components. Do you intend to change this current behavior of allowing a custom transport without code modification? (as opposed to changing the daemon class in a script?).
[no subject]
Hello all,

The environment: I have a 6-node Cassandra cluster. On each node I have:
- 32 G RAM
- 24 G RAM for cassa
- ~150 - 200 MB/s disk speed
- tomcat 6 with axis2 webservice that uses the datastax java driver to make asynch reads / writes
- replication factor for the keyspace is 3

All nodes are in the same data center. The clients that read / write are in the same datacenter, so the network is Gigabit. Writes are performed via exposed methods from the Axis2 WS. The Cassandra Java driver uses the round-robin load balancing policy, so all the nodes in the cluster should be hit with write requests under heavy write or read load from multiple clients. I am monitoring all nodes with JConsole from another box.

The problem: When writing to a particular column family, only 3 nodes have high CPU load ~ 80 - 99 %. The remaining 3 are at ~2 - 10 % CPU. During writes, reads time out. I need more speed for both writes and reads. The fact that 3 nodes barely have any CPU activity leads me to think that the full potential of C* is not being reached. I am running out of ideas... If further details about the environment are needed, I can provide them. Thank you very much.
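For concreteness, the client setup described would look roughly like this with the DataStax Java driver of that era (the contact point and keyspace name are hypothetical; class names are from the 2.0-series driver):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.RoundRobinPolicy;

// Round-robin spreads *coordinator* duty evenly across nodes, but the
// replicas that actually store a partition are still chosen by the
// partition key. With RF=3, a few hot partition keys can saturate
// exactly 3 nodes no matter how requests are load-balanced.
Cluster cluster = Cluster.builder()
        .addContactPoint("192.168.0.1")            // hypothetical seed node
        .withLoadBalancingPolicy(new RoundRobinPolicy())
        .build();
Session session = cluster.connect("my_keyspace");  // hypothetical keyspace
```

This is why, as noted in the replies, an uneven CPU profile usually points at the data model (partition key distribution) rather than the driver's balancing policy.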
Dead node seen as UP by replacement node
Hello, I'm trying to replace a dead node using the procedure in [1], but the replacement node initially sees the dead node as UP, and after a few minutes the node is marked as DOWN again, failing the streaming/bootstrap procedure of the replacement node. This dead node is always seen as DOWN by the rest of the cluster. Could this be a bug? I can easily reproduce it in our production environment, but don't know if it's reproducible in a clean environment.

Version: 1.2.13

Here is the log from the replacement node (192.168.1.10 is the dead node):

INFO [GossipStage:1] 2014-03-12 20:25:41,089 Gossiper.java (line 843) Node /192.168.1.10 is now part of the cluster
INFO [GossipStage:1] 2014-03-12 20:25:41,090 Gossiper.java (line 809) InetAddress /192.168.1.10 is now UP
INFO [GossipTasks:1] 2014-03-12 20:34:54,238 Gossiper.java (line 823) InetAddress /192.168.1.10 is now DOWN
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java (line 110) Stream failed because /192.168.1.10 died or was restarted/removed (streams may still be active in background, but further streams won't be started)
WARN [GossipTasks:1] 2014-03-12 20:34:54,240 RangeStreamer.java (line 246) Streaming from /192.168.1.10 failed
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java (line 110) Stream failed because /192.168.1.10 died or was restarted/removed (streams may still be active in background, but further streams won't be started)
WARN [GossipTasks:1] 2014-03-12 20:34:54,241 RangeStreamer.java (line 246) Streaming from /192.168.1.10 failed

[1] http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node

Cheers, Paulo

-- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314
Re: Dead node seen as UP by replacement node
Some further info: I'm not using vnodes, so I'm using the 1.1 replace-node trick of setting initial_token in the cassandra.yaml file to the value of the dead node's token minus 1, with auto_bootstrap=true. However, according to the Apache wiki ( https://wiki.apache.org/cassandra/Operations#For_versions_1.2.0_and_above ), on 1.2 you should actually remove the dead node from the ring before adding a replacement node. Does that mean the trick of setting the initial token to the dead node's token minus 1 (described in http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node) is no longer valid in 1.2 without vnodes?

On Wed, Mar 12, 2014 at 5:57 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hello, I'm trying to replace a dead node using the procedure in [1], but the replacement node initially sees the dead node as UP, and after a few minutes the node is marked as DOWN again, failing the streaming/bootstrap procedure of the replacement node. This dead node is always seen as DOWN by the rest of the cluster. Could this be a bug? I can easily reproduce it in our production environment, but don't know if it's reproducible in a clean environment.
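For clarity, the 1.1-style workaround being described amounts to the following cassandra.yaml fragment on the replacement node (the token value below is purely hypothetical; substitute the dead node's actual token minus 1):

```yaml
# cassandra.yaml on the replacement node (1.1-style trick, no vnodes)
initial_token: 85070591730234615865843651857942052863  # hypothetical: dead node's token - 1
auto_bootstrap: true
```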
Version: 1.2.13

Here is the log from the replacement node (192.168.1.10 is the dead node):

INFO [GossipStage:1] 2014-03-12 20:25:41,089 Gossiper.java (line 843) Node /192.168.1.10 is now part of the cluster
INFO [GossipStage:1] 2014-03-12 20:25:41,090 Gossiper.java (line 809) InetAddress /192.168.1.10 is now UP
INFO [GossipTasks:1] 2014-03-12 20:34:54,238 Gossiper.java (line 823) InetAddress /192.168.1.10 is now DOWN
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java (line 110) Stream failed because /192.168.1.10 died or was restarted/removed (streams may still be active in background, but further streams won't be started)
WARN [GossipTasks:1] 2014-03-12 20:34:54,240 RangeStreamer.java (line 246) Streaming from /192.168.1.10 failed
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java (line 110) Stream failed because /192.168.1.10 died or was restarted/removed (streams may still be active in background, but further streams won't be started)
WARN [GossipTasks:1] 2014-03-12 20:34:54,241 RangeStreamer.java (line 246) Streaming from /192.168.1.10 failed

[1] http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node

Cheers, Paulo

-- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314

-- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314
Re:
That is too much RAM for Cassandra; make that 6g to 10g. The uneven perf could be because your requests do not shard evenly.

On Wednesday, March 12, 2014, Batranut Bogdan batra...@yahoo.com wrote: Hello all, The environment: I have a 6-node Cassandra cluster. On each node I have: - 32 G RAM - 24 G RAM for cassa - ~150 - 200 MB/s disk speed - tomcat 6 with axis2 webservice that uses the datastax java driver to make asynch reads / writes - replication factor for the keyspace is 3 All nodes in the same data center The clients that read / write are in the same datacenter so network is Gigabit. Writes are performed via exposed methods from Axis2 WS. The Cassandra Java driver uses the round-robin load balancing policy so all the nodes in the cluster should be hit with write requests under heavy write or read load from multiple clients. I am monitoring all nodes with JConsole from another box. The problem: When writing to a particular column family, only 3 nodes have high CPU load ~ 80 - 99 %. The remaining 3 are at ~2 - 10 % CPU. During writes, reads time out. I need more speed for both writes and reads. Due to the fact that 3 nodes barely have CPU activity leads me to think that the whole potential for C* is not touched. I am running out of ideas... If further details about the environment are needed, I can provide them. Thank you very much.

-- Sorry this was sent from mobile. Will do less grammar and spell check than usual.
Re:
I wouldn't go above 8G unless you have a very powerful machine that can keep the GC pauses low. Sent from my iPhone

On Mar 12, 2014, at 7:11 PM, Edward Capriolo edlinuxg...@gmail.com wrote: That is too much RAM for Cassandra; make that 6g to 10g. The uneven perf could be because your requests do not shard evenly.

On Wednesday, March 12, 2014, Batranut Bogdan batra...@yahoo.com wrote: Hello all, The environment: I have a 6-node Cassandra cluster. On each node I have: - 32 G RAM - 24 G RAM for cassa - ~150 - 200 MB/s disk speed - tomcat 6 with axis2 webservice that uses the datastax java driver to make asynch reads / writes - replication factor for the keyspace is 3 All nodes in the same data center The clients that read / write are in the same datacenter so network is Gigabit. Writes are performed via exposed methods from Axis2 WS. The Cassandra Java driver uses the round-robin load balancing policy so all the nodes in the cluster should be hit with write requests under heavy write or read load from multiple clients. I am monitoring all nodes with JConsole from another box. The problem: When writing to a particular column family, only 3 nodes have high CPU load ~ 80 - 99 %. The remaining 3 are at ~2 - 10 % CPU. During writes, reads time out. I need more speed for both writes and reads. Due to the fact that 3 nodes barely have CPU activity leads me to think that the whole potential for C* is not touched. I am running out of ideas... If further details about the environment are needed, I can provide them. Thank you very much.

-- Sorry this was sent from mobile. Will do less grammar and spell check than usual.
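On *nix, the knobs for this advice live in conf/cassandra-env.sh; a sketch with illustrative values in line with the 6-10G guidance above (not a recommendation for any particular workload):

```sh
# conf/cassandra-env.sh -- illustrative values only
MAX_HEAP_SIZE="8G"    # total JVM heap; keep it modest so GC pauses stay short
HEAP_NEWSIZE="800M"   # young-generation size; commonly sized around 100MB per core
```

Leaving both unset lets the script compute defaults from the machine's RAM and core count.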
Re: Driver documentation questions
While this is a question that would fit better on the Java driver group [1], I'll try to provide a very short answer:

1. Cluster is a long-lived object and the application should have only 1 instance.
2. Session is also a long-lived object and you should try to have 1 Session per keyspace. A session manages connection pools for nodes in the cluster and is an expensive resource.
2.1. In case your application uses a lot of keyspaces, then you should try to limit the number of Sessions and use fully qualified identifiers.
3. PreparedStatements should be prepared only once. Session and PreparedStatements are thread-safe and should be shared across your app.

[1] https://groups.google.com/a/lists.datastax.com/forum/#!forum/java-driver-user

On Fri, Mar 7, 2014 at 12:42 PM, Green, John M (HP Education) john.gr...@hp.com wrote: I've been tinkering with both the C++ and Java drivers, but in neither case have I gotten a good indication of how threading and resource mgmt should be implemented in a long-lived multi-threaded application server process. That is, what should be the scope of a builder, a cluster, a session, and a statement? A JDBC connection is typically a per-thread affair. When an application server receives a request, it typically a) gets a JDBC connection from a connection pool, b) processes the request, c) returns the connection to the JDBC connection pool. All the Cassandra driver sample code I've seen so far is for single-threaded command-line applications, so I'm wondering what is thread-safe (if anything) and what objects are "expensive" to instantiate. I'm assuming a Session is analogous to a JDBC connection, so when a request comes into my multi-threaded application server, I should create a new Session (or find a way to pool Sessions), but should I be creating a new cluster first? What about a builder? John "lost in the abyss"

-- :- a) Alex Popescu Sen. Product Manager @ DataStax @al3xandru
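The lifecycle rules above can be sketched as follows (class names are from the 2.0-era DataStax Java driver; the contact point, keyspace, and query are hypothetical):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraDao {
    // One Cluster for the whole application.
    private static final Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")            // hypothetical contact point
            .build();

    // One Session per keyspace, shared across threads.
    private static final Session session = cluster.connect("my_ks"); // hypothetical keyspace

    // Prepared exactly once at startup, then bound per request.
    private static final PreparedStatement byId =
            session.prepare("SELECT * FROM users WHERE id = ?");     // hypothetical table

    // Safe to call concurrently: Session and PreparedStatement are thread-safe.
    public Row findUser(java.util.UUID id) {
        return session.execute(byId.bind(id)).one();
    }
}
```

Unlike JDBC, there is no per-thread checkout: every request thread shares the same Session, which multiplexes requests over its own connection pools.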
750Gb compaction task
After rebalance and cleanup I have a leveled CF (SSTable size = 100MB) and a compaction task that is going to process ~750GB:

root@da1-node1:~# nodetool compactionstats
pending tasks: 10556
   compaction type   keyspace      column family   completed     total          unit    progress
        Compaction   cafs_chunks   chunks          41015024065   808740269082   bytes   5.07%

I have no space for this operation; I only have 300 GB. Is it possible to resolve this situation?