Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
Hi Ed,

I agree Solr is deeply integrated into DSE. I've looked at Solandra in the
past and studied the code.

My understanding is DSE uses Cassandra for storage and the user has both
APIs available. I do think it can be integrated further to make
moderate-to-complex queries easier and probably faster. That's why we built
our own JPA-like object query API. I would love to see Cassandra get to the
point where users can define complex queries with subqueries, LIKE, GROUP
BY, and joins. Clearly lots of people want these features, and even Google
built their own tools to do these types of queries.

I see lots of people trying to improve this with Presto, Impala, Drill,
etc. To me, it's a natural progression as NoSQL databases mature. For most
people, at some point you want to be able to report on and analyze the
data. Today some people use MapReduce to summarize the data and ETL it into
a relational database or OLAP database for reporting. Even though I don't
need CAS or atomic batch for what I do in Cassandra today, I'm sure they
will be handy in the future. From my experience in the financial and
insurance sectors, features like CAS and SELECT FOR UPDATE are important
for the kinds of transactions they handle. I'm biased: these kinds of
features are useful and a good addition to Cassandra.

These are interesting times in database land!




On Tue, Mar 11, 2014 at 10:57 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 Peter,
 Solr is deeply integrated into DSE. Seemingly this cannot efficiently be
 done client side (CQL/Thrift, whatever), but the Solandra approach was to
 embed Solr in Cassandra. I think that is actually the future of client dev:
 allowing users to embed custom server-side logic into their own API.

 Things like this take a while. Back in the day no one wanted Cassandra to
 be heavyweight, and ideas like read-before-write operations were rejected.
 The common advice was to do them client side. Now, in the case of
 collections, Cassandra sometimes does read-before-write, and it is the
 stuff users want.
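
 A minimal sketch of that collection behavior, using the 2.0-era DataStax
 Java driver (the keyspace, table, and values here are invented for
 illustration):

     import com.datastax.driver.core.Cluster;
     import com.datastax.driver.core.Session;

     public class ListWriteDemo {
         public static void main(String[] args) {
             Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo");
             // Appending to a list is a blind write; nothing is read server side.
             session.execute(
                 "UPDATE users SET emails = emails + ['a@example.com'] WHERE id = 1");
             // Setting an element by index makes Cassandra read the list first --
             // the read-before-write that used to be pushed onto clients.
             session.execute(
                 "UPDATE users SET emails[0] = 'b@example.com' WHERE id = 1");
             cluster.close();
         }
     }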



 On Tue, Mar 11, 2014 at 10:07 PM, Peter Lin wool...@gmail.com wrote:


 I'll give you a concrete example.

 One of the things we often need to do is a keyword search on unstructured
 text. In our tooling, we combined Solr with Cassandra, but we put an
 Object API in front of it. The API is inspired by JPA, but designed
 specifically to fit our needs.

 The user can do queries with LIKE '%blah%', and behind the scenes we issue
 a query to Solr to find the keys and then query Cassandra for the records.
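
 Roughly, that two-step lookup looks like this -- a sketch using SolrJ and
 the 2.0-era DataStax Java driver, where the core URL, table, and field
 names are all invented:

     import com.datastax.driver.core.Cluster;
     import com.datastax.driver.core.Session;
     import org.apache.solr.client.solrj.SolrQuery;
     import org.apache.solr.client.solrj.impl.HttpSolrServer;
     import org.apache.solr.common.SolrDocument;

     public class KeywordSearch {
         public static void main(String[] args) throws Exception {
             HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/records");
             Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo");
             // Step 1: ask Solr for the row keys matching the keyword.
             SolrQuery q = new SolrQuery("body:*blah*");
             for (SolrDocument doc : solr.query(q).getResults()) {
                 Object key = doc.getFieldValue("id");
                 // Step 2: fetch each matching record from Cassandra by key.
                 session.execute("SELECT * FROM records WHERE id = ?", key);
             }
             cluster.close();
             solr.shutdown();
         }
     }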

 With plain Cassandra, the developer has to do all of this manually and
 integrate Solr themselves. Then they have to know which system to query
 and in what order. Our tooling lets the user define the schema in a
 modeler. Once the model is done, it generates the classes, configuration
 files, data access objects and unit tests.

 When the application makes a call, our query classes handle the details
 behind the scenes. I know lots of people would like to see Solr integrated
 more deeply into Cassandra and CQL. I hope it happens in the future. If
 DataStax accepts my talk, we will be showing our temporal database and
 modeler in September.




 On Tue, Mar 11, 2014 at 9:54 PM, Steven A Robenalt srobe...@stanford.edu wrote:

 I should add that I'm not trying to ignite a flame war. Just trying to
 understand your intentions.


 On Tue, Mar 11, 2014 at 6:50 PM, Steven A Robenalt srobe...@stanford.edu wrote:

 Okay, I'm officially lost on this thread. If you plan on forking
 Cassandra to preserve and continue to enhance the Thrift interface, you
 would also want to add a bunch of relational features to CQL as part of
 that same fork?


 On Tue, Mar 11, 2014 at 6:20 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 One of the things I'd like to see happen is for Cassandra to support
 queries with disjunction, EXISTS, subqueries, joins and LIKE. In theory,
 CQL could support these features in the future. Cassandra would need a new
 query compiler and query planner; I don't see how the current design could
 do these things without a significant redesign/enhancement. In a past
 life, I implemented an inference rule engine, so I've spent over a decade
 studying and implementing query optimizers. All of these things can be
 done; it's just a matter of people finding the time to do it.

 I see what you're saying. CQL started as a way to make slices easier, but
 it is not even a full query language; retrofitting these things is going
 to be very hard.



 On Tue, Mar 11, 2014 at 7:45 PM, Peter Lin wool...@gmail.com wrote:


 I have no problem maintaining my own fork :) or joining others forking
 Cassandra.

 I'd be happy to work with you or anyone else to add features to Thrift.
 That's the great thing about open source: each person can scratch a
 technical itch and do what they love. I see lots of potential directions
 for Cassandra, and many of them involve improving Thrift to make them
 happen. Some of the features in theory could be done in 

Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread DuyHai Doan
"I would love to see Cassandra get to the point where users can define
complex queries with subqueries, LIKE, GROUP BY and joins" -- did you have
a look at Intravert? I think it does union & intersection on the server
side for you. Not sure about joins though.



Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
Yes, I was looking at Intravert last night.

For the kinds of reports my customers ask us to do, joins and subqueries
are important. Having tried to do a simple join in Pig, the level of pain
is high. I'm a masochist, so I don't mind breaking a simple join into
multiple MR tasks, though I do find myself asking "why the hell does it
need to be so painful in Pig?" Many of my friends say "what is this crap!"
or "this is better than writing SQL queries to run reports?"

Plus, using ETL techniques to extract summaries only works for cases where
the data is small enough. Once it gets beyond a certain size, it's not
practical, which means we're back to crappy reporting languages that make
life painful. Lots of big healthcare companies have thousands of MOLAP
cubes on dozens of mainframes. The old OLTP -> DW/OLAP pipeline creates its
own set of management headaches.

Being able to report directly on the raw data avoids many of these issues,
but that's my biased perspective.





Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Brian O'Neill

just when you thought the thread died...


First, let me say we are *WAY* off topic.  But that is a good thing.
I love this community because there are a ton of passionate, smart people.
(often with differing perspectives ;)

RE: Reporting against C* (@Peter Lin)
We've had the same experience.  Pig + Hadoop is painful.  We are
experimenting with Spark/Shark, operating directly against the data.
http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html

The Shark layer gives you SQL and caching capabilities that make it easy to
use and fast (for smaller data sets).  In front of this, we are going to add
dimensional aggregations so we can operate at larger scales.  (then the Hive
reports will run against the aggregations)

RE: REST Server (@Russell Bradberry)
We had moderate success with Virgil, which was a REST server built directly
on Thrift.  We built it directly on top of Thrift, so one day it could be
easily embedded in the C* server itself.   It could be deployed separately,
or run an embedded C*.  More often than not, we ended up running it
separately to separate the layers.  (just like Titan and Rexster)  I've
started on a rewrite of Virgil called Memnon that rides on top of CQL. (I'd
love some help)
https://github.com/boneill42/memnon

RE: CQL vs. Thrift
We've hitched our wagons to CQL.  CQL != Relational.
We've had success translating our "native" schemas into CQL, including all
the NoSQL goodness of wide-rows, etc.  You just need a good understanding of
how things translate into storage and underlying CFs.  If anything, I think
we could add some DESCRIBE information, which would help users with this,
along the lines of:
(https://issues.apache.org/jira/browse/CASSANDRA-6676)
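
For instance, the classic wide-row pattern maps onto CQL as a clustering
column.  A generic sketch (not our actual schema), using the DataStax Java
driver:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class WideRowDemo {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo"); // assumes a keyspace "demo"
            // One storage-engine row per sensor_id; each (reading_time, value)
            // pair becomes cells under that row -- the hand-built wide row of
            // the Thrift days, expressed as a clustering column.
            session.execute("CREATE TABLE readings (" +
                            "  sensor_id uuid," +
                            "  reading_time timestamp," +
                            "  value double," +
                            "  PRIMARY KEY (sensor_id, reading_time))");
            cluster.close();
        }
    }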

CQL does open up the *opportunity* for users to articulate more complex
queries using more familiar syntax.  (including future things such as joins,
grouping, etc.)   To me, that is exciting, and again -- one of the reasons we
are leaning on it.

my two cents,
brian

---
Brian O'Neill
Chief Technology Officer


Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com





Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Russell Bradberry
I would love to help with the REST interface; however, my point was not to add 
REST into Cassandra.  My point was that if we had an abstract interface that 
even CQL used to access data, and this interface were made available for other 
drop-in modules, then the project would become extensible as a whole.  You 
get CQL out of the box, but it allows others to create interface projects of 
their own and keep them up without putting the burden of that maintenance on 
the core developers.

It could also mean that down the line, say if CQL stops working out -- like 
Avro and Thrift before it -- then pulling it out would be less of a problem.  
We can even get all cowboy up in here and put CQL in its own project that can 
grow by itself, as long as an interface in the Cassandra project is made 
available.



Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
I'm enjoying the discussion also.

@Brian
I've been looking at Spark/Shark along with other recent developments over
the last few years. Berkeley has been doing some interesting stuff. One
reason I like Thrift is the type safety and the benefits it brings for
query validation and query optimization. One could do similar things with
CQL, but it's just more work, especially with dynamic columns. I know
others are mixing static with dynamic columns, so I'm not alone. I have no
clue how long it will take to get there, but having tools like query
explanation is a big time saver. Writing business reports is hard enough,
so every bit of help the tool can provide makes it less painful.



Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Theo Hultberg
Speaking as a CQL driver maintainer (Ruby) I'm +1 for end-of-lining Thrift.

I agree with Edward that it's unfortunate that there are no official
drivers being maintained by the Cassandra maintainers -- even though the
current state with the DataStax drivers is in practice very close (it is
not the same thing though).

However, I don't agree that not having drivers in the same repo/project is
a problem. Whether there's a Java driver in the Cassandra source or not
doesn't matter at all to us non-Java developers, and I don't see any
difference between the situation where there's no driver in the source and
one where there's just a Java driver. I might have misunderstood Edward's
point about this, though.

The CQL protocol is the key, as others have mentioned. As long as that is
maintained and respected, I think it's absolutely fine not having any
drivers shipped as part of Cassandra. However, I feel as if this has not
been the case lately. I'm thinking particularly about the UDT feature of
2.1, which is not a part of the CQL spec. There is no documentation on how
drivers should handle them or what a user should be able to expect from a
driver; they're completely implemented as custom types.

I hope this will be fixed before 2.1 is released (and there have been good
discussions on the mailing lists about how a driver should handle UDTs),
but it shows a problem with the the-spec-is-the-truth argument. I think
we'll be fine as long as the spec is the truth, but that requires the spec
to be the truth and new features not to be bolted on outside of the spec.

T#



Re: NetworkTopologyStrategy ring distribution across 2 DC

2014-03-12 Thread Ramesh Natarajan
Thanks. The error is gone if I specify the keyspace name. However, the
replica count in the ring output is not correct. Shouldn't it say 3,
since I have DC1:3, DC2:3 in my schema?


thanks
Ramesh

Datacenter: DC1
==========
Replicas: 2

Address        Rack  Status  State   Load     Owns    Token
                                                      -9223372036854775808
192.168.1.107  RAC1  Up      Normal  4.72 MB  42.86%  6588122883467697004
192.168.1.106  RAC1  Up      Normal  4.73 MB  42.86%  3952873730080618202
192.168.1.105  RAC1  Up      Normal  4.8 MB   42.86%  1317624576693539400
192.168.1.104  RAC1  Up      Normal  4.77 MB  42.86%  -1317624576693539402
192.168.1.103  RAC1  Up      Normal  4.83 MB  42.86%  -3952873730080618204
192.168.1.102  RAC1  Up      Normal  4.69 MB  42.86%  -6588122883467697006
192.168.1.101  RAC1  Up      Normal  4.8 MB   42.86%  -9223372036854775808

Datacenter: DC2
==========
Replicas: 2

Address        Rack  Status  State   Load     Owns    Token
                                                      3952873730080618203
192.168.1.111  RAC1  Up      Normal  4.73 MB  42.86%  -1317624576693539401
192.168.1.110  RAC1  Up      Normal  4.79 MB  42.86%  -3952873730080618203
192.168.1.109  RAC1  Up      Normal  3.16 MB  42.86%  -6588122883467697005
192.168.1.108  RAC1  Up      Normal  3.22 MB  42.86%  -9223372036854775807
192.168.1.114  RAC1  Up      Normal  4.69 MB  42.86%  6588122883467697005
192.168.1.112  RAC1  Up      Normal  4.76 MB  42.86%  1317624576693539401
192.168.1.113  RAC1  Up      Normal  3.19 MB  42.86%  3952873730080618203


On Tue, Mar 11, 2014 at 7:24 PM, Tyler Hobbs ty...@datastax.com wrote:


  On Tue, Mar 11, 2014 at 1:37 PM, Ramesh Natarajan rames...@gmail.com wrote:


 Note: Ownership information does not include topology; for complete
 information, specify a keyspace

 Also the owns column is 0% for the second DC.

 Is this normal?


 Yes.

 Without a keyspace specified, the Owns column is showing the equivalent of
 SimpleStrategy with replication_factor=1.  If you specify a keyspace, it
 will take the replication strategy and options into account.
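
 For example, once a keyspace is specified, the 42.86% figures above check
 out: with RF=3 in a datacenter of 7 nodes, each node owns 3/7 ≈ 42.86% of
 that datacenter's data.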


 --
 Tyler Hobbs
 DataStax http://datastax.com/



Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
@Theo
I totally understand that. Spending time to maintain support for two
different protocols is a significant overhead. From my own experience
contributing to open source projects, time is the biggest limiting factor.
My biased perspective: CQL can be extended with additional features so that
query validation and optimization become easier. If we look at the history
of RDBMS and the development of query planners/optimizers, having the type
metadata is important. RDBMS don't have to deal with dynamic columns, since
the schema is static. Even then, there are dozens of papers from
researchers and implementers on how to optimize a query plan. If we look at
data grid products, we see a similar thing. Coherence gives users the
ability to query their key/value data and get a query plan. I hope projects
like Presto, Impala, etc. will provide these features eventually. I favor
Thrift for a simple reason: my modeling tool and framework retain the type
information, and that makes it easier to build query optimizers. I realize
not everyone cares about this kind of stuff or has to write complex
reports. I'm not suggesting others spend their valuable time improving
Thrift. At the same time, if I'm willing to work on Thrift and the
enhancements are acceptable to others, then Cassandra should include them.
If not, I'm happy to fork Cassandra and do my own thing. I can't be the
only person that needs to do complex reports.

peter





Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Nate McCall
IME/O one of the best things about Cassandra was the separation of (and I'm
over-simplifying a bit, but still):

- The transport/API layer
- The Datacenter layer
- The Storage layer


 I don't think we're well-served by the construction kit approach.
 It's difficult enough to evaluate NoSQL without deciding if you should
 run CQLSandra or Hectorsandra or Intravertandra etc.

In tree, or even documented, I agree completely. I've never argued CQL3 is
not the best approach for new users.

But I've been around long enough that I know precisely what I want to do
sometimes, and any general-purpose API will get in the way of that.

I would like the transport/API layer to at least remain pluggable
("hackable", if you will) in its current form. I really just want to be
able to create my own *Daemon - as I can now - and go on my merry way
without having to modify any internals, much like with compaction
strategies and SSTable components.

Do you intend to change the current behavior of allowing a custom
transport without code modification (as opposed to changing the daemon
class in a script)?
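
Concretely, the kind of thing I mean -- a rough sketch where the subclass
and the custom transport are invented, and only the CassandraDaemon
lifecycle calls are real:

    import org.apache.cassandra.service.CassandraDaemon;

    public class MyDaemon extends CassandraDaemon
    {
        @Override
        public void start()
        {
            super.start();          // bring up the configured transports as usual
            startCustomServer();    // hypothetical: e.g. an embedded REST server
        }

        private void startCustomServer() { /* ... */ }

        public static void main(String[] args)
        {
            new MyDaemon().activate(); // same entry point the stock daemon uses
        }
    }

Pointing the startup script at MyDaemon instead of CassandraDaemon would
then be the only change.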


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Edward Capriolo
Great points about the CQL driver and the supposed spec. It shows how a
driver living outside the project poses a problem to open source
development. How could custom types have been implemented without a spec?
In the Apache world the saying is "If it did not happen on the list, it
did not happen." Did that happen here?

I still do not understand how an open source Apache Java database can rely
on third-party client software to connect to said database. However, the
committers seem comfortable with this arrangement, to the point that they
are willing to remove support for the other way to connect to the database.

Again, I am glad that the project has officially ended support for Thrift
with this clear decree. For years the project kept saying "Thrift is not
going anywhere." It was obviously meant literally: the project would do
the absolute minimum to support it until it could make the case to remove
it completely.





Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Tupshin Harper
I agree that we are way off the initial topic, but I think we are spot on
the most important topic. As seen in various tickets, including #6704 (wide
row scanners), #6167 (end-slice termination predicate), the existence
of intravert-ug (Cassandra interface to intravert), and a number of others,
there is an increasing desire to do more complicated processing,
server-side, on a Cassandra cluster.

I very much share those goals, and would like to propose the following only
partially hand-wavey path forward.

Instead of creating a pluggable interface for Thrift, I'd like to create a
pluggable interface for arbitrary app-server deep integration.

Inspired by both the existence of intravert-ug, as well as the long
history of various parties embedding Tomcat or Jetty servlet engines
inside Cassandra, I'd like to propose the creation of an internal,
somewhat stable (versioned?) interface that could allow any app server to
achieve deep integration with Cassandra. As a result, these servers could
1) host their own APIs (REST, for example)
2) extend core functionality by having limited (see triggers and wide row
scanners) access to the internals of Cassandra

The hand-wavey part comes in because, while I have been mulling this over
for a while, I have not spent any significant time looking at the actual
surface area of intravert-ug's integration. But, using it as a model, and
also keeping in mind the general needs of the more traditional
servlet/J2EE containers, I believe we could come up with a reasonable
interface to allow any JVM app server to be integrated and maintained in
or out of the Cassandra tree.
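
To make the hand-wavey part a bit more concrete, the hook could be as
small as something like this (purely illustrative -- none of these types
exists today):

    // Placeholder for whatever restricted view of the node we'd expose.
    interface NodeInternals {}

    // A versioned entry point a JVM app server would implement in order to
    // be hosted in-process, with a deliberately limited view of internals.
    public interface EmbeddedAppServer
    {
        int apiVersion();               // versioned so internals can evolve
        void start(NodeInternals node); // restricted handle on Cassandra internals
        void stop();                    // clean shutdown alongside the node
    }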

This would satisfy the need that many of us (both Ed and I, for example)
have for a much greater degree of control over server-side execution, and
would let us start building much more interesting (and simpler) tiered
applications.

Anybody interested in working on a coherent proposal with me?

-Tupshin



Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
@Nate
I don't want to change the separation of components in Cassandra. My
ultimate goal is to make writing complex queries less painful and more
efficient. How that becomes reality is anyone's guess; there are different
ways to get there. I also like having a pluggable transport layer, which
is why I feel sad every time I hear people say "Thrift is dead", "Thrift
is frozen beyond 2.1", or "don't use Thrift". When people ask me what to
learn with Cassandra, I say both Thrift and CQL. Not everyone has time to
read the native protocol spec or dive into Cassandra code, but clearly
some people do and enjoy it. I understand some people don't want the
burden of maintaining Thrift, and that's totally valid. It's up to those
who want to keep Thrift to make sure patches and enhancements are well
tested and solid.







Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Russell Bradberry
@Nate, @Tupshin, this is pretty close to what I had in mind. I would be open to 
helping out with a formal proposal.



On March 12, 2014 at 12:11:41 PM, Tupshin Harper (tups...@tupshin.com) wrote:

I agree that we are way off the initial topic, but I think we are spot on the 
most important topic. As seen in various tickets, including #6704 (wide row 
scanners), #6167 (end-slice termination predicate), the existence of 
intravert-ug (a Cassandra interface to Intravert), and a number of others, there 
is an increasing desire to do more complicated processing, server-side, on a 
Cassandra cluster.

I very much share those goals, and would like to propose the following only 
partially hand-wavey path forward.

Instead of creating a pluggable interface for Thrift, I'd like to create a 
pluggable interface for arbitrary app-server deep integration.

Inspired by both the existence of intravert-ug and the long history of various 
parties embedding Tomcat or Jetty servlet engines inside Cassandra, I'd like to 
propose the creation of an internal, somewhat stable (versioned?) interface 
that would allow any app server to achieve deep integration with Cassandra. As 
a result, these servers could 
1) host their own APIs (REST, for example)
2) extend core functionality by having limited (see triggers and wide row 
scanners) access to the internals of Cassandra

The hand-wavey part comes in because, while I have been mulling this over for a 
while, I have not spent any significant time looking at the actual surface 
area of intravert-ug's integration. But, using it as a model, and also keeping 
in mind the general needs of your more traditional servlet/J2EE containers, I 
believe we could come up with a reasonable interface that would allow any JVM 
app server to be integrated and maintained in or out of the Cassandra tree.

This would satisfy the desire that many of us (both Ed and I, for example) have 
for a much greater degree of control over server-side execution, and let us 
start building much more interestingly (and simply) tiered applications.

Anybody interested in working on a coherent proposal with me?

-Tupshin
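
The servlet-engine embedding Tupshin mentions is, at its core, just starting a 
web server in the same JVM. A minimal sketch, assuming Jetty 9 on the 
classpath; the servlet itself is an invented placeholder, and the hard part - 
wiring it to Cassandra internals - is precisely what the proposed interface 
would define:

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.servlet.ServletContextHandler;
    import org.eclipse.jetty.servlet.ServletHolder;

    public class EmbeddedApi {
        // Placeholder servlet; a real one would call into Cassandra internals.
        public static class PingServlet extends HttpServlet {
            @Override
            protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                    throws IOException {
                resp.getWriter().println("pong");
            }
        }

        public static void main(String[] args) throws Exception {
            Server jetty = new Server(8080);                        // arbitrary HTTP port
            ServletContextHandler ctx = new ServletContextHandler();
            ctx.addServlet(new ServletHolder(new PingServlet()), "/api/*");
            jetty.setHandler(ctx);
            jetty.start();                                          // runs alongside the C* process
        }
    }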


On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill b...@alumni.brown.edu wrote:

just when you thought the thread died…


First, let me say we are *WAY* off topic.  But that is a good thing.  
I love this community because there are a ton of passionate, smart people. 
(often with differing perspectives ;)

RE: Reporting against C* (@Peter Lin)
We’ve had the same experience.  Pig + Hadoop is painful.  We are experimenting 
with Spark/Shark, operating directly against the data.
http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html

The Shark layer gives you SQL and caching capabilities that make it easy to use 
and fast (for smaller data sets).  In front of this, we are going to add 
dimensional aggregations so we can operate at larger scales.  (then the Hive 
reports will run against the aggregations)

RE: REST Server (@Russell Bradberry)
We had moderate success with Virgil, a REST server built directly on top of 
Thrift so that one day it could be easily embedded in the C* server itself. It 
could be deployed separately, or run an embedded C*. More often than not, we 
ended up running it separately to keep the layers separate. (just like Titan 
and Rexster)  I’ve started on a rewrite of Virgil called Memnon that rides on 
top of CQL. (I’d love some help)
https://github.com/boneill42/memnon

RE: CQL vs. Thrift
We’ve hitched our wagons to CQL.  CQL != Relational.  
We’ve had success translating our “native” schemas into CQL, including all the 
NoSQL goodness of wide rows, etc. You just need a good understanding of how 
things translate into storage and the underlying CFs. If anything, I think we 
could add some DESCRIBE information, which would help users with this, along 
the lines of:
(https://issues.apache.org/jira/browse/CASSANDRA-6676)
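
A hypothetical illustration of the translation Brian is describing, using the 
DataStax Java driver (the keyspace, table, and column names here are invented): 
a compound primary key is what gives you the classic wide-row layout underneath.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class WideRowExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo");  // keyspace assumed to exist
            // sensor_id is the partition key: one wide storage row per sensor.
            // event_time is the clustering column: one storage cell per event,
            // sorted by time within that row.
            session.execute("CREATE TABLE events (" +
                            "sensor_id text, event_time timestamp, payload blob, " +
                            "PRIMARY KEY (sensor_id, event_time))");
            cluster.close();  // driver 2.0+; use shutdown() on 1.x
        }
    }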

CQL does open up the *opportunity* for users to articulate more complex queries 
using more familiar syntax.  (including future things such as joins, grouping, 
etc.)   To me, that is exciting, and again — one of the reasons we are leaning 
on it.

my two cents,
brian

---
Brian O'Neill
Chief Technology Officer

Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42  •  
healthmarketscience.com


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Tupshin Harper
Peter,

I didn't specifically call it out, but the interface I just proposed in my
last email is aimed squarely at the goal of making writing complex queries
less painful and more efficient, by providing a deep integration mechanism
to host that code. It's very much an "enough rope to hang ourselves"
approach, but badly needed, IMO.

-Tupshin


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
@Nate & Tupshin, glad to help where I can



Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
@Tupshin
LOL, there's always enough rope to hang oneself. I agree it's badly needed
for folks that really do need messier queries. I was just discussing a
similar concept with a co-worker and going over the pros/cons of various
approaches to realizing the goal. I'm still digging into Presto; I saw some
people are working on Cassandra support in Presto.





Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Tupshin Harper
OK, so I'm greatly encouraged by the level of interest in this. I went
ahead and created https://issues.apache.org/jira/browse/CASSANDRA-6846, and
will be starting to look into what the interface would have to look like.
Feel free to continue the discussion here, email me privately, or
comment on the ticket with your thoughts.

-Tupshin




Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Nate McCall
Awesome! Thanks Tupshin (and everyone else). I'll put some of my thoughts
up there shortly.

On Wed, Mar 12, 2014 at 11:26 AM, Tupshin Harper tups...@tupshin.comwrote:

 OK, so I'm greatly encouraged by the level of interest in this. I went
 ahead and created https://issues.apache.org/jira/browse/CASSANDRA-6846,
 and will be starting to look into what the interface would have to look
 like. Anybody feel free to continue the discussion here, email me
 privately, or comment on ticket with your thoughts.

 -Tupshin




Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Edward Capriolo
@Tupshin

I like that approach; right now I think of that piece as the StorageProxy.
I agree that over the years people have taken that approach: Solandra is a
good example, and I am guessing DSE Solr works the same way. This says
something about the entire Thrift vs. CQL debate, as there are clearly
power users writing applications that use neither.

I do feel this vote was called to shoot down any attempt to add a feature
that was non-CQL. However, if you think you can drive something like this
forward, more power to you; I will help out.






Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Robert Coli
On Wed, Mar 12, 2014 at 9:10 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 Again, I am glad that the project has officially ended support for thrift
 with this clear decree. For years the project kept saying Thrift is not
 going anywhere. It was obviously meant literally like the project would do
 the absolute minimum to support it until they could make the case to remove
 it completely.


Yes, I didn't realize at the time, but both meanings of "not going
anywhere" were apparently intended.

"Not going anywhere" as in not likely to be removed (for another few major
versions at least),
but also
"Not going anywhere" as in being the (un/semi/barely-)maintained
second-class-citizen API.

For the record, I have always presumed that thrift will eventually be
removed from the codebase, so for me this new announcement does not
generate new surprise or outrage. Separate cannot be equal, and eventually
the pain of keeping it in there will outweigh the pain of deprecating it.
Even though I do not use CQL3 or the binary protocol and the removal of
thrift would force me to do so, having two APIs is so bizarro that I'm left
hoping that it *is* eventually deprecated...

=Rob


Opscenter help?

2014-03-12 Thread Drew from Zhrodague
I am having a hard time installing the DataStax OpsCenter agents on EL6
and EL5 hosts. Where is an appropriate place to ask for help? DataStax
has moved their forums to Stack Exchange, which seems to be a waste of
time, as I don't have enough reputation points to properly tag my questions.


The agent installation seems to be broken:
[] agent rpm conflicts with sudo
[] install from OpsCenter does not work, even when manually installing
the rpm (requires --force, conflicts with sudo)
[] error message re: log4j #noconf
[] Could not find the main class: opsagent.opsagent. Program will exit.
[] No other (helpful/more in-depth) documentation exists


--

Drew from Zhrodague
post-apocalyptic ad-hoc industrialist
d...@zhrodague.net


Java heap size does not change on Windows

2014-03-12 Thread Lukas Steiblys
I am running Windows Server 2008 R2 Enterprise on a 2 Core Intel Xeon with 16GB 
of RAM and I want to change the max heap size. I set MAX_HEAP_SIZE in 
cassandra-env.sh, but when I start Cassandra, it’s still reporting:

INFO 12:37:36,221 Global memtable threshold is enabled at 247MB
INFO 12:37:36,377 using multi-threaded compaction
INFO 12:37:36,705 JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.7.0_51
INFO 12:37:36,705 Heap size: 1037959168/1037959168

My question is: how do I change the heap size?

Lukas Steiblys


Re: Java heap size does not change on Windows

2014-03-12 Thread Tyler Hobbs
cassandra-env.sh is only used on *nix systems.  You'll need to change
bin/cassandra.bat.  Interestingly, that's hardcoded to use a 1G heap, which
seems like a bug.




-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Edward Capriolo
This brainstorming idea has already been -1'ed in JIRA. ROFL.




[no subject]

2014-03-12 Thread Batranut Bogdan
Hello all,

The environment:

I have a 6-node Cassandra cluster. On each node I have:
- 32 GB RAM
- 24 GB RAM for Cassandra
- ~150-200 MB/s disk speed
- Tomcat 6 with an Axis2 web service that uses the DataStax Java driver to
make asynchronous reads / writes
- replication factor of 3 for the keyspace

All nodes are in the same data center.
The clients that read / write are in the same data center, so the network
is Gigabit.

Writes are performed via methods exposed by the Axis2 WS. The Cassandra
Java driver uses the round-robin load balancing policy, so all the nodes in
the cluster should be hit with write requests under heavy write or read
load from multiple clients.

I am monitoring all nodes with JConsole from another box.

The problem:

When writing to a particular column family, only 3 nodes have high CPU load
(~80-99%). The remaining 3 are at ~2-10% CPU. During writes, reads time
out.

I need more speed for both writes and reads. The fact that 3 nodes barely
show any CPU activity leads me to think that the full potential of C* is
not being tapped.

I am running out of ideas...

If further details about the environment are needed, I can provide them.


Thank you very much.

Dead node seen as UP by replacement node

2014-03-12 Thread Paulo Ricardo Motta Gomes
Hello,

I'm trying to replace a dead node using the procedure in [1], but the
replacement node initially sees the dead node as UP, and after a few
minutes the node is marked as DOWN again, failing the streaming/bootstrap
procedure of the replacement node. This dead node is always seen as DOWN by
the rest of the cluster.

Could this be a bug? I can easily reproduce it in our production
environment, but don't know if it's reproducible in a clean environment.

Version: 1.2.13

Here is the log from the replacement node (192.168.1.10 is the dead node):

 INFO [GossipStage:1] 2014-03-12 20:25:41,089 Gossiper.java (line 843) Node
/192.168.1.10 is now part of the cluster
 INFO [GossipStage:1] 2014-03-12 20:25:41,090 Gossiper.java (line 809)
InetAddress /192.168.1.10 is now UP
 INFO [GossipTasks:1] 2014-03-12 20:34:54,238 Gossiper.java (line 823)
InetAddress /192.168.1.10 is now DOWN
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java
(line 110) Stream failed because /192.168.1.10 died or was
restarted/removed (streams may still be active in background, but further
streams won't be started)
 WARN [GossipTasks:1] 2014-03-12 20:34:54,240 RangeStreamer.java (line 246)
Streaming from /192.168.1.10 failed
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java
(line 110) Stream failed because /192.168.1.10 died or was
restarted/removed (streams may still be active in background, but further
streams won't be started)
 WARN [GossipTasks:1] 2014-03-12 20:34:54,241 RangeStreamer.java (line 246)
Streaming from /192.168.1.10 failed

[1]
http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node

Cheers,

Paulo

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200
+55 83 9690-1314


Re: Dead node seen as UP by replacement node

2014-03-12 Thread Paulo Ricardo Motta Gomes
Some further info:

I'm not using vnodes, so I'm using the 1.1 replace-node trick of setting
the initial_token in the cassandra.yaml file to the value of the dead
node's token -1, with auto_bootstrap=true. However, according to the Apache
wiki (
https://wiki.apache.org/cassandra/Operations#For_versions_1.2.0_and_above),
on 1.2 you should actually remove the dead node from the ring before
adding a replacement node.

Does that mean the trick of setting the initial token to the value of the
dead node's token -1 (described in
http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node) is
no longer valid in 1.2 without vnodes?




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200
+55 83 9690-1314


Re:

2014-03-12 Thread Edward Capriolo
That is too much RAM for Cassandra; make that 6 GB to 10 GB.

The uneven performance could be because your requests do not shard evenly.
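
One illustrative way to check that hunch with the 2.0-era DataStax Java
driver (a sketch under assumptions, not a diagnosis; the contact point is
invented): wrap the round-robin policy in a token-aware one so each request
goes to a replica that owns its key. If the same three nodes stay hot, the
load is likely concentrated on a few partition keys rather than mis-balanced
by the driver.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.RoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class ClusterSetup {
        public static Cluster build() {
            return Cluster.builder()
                    .addContactPoint("10.0.0.1")  // any live node; invented address
                    // Route each request to a replica owning the partition key,
                    // falling back to round-robin among those replicas.
                    .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
                    .build();
        }
    }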


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re:

2014-03-12 Thread Russ Bradberry
I wouldn't go above 8G unless you have a very powerful machine that can keep 
the GC pauses low.

Sent from my iPhone



Re: Driver documentation questions

2014-03-12 Thread Alex Popescu
While this is a question that would fit better on the Java driver group
[1], I'll try to provide a very short answer:

1. Cluster is a long-lived object and the application should have only 1
instance.
2. Session is also a long-lived object and you should try to have 1 Session
per keyspace.

A Session manages connection pools for the nodes in the cluster and is an
expensive resource.

2.1. In case your application uses a lot of keyspaces, you should try to
limit the number of Sessions and use fully qualified identifiers.

3. PreparedStatements should be prepared only once.

Sessions and PreparedStatements are thread-safe and should be shared across
your app.

[1]
https://groups.google.com/a/lists.datastax.com/forum/#!forum/java-driver-user
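
A minimal sketch of that lifecycle, assuming the 2.0-era Java driver (the
contact point, keyspace, table, and query are invented for illustration):

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class CassandraClient {
        private final Cluster cluster;        // 1 per application
        private final Session session;        // 1 per keyspace; pools connections internally
        private final PreparedStatement byId; // prepared once, reused everywhere

        public CassandraClient(String contactPoint) {
            cluster = Cluster.builder().addContactPoint(contactPoint).build();
            session = cluster.connect("myks");
            byId = session.prepare("SELECT * FROM users WHERE id = ?");
        }

        // Safe to call concurrently from many request threads.
        public Row findUser(java.util.UUID id) {
            BoundStatement bound = byId.bind(id); // cheap per-request object
            return session.execute(bound).one();
        }

        public void shutdown() {
            cluster.close(); // also closes the Session's connection pools
        }
    }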


On Fri, Mar 7, 2014 at 12:42 PM, Green, John M (HP Education) 
john.gr...@hp.com wrote:

 I’ve been tinkering with both the C++ and Java drivers, but in neither
 case have I got a good indication of how threading and resource mgmt should
 be implemented in a long-lived multi-threaded application server
 process. That is, what should be the scope of a builder, a cluster, a
 session, and a statement? A JDBC connection is typically a per-thread
 affair. When an application server receives a request, it typically

 a)  gets a JDBC connection from a connection pool,

 b)  processes the request,

 c)  returns the connection to the JDBC connection pool.



 All the Cassandra driver sample code I’ve seen so far is for single-
 threaded command-line applications, so I’m wondering what is thread-safe (if
 anything) and which objects are “expensive” to instantiate.   I’m assuming a
 Session is analogous to a JDBC connection, so when a request comes into my
 multi-threaded application server, should I create a new Session (or find a
 way to pool Sessions)? But should I be creating a new Cluster first?   What
 about a builder?



 John “lost in the abyss”




-- 

:- a)


Alex Popescu
Sen. Product Manager @ DataStax
@al3xandru


750Gb compaction task

2014-03-12 Thread Plotnik, Alexey
After rebalance and cleanup I have a leveled CF (SSTable size = 100 MB) and a 
compaction task that is going to process ~750 GB:

 root@da1-node1:~# nodetool compactionstats
 pending tasks: 10556
   compaction type   keyspace      column family   completed     total          unit    progress
   Compaction        cafs_chunks   chunks          41015024065   808740269082   bytes   5.07%

I have no space left for this operation; I have only 300 GB free. Is it 
possible to resolve this situation?