Re: How does cassandra achieve Linearizability?

2017-02-22 Thread Michael Shuler
I updated the fix version on CASSANDRA-6246 to 4.x. The 3.11.x edit was
a bulk move when removing the cassandra-3.X branch and the 3.x Jira
version. There are likely other new feature tickets that should really
say 4.x.

-- 
Kind regards,
Michael

On 02/22/2017 07:28 PM, Kant Kodali wrote:
> I hope that patch is reviewed as quickly as possible. We use LWT's
> heavily and we are getting a throughput of 600 writes/sec and each write
> is 1KB in our case.
> 
> 
> 
> 
> 
> On Wed, Feb 22, 2017 at 7:48 AM, Edward Capriolo wrote:
> 
> 
> 
> On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg wrote:
> 
> Hi,
> 
> No it's not going to be in 3.11.x. The earliest release it could
> make it into is 4.0.
> 
> Ariel
> 
> On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
>> Hi Ariel,
>>
>> Can we really expect the fix in 3.11.x as the
>> ticket https://issues.apache.org/jira/browse/CASSANDRA-6246 says?
>>
>> Thanks,
>> kant
>>
>> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg wrote:
>>
>> Hi,
>>
>> That would work and would help a lot with the dueling
>> proposer issue.
>>
>> A lot of the leader election stuff is designed to reduce
>> the number of roundtrips and not just address the dueling
>> proposer issue. Those will have downtime because it's
>> there for correctness. Just adding an affinity for a
>> specific proposer is probably a free lunch.
>>
>> I don't think you can group keys because the Paxos
>> proposals are per partition which is why we get linear
>> scale out for Paxos. I don't believe it's linearizable
>> across multiple partitions. You can use the clustering key
>> and deterministically pick one of the live replicas for
>> that clustering key. Sort the list of replicas by IP, hash
>> the clustering key, use the hash as an index into the list
>> of replicas.
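
As an illustrative aside (not from the original mail), a minimal sketch of that deterministic pick, assuming a hypothetical list of live replica addresses and the key as a string, could look like this:

import java.net.InetAddress;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public final class ProposerAffinity {
    // Sort the live replicas by IP, hash the key, and use the hash as an
    // index so every client converges on the same proposer for that key.
    static InetAddress pickProposer(List<InetAddress> liveReplicas, String key) {
        List<InetAddress> sorted = new ArrayList<>(liveReplicas);
        if (sorted.isEmpty()) {
            throw new IllegalStateException("no live replicas");
        }
        sorted.sort(Comparator.comparing(InetAddress::getHostAddress));
        int index = Math.floorMod(key.hashCode(), sorted.size());
        return sorted.get(index);
    }
}

As long as contending clients see the same set of live replicas, they all route their LWT to the same node, which is what removes most of the dueling-proposer contention.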
>>
>> Batching is of limited usefulness because we only use
>> Paxos for CAS I think? So in a batch by definition all but
>> one will fail the CAS. This is something where a
>> distinguished coordinator could help by failing the rest
>> of the contending requests more inexpensively than it
>> currently does.
>>
>>
>> Ariel
>>
>> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>>>
>>>
>>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg wrote:
>>>
>>> Hi,
>>>
>>> Classic Paxos doesn't have a leader. There are
>>> variants on the original Lamport approach that will
>>> elect a leader (or some other variation like Mencius)
>>> to improve throughput, latency, and performance under
>>> contention. Cassandra implements the approach from
>>> the beginning of "Paxos Made Simple"
>>> (https://goo.gl/SrP0Wb) with no additional
>>> optimizations that I am aware of. There is no
>>> distinguished proposer (leader).
>>>
>>> That paper does go on to discuss electing a
>>> distinguished proposer, but that was never done for
>>> C*. I believe it's not considered a good fit for C*
>>> philosophically.
>>>
>>> Ariel
>>>
>>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
 @Ariel Weisberg EPaxos looks very interesting as it
 looks like it doesn't need any designated leader for
 C* but I am assuming the paxos that is implemented
 today for LWT's requires Leader election and If so,
 don't we need to have an odd number of nodes or
 racks or DC's to satisfy N = 2F + 1 constraint to
 tolerate F failures ? I understand it is not needed
 when not using LWT's since Cassandra is a
 master-less system.

 On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali wrote:

 Thanks Ariel! Yes I knew there are so many
 variations and optimizations of Paxos. I just
 wanted to see if we had any plans on improving
  

Re: How does cassandra achieve Linearizability?

2017-02-22 Thread Kant Kodali
I hope that patch is reviewed as quickly as possible. We use LWT's heavily
and we are getting a throughput of 600 writes/sec and each write is 1KB in
our case.





On Wed, Feb 22, 2017 at 7:48 AM, Edward Capriolo 
wrote:

>
>
> On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg  wrote:
>
>> Hi,
>>
>> No it's not going to be in 3.11.x. The earliest release it could make it
>> into is 4.0.
>>
>> Ariel
>>
>> On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
>>
>> Hi Ariel,
>>
>> Can we really expect the fix in 3.11.x as the ticket
>> https://issues.apache.org/jira/browse/CASSANDRA-6246 says?
>>
>> Thanks,
>> kant
>>
>> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg 
>> wrote:
>>
>>
>> Hi,
>>
>> That would work and would help a lot with the dueling proposer issue.
>>
>> A lot of the leader election stuff is designed to reduce the number of
>> roundtrips and not just address the dueling proposer issue. Those will have
>> downtime because it's there for correctness. Just adding an affinity for a
>> specific proposer is probably a free lunch.
>>
>> I don't think you can group keys because the Paxos proposals are per
>> partition which is why we get linear scale out for Paxos. I don't believe
>> it's linearizable across multiple partitions. You can use the clustering
>> key and deterministically pick one of the live replicas for that clustering
>> key. Sort the list of replicas by IP, hash the clustering key, use the hash
>> as an index into the list of replicas.
>>
>> Batching is of limited usefulness because we only use Paxos for CAS I
>> think? So in a batch by definition all but one will fail the CAS. This is
>> something where a distinguished coordinator could help by failing the rest
>> of the contending requests more inexpensively than it currently does.
>>
>>
>> Ariel
>>
>> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>>
>>
>>
>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg 
>> wrote:
>>
>>
>> Hi,
>>
>> Classic Paxos doesn't have a leader. There are variants on the original
>> Lamport approach that will elect a leader (or some other variation like
>> Mencius) to improve throughput, latency, and performance under contention.
>> Cassandra implements the approach from the beginning of "Paxos Made Simple"
>> (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware
>> of. There is no distinguished proposer (leader).
>>
>> That paper does go on to discuss electing a distinguished proposer, but
>> that was never done for C*. I believe it's not considered a good fit for C*
>> philosophically.
>>
>> Ariel
>>
>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>>
>> @Ariel Weisberg EPaxos looks very interesting as it looks like it doesn't
>> need any designated leader for C* but I am assuming the paxos that is
>> implemented today for LWT's requires Leader election and If so, don't we
>> need to have an odd number of nodes or racks or DC's to satisfy N = 2F + 1
>> constraint to tolerate F failures ? I understand it is not needed when not
>> using LWT's since Cassandra is a master-less system.
>>
>> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali  wrote:
>>
>> Thanks Ariel! Yes I knew there are so many variations and optimizations
>> of Paxos. I just wanted to see if we had any plans on improving the
>> existing Paxos implementation and it is great to see the work is under
>> progress! I am going to follow that ticket and read up the references
>> pointed in it
>>
>>
>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg 
>> wrote:
>>
>>
>> Hi,
>>
>> Cassandra's implementation of Paxos doesn't implement many optimizations
>> that would drastically improve throughput and latency. You need consensus,
>> but it doesn't have to be exorbitantly expensive and fall over under any
>> kind of contention.
>>
>> For instance you could implement EPaxos https://issues.apache.org/jira/browse/CASSANDRA-6246,
>> batch multiple operations into the same Paxos round, have an affinity for a
>> specific proposer for a specific partition, implement asynchronous commit,
>> use a more efficient implementation of the Paxos log, and maybe other
>> things.
>>
>>
>> Ariel
>>
>>
>>
>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>>
>> Hi Kant,
>>
>> If you read the published papers about Paxos, you will most probably
>> recognize that there is no way to "do it better". This is a conceptual
>> thing due to the nature of distributed systems + the CAP theorem.
>> If you want A+P in the triangle, then C is very expensive. C* is made for
>> A+P mostly with tunable C. In ACID databases this is a completely different
>> thing as they are mostly either not partition tolerant, not highly
>> available or not 

Re: Pluggable throttling of read and write queries

2017-02-22 Thread Abhishek Verma
On Wed, Feb 22, 2017 at 4:01 PM, Jay Zhuang  wrote:

> Here is the Scheduler interface: https://github.com/apache/cassandra/blob/cassandra-3.11/conf/cassandra.yaml#L978


> Seems like it could be used for this case.

The IRequestScheduler is only used in the thrift Cassandra server code
path:
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/thrift/CassandraServer.java#L1870


Almost all of our customers are using CQL instead of thrift protocol, so we
won't be able to use the request scheduler.
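
(For reference, the thrift-only knobs being discussed lived in cassandra.yaml; the values below are illustrative, not a recommendation:)

request_scheduler: org.apache.cassandra.scheduler.RoundRobinScheduler
request_scheduler_id: keyspace
request_scheduler_options:
    throttle_limit: 80        # number of in-flight requests allowed per client queue
    default_weight: 5
    weights:
        Keyspace1: 1
        Keyspace2: 5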

It is removed in 4.x with thrift, not sure why:
> https://github.com/apache/cassandra/commit/4881d9c308ccd6b5ca70925bf6ebedb70e7705fc

Since it only works from thrift, it makes sense that it is being removed as
a part of that commit.

-Abhishek.

On 2/22/17 3:39 PM, Eric Stevens wrote:
>
>>> We’ve actually had several customers where we’ve done the opposite -
>>> split large clusters apart to separate use cases
>>
>> We do something similar but for a single application.  We're
>> functionally sharding data to different clusters from a single
>> application.  We can have different server classes for different types
>> of workloads, we can grow and size clusters accordingly, and we also do
>> things like time sharding so that we can let at-rest data go to cheaper
>> storage options.
>>
>> I agree with the general sentiment here that (at least as it stands
>> today) a monolithic cluster for many applications does not compete with
>> per-application clusters unless cost is no issue.  At our scale, the
>> terabytes of C* data we take in per day means that even very small cost
>> savings really add up at scale.  And even where cost is no issue, the
>> additional isolation and workload tailoring is still highly valuable.
>>
>> On Wed, Feb 22, 2017 at 12:01 PM Edward Capriolo wrote:
>>
>>
>> On Wed, Feb 22, 2017 at 1:20 PM, Abhishek Verma wrote:
>>
>> We have lots of dedicated Cassandra clusters for large use
>> cases, but we have a long tail of (~100) of internal customers
>> who want to store < 200GB of data with < 5k qps and non-critical
>> data. It does not make sense to create a 3 node dedicated
>> cluster for each of these small use cases. So we have a shared
>> cluster into which we onboard these users.
>>
>> But once in a while, one of the customers will run an ingest job
>> from HDFS which will pound the shared cluster and break our SLA
>> for the cluster for all the other customers. Currently, I don't
>> see any way to signal back pressure to the ingestion jobs or
>> throttle their requests. Another example is one customer doing a
>> large number of range queries which has the same effect.
>>
>> A simple way to avoid this is to throttle the read or write
>> requests based on some quota limits for each keyspace or user.
>>
>> Please see replies inlined:
>>
>> On Mon, Feb 20, 2017 at 11:46 PM, vincent gromakowski wrote:
>>
>> Aren't you using mesos Cassandra framework to manage your
>> multiple clusters ? (Seen a presentation in cass summit)
>>
>> Yes we are
>> using https://github.com/mesosphere/dcos-cassandra-service and
>> contribute heavily to it. I am aware of the presentation
>> (https://www.youtube.com/watch?v=4Ap-1VT2ChU) at the Cassandra
>> summit as I was the one who gave it :)
>> This has helped us automate the creation and management of these
>> clusters.
>>
>> What's wrong with your current mesos approach ?
>>
>> Hardware efficiency: Spinning up dedicated clusters for each use
>> case wastes a lot of hardware resources. One of the approaches
>> we have taken is spinning up multiple Cassandra nodes belonging
>> to different clusters on the same physical machine. However, we
>> still have overhead of managing these separate multi-tenant
>> clusters.
>>
>> I am also thinking it's better to split a large cluster into
>> smallers except if you also manage client layer that query
>> cass and you can put some backpressure or rate limit in it.
>>
>> We have an internal storage API layer that some of the clients
>> use, but there are many customers who use the vanilla DataStax
>> Java or Python driver. Implementing throttling in each of those
>> clients does not seem like a viable approach.
>>
>> On 21 Feb 2017 at 2:46 AM, "Edward Capriolo" wrote:
>>
>>
>> Older versions had a request scheduler api.
>>
>> I am 

Re: Pluggable throttling of read and write queries

2017-02-22 Thread Jay Zhuang
Here is the Scheduler interface: 
https://github.com/apache/cassandra/blob/cassandra-3.11/conf/cassandra.yaml#L978


Seems like it could be used for this case.

It is removed in 4.x with thrift, not sure why: 
https://github.com/apache/cassandra/commit/4881d9c308ccd6b5ca70925bf6ebedb70e7705fc


Thanks,
Jay

On 2/22/17 3:39 PM, Eric Stevens wrote:

We’ve actually had several customers where we’ve done the opposite -

split large clusters apart to separate use cases

We do something similar but for a single application.  We're
functionally sharding data to different clusters from a single
application.  We can have different server classes for different types
of workloads, we can grow and size clusters accordingly, and we also do
things like time sharding so that we can let at-rest data go to cheaper
storage options.

I agree with the general sentiment here that (at least as it stands
today) a monolithic cluster for many applications does not compete with
per-application clusters unless cost is no issue.  At our scale, the
terabytes of C* data we take in per day means that even very small cost
savings really add up at scale.  And even where cost is no issue, the
additional isolation and workload tailoring is still highly valuable.

On Wed, Feb 22, 2017 at 12:01 PM Edward Capriolo wrote:



On Wed, Feb 22, 2017 at 1:20 PM, Abhishek Verma wrote:

We have lots of dedicated Cassandra clusters for large use
cases, but we have a long tail of (~100) of internal customers
who want to store < 200GB of data with < 5k qps and non-critical
data. It does not make sense to create a 3 node dedicated
cluster for each of these small use cases. So we have a shared
cluster into which we onboard these users.

But once in a while, one of the customers will run an ingest job
from HDFS which will pound the shared cluster and break our SLA
for the cluster for all the other customers. Currently, I don't
see any way to signal back pressure to the ingestion jobs or
throttle their requests. Another example is one customer doing a
large number of range queries which has the same effect.

A simple way to avoid this is to throttle the read or write
requests based on some quota limits for each keyspace or user.

Please see replies inlined:

On Mon, Feb 20, 2017 at 11:46 PM, vincent gromakowski wrote:

Aren't you using mesos Cassandra framework to manage your
multiple clusters ? (Seen a presentation in cass summit)

Yes we are
using https://github.com/mesosphere/dcos-cassandra-service and
contribute heavily to it. I am aware of the presentation
(https://www.youtube.com/watch?v=4Ap-1VT2ChU) at the Cassandra
summit as I was the one who gave it :)
This has helped us automate the creation and management of these
clusters.

What's wrong with your current mesos approach ?

Hardware efficiency: Spinning up dedicated clusters for each use
case wastes a lot of hardware resources. One of the approaches
we have taken is spinning up multiple Cassandra nodes belonging
to different clusters on the same physical machine. However, we
still have overhead of managing these separate multi-tenant
clusters.

I am also thinking it's better to split a large cluster into
smallers except if you also manage client layer that query
cass and you can put some backpressure or rate limit in it.

We have an internal storage API layer that some of the clients
use, but there are many customers who use the vanilla DataStax
Java or Python driver. Implementing throttling in each of those
clients does not seem like a viable approach.

On 21 Feb 2017 at 2:46 AM, "Edward Capriolo" wrote:

Older versions had a request scheduler api.

I am not aware of the history behind it. Can you please point me
to the JIRA tickets and/or why it was removed?

On Monday, February 20, 2017, Ben Slater
 wrote:

We’ve actually had several customers where we’ve
done the opposite - split large clusters apart to
 separate use cases. We found that this allowed us
to better align hardware with use case requirements
(for example using AWS c3.2xlarge for very hot data
at low latency, m4.xlarge for more general purpose
data) we can also tune JVM settings, etc to meet

Re: Pluggable throttling of read and write queries

2017-02-22 Thread Eric Stevens
> We’ve actually had several customers where we’ve done the opposite -
> split large clusters apart to separate use cases

We do something similar but for a single application.  We're functionally
sharding data to different clusters from a single application.  We can have
different server classes for different types of workloads, we can grow and
size clusters accordingly, and we also do things like time sharding so that
we can let at-rest data go to cheaper storage options.

I agree with the general sentiment here that (at least as it stands today)
a monolithic cluster for many applications does not compete with
per-application clusters unless cost is no issue.  At our scale, the
terabytes of C* data we take in per day means that even very small cost
savings really add up at scale.  And even where cost is no issue, the
additional isolation and workload tailoring is still highly valuable.

On Wed, Feb 22, 2017 at 12:01 PM Edward Capriolo 
wrote:

>
>
> On Wed, Feb 22, 2017 at 1:20 PM, Abhishek Verma  wrote:
>
> We have lots of dedicated Cassandra clusters for large use cases, but we
> have a long tail of (~100) of internal customers who want to store < 200GB
> of data with < 5k qps and non-critical data. It does not make sense to
> create a 3 node dedicated cluster for each of these small use cases. So we
> have a shared cluster into which we onboard these users.
>
> But once in a while, one of the customers will run an ingest job from HDFS
> which will pound the shared cluster and break our SLA for the cluster for
> all the other customers. Currently, I don't see any way to signal back
> pressure to the ingestion jobs or throttle their requests. Another example
> is one customer doing a large number of range queries which has the same
> effect.
>
> A simple way to avoid this is to throttle the read or write requests based
> on some quota limits for each keyspace or user.
>
> Please see replies inlined:
>
> On Mon, Feb 20, 2017 at 11:46 PM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
> Aren't you using mesos Cassandra framework to manage your multiple
> clusters ? (Seen a presentation in cass summit)
>
> Yes we are using https://github.com/mesosphere/dcos-cassandra-service and
> contribute heavily to it. I am aware of the presentation (
> https://www.youtube.com/watch?v=4Ap-1VT2ChU) at the Cassandra summit as I
> was the one who gave it :)
> This has helped us automate the creation and management of these clusters.
>
> What's wrong with your current mesos approach ?
>
> Hardware efficiency: Spinning up dedicated clusters for each use case
> wastes a lot of hardware resources. One of the approaches we have taken is
> spinning up multiple Cassandra nodes belonging to different clusters on the
> same physical machine. However, we still have overhead of managing these
> separate multi-tenant clusters.
>
> I am also thinking it's better to split a large cluster into smallers
> except if you also manage client layer that query cass and you can put some
> backpressure or rate limit in it.
>
> We have an internal storage API layer that some of the clients use, but
> there are many customers who use the vanilla DataStax Java or Python
> driver. Implementing throttling in each of those clients does not seem like
> a viable approach.
>
> On 21 Feb 2017 at 2:46 AM, "Edward Capriolo" wrote:
>
> Older versions had a request scheduler api.
>
> I am not aware of the history behind it. Can you please point me to the
> JIRA tickets and/or why it was removed?
>
> On Monday, February 20, 2017, Ben Slater 
> wrote:
>
> We’ve actually had several customers where we’ve done the opposite - split
> large clusters apart to separate use cases. We found that this allowed us
> to better align hardware with use case requirements (for example using AWS
> c3.2xlarge for very hot data at low latency, m4.xlarge for more general
> purpose data) we can also tune JVM settings, etc to meet those use cases.
>
> There have been several instances where we have moved customers out of the
> shared cluster to their own dedicated clusters because they outgrew our
> limitations. But I don't think it makes sense to move all the small use
> cases into their separate clusters.
>
> On Mon, 20 Feb 2017 at 22:21 Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
> On Sat, Feb 18, 2017 at 3:12 AM, Abhishek Verma  wrote:
>
> Cassandra is being used on a large scale at Uber. We usually create
> dedicated clusters for each of our internal use cases, however that is
> difficult to scale and manage.
>
> We are investigating the approach of using a single shared cluster with
> 100s of nodes and handle 10s to 100s of different use cases for different
> products in the same cluster. We can define different keyspaces for each of
> them, but that does not help in case of noisy neighbors.
>
> Does anybody in the community have 

Re: Pluggable throttling of read and write queries

2017-02-22 Thread Edward Capriolo
On Wed, Feb 22, 2017 at 1:20 PM, Abhishek Verma  wrote:

> We have lots of dedicated Cassandra clusters for large use cases, but we
> have a long tail of (~100) of internal customers who want to store < 200GB
> of data with < 5k qps and non-critical data. It does not make sense to
> create a 3 node dedicated cluster for each of these small use cases. So we
> have a shared cluster into which we onboard these users.
>
> But once in a while, one of the customers will run an ingest job from HDFS
> which will pound the shared cluster and break our SLA for the cluster for
> all the other customers. Currently, I don't see any way to signal back
> pressure to the ingestion jobs or throttle their requests. Another example
> is one customer doing a large number of range queries which has the same
> effect.
>
> A simple way to avoid this is to throttle the read or write requests based
> on some quota limits for each keyspace or user.
>
> Please see replies inlined:
>
> On Mon, Feb 20, 2017 at 11:46 PM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Aren't you using mesos Cassandra framework to manage your multiple
>> clusters ? (Seen a presentation in cass summit)
>>
> Yes we are using https://github.com/mesosphere/dcos-cassandra-service and
> contribute heavily to it. I am aware of the presentation (
> https://www.youtube.com/watch?v=4Ap-1VT2ChU) at the Cassandra summit as I
> was the one who gave it :)
> This has helped us automate the creation and management of these clusters.
>
>> What's wrong with your current mesos approach ?
>>
> Hardware efficiency: Spinning up dedicated clusters for each use case
> wastes a lot of hardware resources. One of the approaches we have taken is
> spinning up multiple Cassandra nodes belonging to different clusters on the
> same physical machine. However, we still have overhead of managing these
> separate multi-tenant clusters.
>
>> I am also thinking it's better to split a large cluster into smallers
>> except if you also manage client layer that query cass and you can put some
>> backpressure or rate limit in it.
>>
> We have an internal storage API layer that some of the clients use, but
> there are many customers who use the vanilla DataStax Java or Python
> driver. Implementing throttling in each of those clients does not seem like
> a viable approach.
>
> On 21 Feb 2017 at 2:46 AM, "Edward Capriolo" wrote:
>>
>>> Older versions had a request scheduler api.
>>
>> I am not aware of the history behind it. Can you please point me to the
> JIRA tickets and/or why it was removed?
>
> On Monday, February 20, 2017, Ben Slater 
>>> wrote:
>>>
 We’ve actually had several customers where we’ve done the opposite -
 split large clusters apart to separate use cases. We found that this
 allowed us to better align hardware with use case requirements (for example
 using AWS c3.2xlarge for very hot data at low latency, m4.xlarge for more
 general purpose data) we can also tune JVM settings, etc to meet those use
 cases.

>>> There have been several instances where we have moved customers out of
> the shared cluster to their own dedicated clusters because they outgrew our
> limitations. But I don't think it makes sense to move all the small use
> cases into their separate clusters.
>
> On Mon, 20 Feb 2017 at 22:21 Oleksandr Shulgin <
 oleksandr.shul...@zalando.de> wrote:

> On Sat, Feb 18, 2017 at 3:12 AM, Abhishek Verma 
> wrote:
>
>> Cassandra is being used on a large scale at Uber. We usually create
>> dedicated clusters for each of our internal use cases, however that is
>> difficult to scale and manage.
>>
>> We are investigating the approach of using a single shared cluster
>> with 100s of nodes and handle 10s to 100s of different use cases for
>> different products in the same cluster. We can define different keyspaces
>> for each of them, but that does not help in case of noisy neighbors.
>>
>> Does anybody in the community have similar large shared clusters
>> and/or face noisy neighbor issues?
>>
>
> Hi,
>
> We've never tried this approach and given my limited experience I
> would find this a terrible idea from the perspective of maintenance
> (remember the old saying about basket and eggs?)
>
 What if you have a limited number of baskets and several eggs which are
> not critical if they break rarely.
>
>
>> What potential benefits do you see?
>
 The main benefit of sharing a single cluster among several small use
> cases is increasing the hardware efficiency and decreasing the management
> overhead of a large number of clusters.
>
> Thanks everyone for your replies and questions.
>
> -Abhishek.
>

I agree with these assertions. On one hand I think about a "managed
service" like say Amazon DynamoDB. They likely start with very/very/very
large 

Re: OpsCenter w/SSL

2017-02-22 Thread Jacob Shadix
If I start the agent on the cluster with encryption, I see lots of these
messages in the C* logs -

Unexpected exception during request; channel
io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record

And get an error connecting to the cluster from opscenterd.

-- Jacob Shadix

On Wed, Feb 22, 2017 at 1:13 PM, Bulat Shakirzyanov <
bulat.shakirzya...@datastax.com> wrote:

> Hi Jacob,
>
> OpsCenter supports simultaneous management of Cassandra clusters both with
> and without client-to-node encryption enabled.
>
> The only time you'd need to use SSL everywhere, is when encrypting
> OpsCenter Daemon to OpsCenter Agents connections. In that case, you have to
> make sure all OpsCenter Agents connecting to a given OpsCenter Daemon use
> SSL even if those agents belong to different Cassandra clusters.
>
>
> On Wed, Feb 22, 2017 at 11:18 AM, Jacob Shadix 
> wrote:
>
>> I have OpsCenter 6.0.7 setup managing multiple clusters. Only one of
>> those clusters has encryption turned on (both node-to-node and
>> client-to-node). In order to manage that cluster through OpsCenter, do all
>> subsequent clusters have to have encryption turned on?
>>
>> -- Jacob Shadix
>>
>
>
>
> --
> Cheers,
> Bulat Shakirzyanov | @avalanche123 
>


Re: Pluggable throttling of read and write queries

2017-02-22 Thread Abhishek Verma
We have lots of dedicated Cassandra clusters for large use cases, but we
have a long tail (~100) of internal customers who want to store < 200GB
of data with < 5k qps and non-critical data. It does not make sense to
create a 3 node dedicated cluster for each of these small use cases. So we
have a shared cluster into which we onboard these users.

But once in a while, one of the customers will run an ingest job from HDFS
which will pound the shared cluster and break our SLA for the cluster for
all the other customers. Currently, I don't see any way to signal back
pressure to the ingestion jobs or throttle their requests. Another example
is one customer doing a large number of range queries which has the same
effect.

A simple way to avoid this is to throttle the read or write requests based
on some quota limits for each keyspace or user.
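
As a rough sketch of what such a quota could look like in an API layer or client wrapper (this is not an existing Cassandra feature; the class name and numbers are hypothetical, using Guava's RateLimiter):

import com.google.common.util.concurrent.RateLimiter;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class KeyspaceThrottle {
    // One token bucket per keyspace; the quota value is illustrative.
    private final Map<String, RateLimiter> limiters = new ConcurrentHashMap<>();
    private final double permitsPerSecond;

    public KeyspaceThrottle(double permitsPerSecond) {
        this.permitsPerSecond = permitsPerSecond;
    }

    // Returns true if the request may proceed, false if the keyspace is over its quota.
    public boolean tryAcquire(String keyspace) {
        return limiters
                .computeIfAbsent(keyspace, ks -> RateLimiter.create(permitsPerSecond))
                .tryAcquire();
    }
}

A coordinator-side check like throttle.tryAcquire(keyspace) before each read or write would let us shed or queue one tenant's traffic instead of letting a single ingest job break the SLA for everyone.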

Please see replies inlined:

On Mon, Feb 20, 2017 at 11:46 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> Aren't you using mesos Cassandra framework to manage your multiple
> clusters ? (Seen a presentation in cass summit)
>
Yes we are using https://github.com/mesosphere/dcos-cassandra-service and
contribute heavily to it. I am aware of the presentation (
https://www.youtube.com/watch?v=4Ap-1VT2ChU) at the Cassandra summit as I
was the one who gave it :)
This has helped us automate the creation and management of these clusters.

> What's wrong with your current mesos approach ?
>
Hardware efficiency: Spinning up dedicated clusters for each use case
wastes a lot of hardware resources. One of the approaches we have taken is
spinning up multiple Cassandra nodes belonging to different clusters on the
same physical machine. However, we still have overhead of managing these
separate multi-tenant clusters.

> I am also thinking it's better to split a large cluster into smaller ones,
> unless you also manage the client layer that queries Cassandra and can put some
> backpressure or rate limiting in it.
>
We have an internal storage API layer that some of the clients use, but
there are many customers who use the vanilla DataStax Java or Python
driver. Implementing throttling in each of those clients does not seem like
a viable approach.

On 21 Feb 2017 at 2:46 AM, "Edward Capriolo" wrote:
>
>> Older versions had a request scheduler api.
>
> I am not aware of the history behind it. Can you please point me to the
JIRA tickets and/or why it was removed?

On Monday, February 20, 2017, Ben Slater  wrote:
>>
>>> We’ve actually had several customers where we’ve done the opposite -
>>> split large clusters apart to separate use cases. We found that this
>>> allowed us to better align hardware with use case requirements (for example
>>> using AWS c3.2xlarge for very hot data at low latency, m4.xlarge for more
>>> general purpose data) we can also tune JVM settings, etc to meet those use
>>> cases.
>>>
>> There have been several instances where we have moved customers out of
the shared cluster to their own dedicated clusters because they outgrew our
limitations. But I don't think it makes sense to move all the small use
cases into their separate clusters.

On Mon, 20 Feb 2017 at 22:21 Oleksandr Shulgin 
>>> wrote:
>>>
 On Sat, Feb 18, 2017 at 3:12 AM, Abhishek Verma  wrote:

> Cassandra is being used on a large scale at Uber. We usually create
> dedicated clusters for each of our internal use cases, however that is
> difficult to scale and manage.
>
> We are investigating the approach of using a single shared cluster
> with 100s of nodes and handle 10s to 100s of different use cases for
> different products in the same cluster. We can define different keyspaces
> for each of them, but that does not help in case of noisy neighbors.
>
> Does anybody in the community have similar large shared clusters
> and/or face noisy neighbor issues?
>

 Hi,

 We've never tried this approach and given my limited experience I would
 find this a terrible idea from the perspective of maintenance (remember the
 old saying about basket and eggs?)

>>> What if you have a limited number of baskets and several eggs which are
not critical if they break rarely.


> What potential benefits do you see?

>>> The main benefit of sharing a single cluster among several small use
cases is increasing the hardware efficiency and decreasing the management
overhead of a large number of clusters.

Thanks everyone for your replies and questions.

-Abhishek.


Re: OpsCenter w/SSL

2017-02-22 Thread Bulat Shakirzyanov
Hi Jacob,

OpsCenter supports simultaneous management of Cassandra clusters both with
and without client-to-node encryption enabled.

The only time you'd need to use SSL everywhere, is when encrypting
OpsCenter Daemon to OpsCenter Agents connections. In that case, you have to
make sure all OpsCenter Agents connecting to a given OpsCenter Daemon use
SSL even if those agents belong to different Cassandra clusters.


On Wed, Feb 22, 2017 at 11:18 AM, Jacob Shadix 
wrote:

> I have OpsCenter 6.0.7 setup managing multiple clusters. Only one of those
> clusters has encryption turned on (both node-to-node and client-to-node).
> In order to manage that cluster through OpsCenter, do all subsequent
> clusters have to have encryption turned on?
>
> -- Jacob Shadix
>



-- 
Cheers,
Bulat Shakirzyanov | @avalanche123 


OpsCenter w/SSL

2017-02-22 Thread Jacob Shadix
I have OpsCenter 6.0.7 setup managing multiple clusters. Only one of those
clusters has encryption turned on (both node-to-node and client-to-node).
In order to manage that cluster through OpsCenter, do all subsequent
clusters have to have encryption turned on?

-- Jacob Shadix


Re: How does cassandra achieve Linearizability?

2017-02-22 Thread Edward Capriolo
On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg  wrote:

> Hi,
>
> No it's not going to be in 3.11.x. The earliest release it could make it
> into is 4.0.
>
> Ariel
>
> On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
>
> Hi Ariel,
>
> Can we really expect the fix in 3.11.x as the ticket
> https://issues.apache.org/jira/browse/CASSANDRA-6246 says?
>
> Thanks,
> kant
>
> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg  wrote:
>
>
> Hi,
>
> That would work and would help a lot with the dueling proposer issue.
>
> A lot of the leader election stuff is designed to reduce the number of
> roundtrips and not just address the dueling proposer issue. Those will have
> downtime because it's there for correctness. Just adding an affinity for a
> specific proposer is probably a free lunch.
>
> I don't think you can group keys because the Paxos proposals are per
> partition which is why we get linear scale out for Paxos. I don't believe
> it's linearizable across multiple partitions. You can use the clustering
> key and deterministically pick one of the live replicas for that clustering
> key. Sort the list of replicas by IP, hash the clustering key, use the hash
> as an index into the list of replicas.
>
> Batching is of limited usefulness because we only use Paxos for CAS I
> think? So in a batch by definition all but one will fail the CAS. This is
> something where a distinguished coordinator could help by failing the rest
> of the contending requests more inexpensively than it currently does.
>
>
> Ariel
>
> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>
>
>
> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg  wrote:
>
>
> Hi,
>
> Classic Paxos doesn't have a leader. There are variants on the original
> Lamport approach that will elect a leader (or some other variation like
> Mencius) to improve throughput, latency, and performance under contention.
> Cassandra implements the approach from the beginning of "Paxos Made Simple"
> (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware
> of. There is no distinguished proposer (leader).
>
> That paper does go on to discuss electing a distinguished proposer, but
> that was never done for C*. I believe it's not considered a good fit for C*
> philosophically.
>
> Ariel
>
> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>
> @Ariel Weisberg EPaxos looks very interesting as it looks like it doesn't
> need any designated leader for C* but I am assuming the paxos that is
> implemented today for LWT's requires Leader election and If so, don't we
> need to have an odd number of nodes or racks or DC's to satisfy N = 2F + 1
> constraint to tolerate F failures ? I understand it is not needed when not
> using LWT's since Cassandra is a master-less system.
>
> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali  wrote:
>
> Thanks Ariel! Yes I knew there are so many variations and optimizations of
> Paxos. I just wanted to see if we had any plans on improving the existing
> Paxos implementation and it is great to see the work is under progress! I
> am going to follow that ticket and read up the references pointed in it
>
>
> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg  wrote:
>
>
> Hi,
>
> Cassandra's implementation of Paxos doesn't implement many optimizations
> that would drastically improve throughput and latency. You need consensus,
> but it doesn't have to be exorbitantly expensive and fall over under any
> kind of contention.
>
> For instance you could implement EPaxos https://issues.apache.org/jira/browse/CASSANDRA-6246,
> batch multiple operations into the same Paxos round, have an affinity for a
> specific proposer for a specific partition, implement asynchronous commit,
> use a more efficient implementation of the Paxos log, and maybe other
> things.
>
>
> Ariel
>
>
>
> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>
> Hi Kant,
>
> If you read the published papers about Paxos, you will most probably
> recognize that there is no way to "do it better". This is a conceptual
> thing due to the nature of distributed systems + the CAP theorem.
> If you want A+P in the triangle, then C is very expensive. C* is made for
> A+P mostly with tunable C. In ACID databases this is a completely different
> thing as they are mostly either not partition tolerant, not highly
> available or not scalable (in a distributed manner, not speaking of
> "monolithic super servers").
>
> There is no free lunch ...
>
>
> 2017-02-10 11:09 GMT+01:00 Kant Kodali :
>
> "That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra."
>
> yes LWT's are expensive. Are there any plans to make this better?
>
> On Fri, Feb 10, 2017 at 12:17 AM, 

Re: How does cassandra achieve Linearizability?

2017-02-22 Thread Ariel Weisberg
Hi,



No it's not going to be in 3.11.x. The earliest release it could make it
into is 4.0.


Ariel



On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:

> Hi Ariel,

> 

> Can we really expect the fix in 3.11.x as the ticket
> https://issues.apache.org/jira/browse/CASSANDRA-6246 says?
> 

> Thanks,

> kant

> 

> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg wrote:

>> Hi,

>> 

>> That would work and would help a lot with the dueling proposer issue.
>> 

>> A lot of the leader election stuff is designed to reduce the number
>> of roundtrips and not just address the dueling proposer issue. Those
>> will have downtime because it's there for correctness. Just adding an
>> affinity for a specific proposer is probably a free lunch.
>> 

>> I don't think you can group keys because the Paxos proposals are per
>> partition which is why we get linear scale out for Paxos. I don't
>> believe it's linearizable across multiple partitions. You can use the
>> clustering key and deterministically pick one of the live replicas
>> for that clustering key. Sort the list of replicas by IP, hash the
>> clustering key, use the hash as an index into the list of replicas.
>> 

>> Batching is of limited usefulness because we only use Paxos for CAS I
>> think? So in a batch by definition all but one will fail the CAS.
>> This is something where a distinguished coordinator could help by
>> failing the rest of the contending requests more inexpensively than
>> it currently does.
>> 

>> 

>> Ariel

>> 

>> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:

>>> 

>>> 

>>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg wrote:

 Hi,

 

 Classic Paxos doesn't have a leader. There are variants on the
 original Lamport approach that will elect a leader (or some other
 variation like Mencius) to improve throughput, latency, and
 performance under contention. Cassandra implements the approach
 from the beginning of "Paxos Made Simple" (https://goo.gl/SrP0Wb)
 with no additional optimizations that I am aware of. There is no
 distinguished proposer (leader).
 

 That paper does  go on to discuss electing a distinguished
 proposer, but that was never done for C*. I believe it's not
 considered a good fit for C* philosophically.
 

 Ariel

 

 On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:

> @Ariel Weisberg EPaxos looks very interesting as it looks like it
> doesn't need any designated leader for C* but I am assuming the
> paxos that is implemented today for LWT's requires Leader election
> and If so, don't we need to have an odd number of nodes or racks
> or DC's to satisfy N = 2F + 1 constraint to tolerate F failures ?
> I understand it is not needed when not using LWT's since Cassandra
> is a master-less system.
> 

> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali 
> wrote:
>> Thanks Ariel! Yes I knew there are so many variations and
>> optimizations of Paxos. I just wanted to see if we had any plans
>> on improving the existing Paxos implementation and it is great to
>> see the work is under progress! I am going to follow that ticket
>> and read up the references pointed in it
>> 

>> 

>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg wrote:

>>> Hi,

>>> 

>>> Cassandra's implementation of Paxos doesn't implement many
>>> optimizations that would drastically improve throughput and
>>> latency. You need consensus, but it doesn't have to be
>>> exorbitantly expensive and fall over under any kind of
>>> contention.
>>> 

>>> For instance you could implement EPaxos
>>> https://issues.apache.org/jira/browse/CASSANDRA-6246, batch
>>> multiple operations into the same Paxos round, have an affinity
>>> for a specific proposer for a specific partition, implement
>>> asynchronous commit, use a more efficient implementation of the
>>> Paxos log, and maybe other things.
>>> 

>>> 

>>> Ariel

>>> 

>>> 

>>> 

>>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:

 Hi Kant,

 

 If you read the published papers about Paxos, you will most
 probably recognize that there is no way to "do it better". This
 is a conceptual thing due to the nature of distributed
 systems + the CAP theorem.
 If you want A+P in the triangle, then C is very expensive. C*
 is made for A+P mostly with tunable C. In ACID databases this
 is a completely different thing as they are mostly either not
 partition tolerant, not highly available or not scalable (in a
 distributed manner, not speaking of "monolithic super
 servers").
 

 There is no free lunch ...

 

 

 

Re: Trouble implementing CAS operation with LWT query

2017-02-22 Thread Edward Capriolo
On Wed, Feb 22, 2017 at 8:42 AM, 안정아  wrote:

> Hi, all
>
>
>
> I'm trying to implement a typical CAS operation with an LWT query
> (conditional update).
>
> But I'm having trouble keeping integrity of the result when
> WriteTimeoutException occurs.
>
> according to http://www.datastax.com/dev/blog/cassandra-error-handling-done-right
>
> "If the paxos phase fails, the driver will throw a WriteTimeoutException
> with a WriteType.CAS as retrieved with WriteTimeoutException#getWriteType().
>
> In this situation you can’t know if the CAS operation has been applied..."
>
> 1) Doesn't it ruin the whole point of using LWT for a CAS operation if you
> can't be sure whether the query was applied or not?
>
> 2-1) Is there any way to know whether the query was applied when a timeout
> occurred?
>
> 2-2) If we can't tell, is there any way to work around this and keep CAS
> integrity?
>
>
>
> Thanks!
>
>
>
>
>

What you might first try to do is count the timeouts:

https://github.com/edwardcapriolo/ec/blob/master/src/test/java/Base/CompareAndSwapTest.java

https://github.com/edwardcapriolo/ec/blob/master/src/test/java/Base/CompareAndSwapTest.java#L99

This tactic does not work.

However, you can keep re-reading at CL.SERIAL to determine if the update
applied.

What I found this to mean is that you CAN'T do this:

for (int i = 0; i < 2000; i++) {
    new Thread(() -> doCasInsert()).start();
}

Assert.assertEquals(2000, getTotalInserts());

But you CAN do this:

for (int i = 0; i < 2000; i++) {
    new Thread(() -> {
        // execute "SELECT count(*) ..." at ConsistencyLevel.SERIAL first
        long count = selectCountAtSerial();
        if (count < 2000) {
            doCasInsert();
        }
    }).start();
}


Essentially, because you can't know whether a CAS operation will eventually
succeed after a client timeout, you can not "COUNT" on the insert side.
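
To make that concrete, here is a minimal sketch using the DataStax Java driver 3.x against a hypothetical table ks.kv (key text PRIMARY KEY, value text); it illustrates the re-read-at-SERIAL idea, it is not code from this thread:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class CasInsertExample {

    // Returns true if key now maps to value, whether or not our own attempt won.
    static boolean casInsert(Session session, String key, String value) {
        try {
            ResultSet rs = session.execute(new SimpleStatement(
                    "INSERT INTO ks.kv (key, value) VALUES (?, ?) IF NOT EXISTS", key, value));
            return rs.wasApplied();
        } catch (WriteTimeoutException e) {
            if (e.getWriteType() != WriteType.CAS) {
                throw e; // timed out outside the Paxos phase; handle separately
            }
            // Outcome of the Paxos phase is unknown: a SERIAL read completes any
            // in-progress round, then we check what actually landed.
            Row row = session.execute(new SimpleStatement(
                    "SELECT value FROM ks.kv WHERE key = ?", key)
                    .setConsistencyLevel(ConsistencyLevel.SERIAL)).one();
            return row != null && value.equals(row.getString("value"));
        }
    }

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            System.out.println(casInsert(session, "k1", "v1"));
        }
    }
}

The SERIAL read forces any in-progress Paxos round for that partition to complete, so inspecting the row afterwards tells you whether your conditional insert ultimately took effect.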


Trouble implementing CAS operation with LWT query

2017-02-22 Thread 안정아


Hi, all
 
I'm trying to implement a typical CAS operation with an LWT query (conditional update). 
But I'm having trouble keeping integrity of the result when WriteTimeoutException occurs. 
according to http://www.datastax.com/dev/blog/cassandra-error-handling-done-right 
"If the paxos phase fails, the driver will throw a WriteTimeoutException with a WriteType.
CAS as retrieved with WriteTimeoutException#getWriteType().
In this situation you can’t know if the CAS operation has been applied..." 
1) Doesn't it ruin the whole point of using LWT for a CAS operation if you can't be sure whether the query was applied or not? 
2-1) Is there any way to know whether the query was applied when a timeout occurred? 
2-2) If we can't tell, is there any way to work around this and keep CAS integrity?
 
Thanks! 
 


Re: How does cassandra achieve Linearizability?

2017-02-22 Thread Kant Kodali
Hi Ariel,

Can we really expect the fix in 3.11.x as the ticket https://issues.apache.org/jira/browse/CASSANDRA-6246 says?

Thanks,
kant

On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg  wrote:

> Hi,
>
> That would work and would help a lot with the dueling proposer issue.
>
> A lot of the leader election stuff is designed to reduce the number of
> roundtrips and not just address the dueling proposer issue. Those will have
> downtime because it's there for correctness. Just adding an affinity for a
> specific proposer is probably a free lunch.
>
> I don't think you can group keys because the Paxos proposals are per
> partition which is why we get linear scale out for Paxos. I don't believe
> it's linearizable across multiple partitions. You can use the clustering
> key and deterministically pick one of the live replicas for that clustering
> key. Sort the list of replicas by IP, hash the clustering key, use the hash
> as an index into the list of replicas.
>
> Batching is of limited usefulness because we only use Paxos for CAS I
> think? So in a batch by definition all but one will fail the CAS. This is
> something where a distinguished coordinator could help by failing the rest
> of the contending requests more inexpensively than it currently does.
>
> Ariel
> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>
>
>
> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg  wrote:
>
>
> Hi,
>
> Classic Paxos doesn't have a leader. There are variants on the original
> Lamport approach that will elect a leader (or some other variation like
> Mencius) to improve throughput, latency, and performance under contention.
> Cassandra implements the approach from the beginning of "Paxos Made Simple"
> (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware
> of. There is no distinguished proposer (leader).
>
> That paper does go on to discuss electing a distinguished proposer, but
> that was never done for C*. I believe it's not considered a good fit for C*
> philosophically.
>
> Ariel
>
> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>
> @Ariel Weisberg EPaxos looks very interesting as it looks like it doesn't
> need any designated leader for C* but I am assuming the paxos that is
> implemented today for LWT's requires Leader election and If so, don't we
> need to have an odd number of nodes or racks or DC's to satisfy N = 2F + 1
> constraint to tolerate F failures ? I understand it is not needed when not
> using LWT's since Cassandra is a master-less system.
>
> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali  wrote:
>
> Thanks Ariel! Yes I knew there are so many variations and optimizations of
> Paxos. I just wanted to see if we had any plans on improving the existing
> Paxos implementation and it is great to see the work is under progress! I
> am going to follow that ticket and read up the references pointed in it
>
>
> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg  wrote:
>
>
> Hi,
>
> Cassandra's implementation of Paxos doesn't implement many optimizations
> that would drastically improve throughput and latency. You need consensus,
> but it doesn't have to be exorbitantly expensive and fall over under any
> kind of contention.
>
> For instance you could implement EPaxos https://issues.apache.org/jira/browse/CASSANDRA-6246,
> batch multiple operations into the same Paxos round, have an affinity for a
> specific proposer for a specific partition, implement asynchronous commit,
> use a more efficient implementation of the Paxos log, and maybe other
> things.
>
>
> Ariel
>
>
>
> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>
> Hi Kant,
>
> If you read the published papers about Paxos, you will most probably
> recognize that there is no way to "do it better". This is a conceptual
> thing due to the nature of distributed systems + the CAP theorem.
> If you want A+P in the triangle, then C is very expensive. C* is made for
> A+P mostly with tunable C. In ACID databases this is a completely different
> thing as they are mostly either not partition tolerant, not highly
> available or not scalable (in a distributed manner, not speaking of
> "monolithic super servers").
>
> There is no free lunch ...
>
>
> 2017-02-10 11:09 GMT+01:00 Kant Kodali :
>
> "That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra."
>
> yes LWT's are expensive. Are there any plans to make this better?
>
> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali  wrote:
>
> Hi Jon,
>
> Thanks a lot for your response. I am well aware that LWW != LWT, but I
> was talking more in terms of LWW with respect to LWT's, which I believe
> you answered. So thanks much!
>
>
> kant
>
>
> On Thu, Feb 9,