Re: How does cassandra achieve Linearizability?

2017-02-26 Thread Kant Kodali
Is there a way to apply the commits from the
https://github.com/bdeggleston/cassandra/tree/CASSANDRA-6246-trunk branch
to the Apache Cassandra 3.10 branch? I thought I could just merge the two
branches, but there appear to be several trunks and I am not sure which
trunk I should be merging into.
I only want to merge it to try it out on my local machine.

Thanks!

On Wed, Feb 22, 2017 at 8:04 PM, Michael Shuler wrote:

> I updated the fix version on CASSANDRA-6246 to 4.x. The 3.11.x edit was
> a bulk move when removing the cassandra-3.X branch and the 3.x Jira
> version. There are likely other new feature tickets that should really
> say 4.x.
>
> --
> Kind regards,
> Michael
>
> On 02/22/2017 07:28 PM, Kant Kodali wrote:
> > I hope that patch is reviewed as quickly as possible. We use LWT's
> > heavily and we are getting a throughput of 600 writes/sec and each write
> > is 1KB in our case.
> >
> >
> >
> >
> >
> > On Wed, Feb 22, 2017 at 7:48 AM, Edward Capriolo wrote:
> >
> >
> >
> > On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg wrote:
> >
> > Hi,
> >
> > No it's not going to be in 3.11.x. The earliest release it could
> > make it into is 4.0.
> >
> > Ariel
> >
> > On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
> >> Hi Ariel,
> >>
> >> Can we really expect the fix in 3.11.x as the
> >> ticket https://issues.apache.org/jira/browse/CASSANDRA-6246 says?
> >>
> >> Thanks,
> >> kant
> >>
> >> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg
> >> > wrote:
> >>
> >> Hi,
> >>
> >> That would work and would help a lot with the dueling
> >> proposer issue.
> >>
> >> A lot of the leader election stuff is designed to reduce
> >> the number of roundtrips and not just address the dueling
> >> proposer issue. Those will have downtime because it's
> >> there for correctness. Just adding an affinity for a
> >> specific proposer is probably a free lunch.
> >>
> >> I don't think you can group keys because the Paxos
> >> proposals are per partition which is why we get linear
> >> scale out for Paxos. I don't believe it's linearizable
> >> across multiple partitions. You can use the clustering key
> >> and deterministically pick one of the live replicas for
> >> that clustering key. Sort the list of replicas by IP, hash
> >> the clustering key, use the hash as an index into the list
> >> of replicas.
> >>
> >> Batching is of limited usefulness because we only use
> >> Paxos for CAS I think? So in a batch by definition all but
> >> one will fail the CAS. This is something where a
> >> distinguished coordinator could help by failing the rest
> >> of the contending requests more inexpensively than it
> >> currently does.
> >>
> >>
> >> Ariel
> >>
> >> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
> >>>
> >>>
> >>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg
> >>> > wrote:
> >>>
> >>> Hi,
> >>>
> >>> Classic Paxos doesn't have a leader. There are
> >>> variants on the original Lamport approach that will
> >>> elect a leader (or some other variation like Mencius)
> >>> to improve throughput, latency, and performance under
> >>> contention. Cassandra implements the approach from
> >>> the beginning of "Paxos Made Simple"
> >>> (https://goo.gl/SrP0Wb) with no additional
> >>> optimizations that I am aware of. There is no
> >>> distinguished proposer (leader).
> >>>
> >>> That paper does go on to discuss electing a
> >>> distinguished proposer, but that was never done for
> >>> C*. I believe it's not considered a good fit for C*
> >>> philosophically.
> >>>
> >>> Ariel
> >>>
> >>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>  @Ariel Weisberg EPaxos looks very interesting as it
>  looks like it doesn't need any designated leader for
>  C* but I am assuming the paxos that is implemented
>  today for LWT's requires Leader election and If so,
>  don't we need to have an odd number of nodes 

Re: How does cassandra achieve Linearizability?

2017-02-22 Thread Michael Shuler
I updated the fix version on CASSANDRA-6246 to 4.x. The 3.11.x edit was
a bulk move when removing the cassandra-3.X branch and the 3.x Jira
version. There are likely other new feature tickets that should really
say 4.x.

-- 
Kind regards,
Michael

On 02/22/2017 07:28 PM, Kant Kodali wrote:
> I hope that patch is reviewed as quickly as possible. We use LWT's
> heavily and we are getting a throughput of 600 writes/sec and each write
> is 1KB in our case.
> 
> 
> 
> 
> 
> On Wed, Feb 22, 2017 at 7:48 AM, Edward Capriolo wrote:
> 
> 
> 
> On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg wrote:
> 
> Hi,
> 
> No it's not going to be in 3.11.x. The earliest release it could
> make it into is 4.0.
> 
> Ariel
> 
> On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
>> Hi Ariel,
>>
>> Can we really expect the fix in 3.11.x as the
>> ticket https://issues.apache.org/jira/browse/CASSANDRA-6246 says?
>>
>> Thanks,
>> kant
>>
>> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg
>> > wrote:
>>
>> Hi,
>>
>> That would work and would help a lot with the dueling
>> proposer issue.
>>
>> A lot of the leader election stuff is designed to reduce
>> the number of roundtrips and not just address the dueling
>> proposer issue. Those will have downtime because it's
>> there for correctness. Just adding an affinity for a
>> specific proposer is probably a free lunch.
>>
>> I don't think you can group keys because the Paxos
>> proposals are per partition which is why we get linear
>> scale out for Paxos. I don't believe it's linearizable
>> across multiple partitions. You can use the clustering key
>> and deterministically pick one of the live replicas for
>> that clustering key. Sort the list of replicas by IP, hash
>> the clustering key, use the hash as an index into the list
>> of replicas.
>>
>> Batching is of limited usefulness because we only use
>> Paxos for CAS I think? So in a batch by definition all but
>> one will fail the CAS. This is something where a
>> distinguished coordinator could help by failing the rest
>> of the contending requests more inexpensively than it
>> currently does.
>>
>>
>> Ariel
>>
>> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>>>
>>>
>>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg
>>> > wrote:
>>>
>>> Hi,
>>>
>>> Classic Paxos doesn't have a leader. There are
>>> variants on the original Lamport approach that will
>>> elect a leader (or some other variation like Mencius)
>>> to improve throughput, latency, and performance under
>>> contention. Cassandra implements the approach from
>>> the beginning of "Paxos Made Simple"
>>> (https://goo.gl/SrP0Wb) with no additional
>>> optimizations that I am aware of. There is no
>>> distinguished proposer (leader).
>>>
>>> That paper does go on to discuss electing a
>>> distinguished proposer, but that was never done for
>>> C*. I believe it's not considered a good fit for C*
>>> philosophically.
>>>
>>> Ariel
>>>
>>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
 @Ariel Weisberg EPaxos looks very interesting as it
 looks like it doesn't need any designated leader for
 C* but I am assuming the paxos that is implemented
 today for LWT's requires Leader election and If so,
 don't we need to have an odd number of nodes or
 racks or DC's to satisfy N = 2F + 1 constraint to
 tolerate F failures ? I understand it is not needed
 when not using LWT's since Cassandra is a
 master-less system.

 On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali
 > wrote:

 Thanks Ariel! Yes I knew there are so many
 variations and optimizations of Paxos. I just
 wanted to see if we had any plans on improving
  

Re: How does cassandra achieve Linearizability?

2017-02-22 Thread Kant Kodali
I hope that patch is reviewed as quickly as possible. We use LWTs heavily,
and in our case we are getting a throughput of 600 writes/sec with each
write being 1 KB.





On Wed, Feb 22, 2017 at 7:48 AM, Edward Capriolo wrote:

>
>
> On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg  wrote:
>
>> Hi,
>>
>> No it's not going to be in 3.11.x. The earliest release it could make it
>> into is 4.0.
>>
>> Ariel
>>
>> On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
>>
>> Hi Ariel,
>>
>> Can we really expect the fix in 3.11.x as the ticket
>> https://issues.apache.org/jira/browse/CASSANDRA-6246 says?
>>
>> Thanks,
>> kant
>>
>> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg 
>> wrote:
>>
>>
>> Hi,
>>
>> That would work and would help a lot with the dueling proposer issue.
>>
>> A lot of the leader election stuff is designed to reduce the number of
>> roundtrips and not just address the dueling proposer issue. Those will have
>> downtime because it's there for correctness. Just adding an affinity for a
>> specific proposer is probably a free lunch.
>>
>> I don't think you can group keys because the Paxos proposals are per
>> partition which is why we get linear scale out for Paxos. I don't believe
>> it's linearizable across multiple partitions. You can use the clustering
>> key and deterministically pick one of the live replicas for that clustering
>> key. Sort the list of replicas by IP, hash the clustering key, use the hash
>> as an index into the list of replicas.
>>
>> Batching is of limited usefulness because we only use Paxos for CAS I
>> think? So in a batch by definition all but one will fail the CAS. This is
>> something where a distinguished coordinator could help by failing the rest
>> of the contending requests more inexpensively than it currently does.
>>
>>
>> Ariel
>>
>> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>>
>>
>>
>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg 
>> wrote:
>>
>>
>> Hi,
>>
>> Classic Paxos doesn't have a leader. There are variants on the original
>> Lamport approach that will elect a leader (or some other variation like
>> Mencius) to improve throughput, latency, and performance under contention.
>> Cassandra implements the approach from the beginning of "Paxos Made Simple"
>> (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware
>> of. There is no distinguished proposer (leader).
>>
>> That paper does go on to discuss electing a distinguished proposer, but
>> that was never done for C*. I believe it's not considered a good fit for C*
>> philosophically.
>>
>> Ariel
>>
>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>>
>> @Ariel Weisberg EPaxos looks very interesting as it looks like it doesn't
>> need any designated leader for C* but I am assuming the paxos that is
>> implemented today for LWT's requires Leader election and If so, don't we
>> need to have an odd number of nodes or racks or DC's to satisfy N = 2F + 1
>> constraint to tolerate F failures ? I understand it is not needed when not
>> using LWT's since Cassandra is a master-less system.
>>
>> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali  wrote:
>>
>> Thanks Ariel! Yes I knew there are so many variations and optimizations
>> of Paxos. I just wanted to see if we had any plans on improving the
>> existing Paxos implementation and it is great to see the work is under
>> progress! I am going to follow that ticket and read up the references
>> pointed in it
>>
>>
>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg 
>> wrote:
>>
>>
>> Hi,
>>
>> Cassandra's implementation of Paxos doesn't implement many optimizations
>> that would drastically improve throughput and latency. You need consensus,
>> but it doesn't have to be exorbitantly expensive and fall over under any
>> kind of contention.
>>
>> For instance you could implement EPaxos
>> https://issues.apache.org/jira/browse/CASSANDRA-6246,
>> batch multiple operations into the same Paxos round, have an affinity for a
>> specific proposer for a specific partition, implement asynchronous commit,
>> use a more efficient implementation of the Paxos log, and maybe other
>> things.
>>
>>
>> Ariel
>>
>>
>>
>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>>
>> Hi Kant,
>>
>> If you read the published papers about Paxos, you will most probably
>> recognize that there is no way to "do it better". This is a conceptional
>> thing due to the nature of distributed systems + the CAP theorem.
>> If you want A+P in the triangle, then C is very expensive. CS is made for
>> A+P mostly with tunable C. In ACID databases this is a completely different
>> thing as they are mostly either not partition tolerant, not highly
>> available or not 

Re: How does cassandra achieve Linearizability?

2017-02-22 Thread Edward Capriolo
On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg  wrote:

> Hi,
>
> No it's not going to be in 3.11.x. The earliest release it could make it
> into is 4.0.
>
> Ariel
>
> On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
>
> Hi Ariel,
>
> Can we really expect the fix in 3.11.x as the ticket
> https://issues.apache.org/jira/browse/CASSANDRA-6246 says?
>
> Thanks,
> kant
>
> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg  wrote:
>
>
> Hi,
>
> That would work and would help a lot with the dueling proposer issue.
>
> A lot of the leader election stuff is designed to reduce the number of
> roundtrips and not just address the dueling proposer issue. Those will have
> downtime because it's there for correctness. Just adding an affinity for a
> specific proposer is probably a free lunch.
>
> I don't think you can group keys because the Paxos proposals are per
> partition which is why we get linear scale out for Paxos. I don't believe
> it's linearizable across multiple partitions. You can use the clustering
> key and deterministically pick one of the live replicas for that clustering
> key. Sort the list of replicas by IP, hash the clustering key, use the hash
> as an index into the list of replicas.
>
> Batching is of limited usefulness because we only use Paxos for CAS I
> think? So in a batch by definition all but one will fail the CAS. This is
> something where a distinguished coordinator could help by failing the rest
> of the contending requests more inexpensively than it currently does.
>
>
> Ariel
>
> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>
>
>
> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg  wrote:
>
>
> Hi,
>
> Classic Paxos doesn't have a leader. There are variants on the original
> Lamport approach that will elect a leader (or some other variation like
> Mencius) to improve throughput, latency, and performance under contention.
> Cassandra implements the approach from the beginning of "Paxos Made Simple"
> (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware
> of. There is no distinguished proposer (leader).
>
> That paper does go on to discuss electing a distinguished proposer, but
> that was never done for C*. I believe it's not considered a good fit for C*
> philosophically.
>
> Ariel
>
> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>
> @Ariel Weisberg EPaxos looks very interesting as it looks like it doesn't
> need any designated leader for C* but I am assuming the paxos that is
> implemented today for LWT's requires Leader election and If so, don't we
> need to have an odd number of nodes or racks or DC's to satisfy N = 2F + 1
> constraint to tolerate F failures ? I understand it is not needed when not
> using LWT's since Cassandra is a master-less system.
>
> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali  wrote:
>
> Thanks Ariel! Yes I knew there are so many variations and optimizations of
> Paxos. I just wanted to see if we had any plans on improving the existing
> Paxos implementation and it is great to see the work is under progress! I
> am going to follow that ticket and read up the references pointed in it
>
>
> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg  wrote:
>
>
> Hi,
>
> Cassandra's implementation of Paxos doesn't implement many optimizations
> that would drastically improve throughput and latency. You need consensus,
> but it doesn't have to be exorbitantly expensive and fall over under any
> kind of contention.
>
> For instance you could implement EPaxos
> https://issues.apache.org/jira/browse/CASSANDRA-6246,
> batch multiple operations into the same Paxos round, have an affinity for a
> specific proposer for a specific partition, implement asynchronous commit,
> use a more efficient implementation of the Paxos log, and maybe other
> things.
>
>
> Ariel
>
>
>
> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>
> Hi Kant,
>
> If you read the published papers about Paxos, you will most probably
> recognize that there is no way to "do it better". This is a conceptional
> thing due to the nature of distributed systems + the CAP theorem.
> If you want A+P in the triangle, then C is very expensive. CS is made for
> A+P mostly with tunable C. In ACID databases this is a completely different
> thing as they are mostly either not partition tolerant, not highly
> available or not scalable (in a distributed manner, not speaking of
> "monolithic super servers").
>
> There is no free lunch ...
>
>
> 2017-02-10 11:09 GMT+01:00 Kant Kodali :
>
> "That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra."
>
> yes LWT's are expensive. Are there any plans to make this better?
>
> On Fri, Feb 10, 2017 at 12:17 AM, 

Re: How does cassandra achieve Linearizability?

2017-02-22 Thread Ariel Weisberg
Hi,



No it's not going to be in 3.11.x. The earliest release it could make it
into is 4.0.


Ariel



On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:

> Hi Ariel,

> 

> Can we really expect the fix in 3.11.x as the ticket
> https://issues.apache.org/jira/browse/CASSANDRA-6246 says?
> 

> Thanks,

> kant

> 

> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg
>  wrote:

>> Hi,

>> 

>> That would work and would help a lot with the dueling proposer issue.
>> 

>> A lot of the leader election stuff is designed to reduce the number
>> of roundtrips and not just address the dueling proposer issue. Those
>> will have downtime because it's there for correctness. Just adding an
>> affinity for a specific proposer is probably a free lunch.
>> 

>> I don't think you can group keys because the Paxos proposals are per
>> partition which is why we get linear scale out for Paxos. I don't
>> believe it's linearizable across multiple partitions. You can use the
>> clustering key and deterministically pick one of the live replicas
>> for that clustering key. Sort the list of replicas by IP, hash the
>> clustering key, use the hash as an index into the list of replicas.
>> 

>> Batching is of limited usefulness because we only use Paxos for CAS I
>> think? So in a batch by definition all but one will fail the CAS.
>> This is something where a distinguished coordinator could help by
>> failing the rest of the contending requests more inexpensively than
>> it currently does.
>> 

>> 

>> Ariel

>> 

>> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:

>>> 

>>> 

>>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg 
>>> wrote:

 Hi,

 

 Classic Paxos doesn't have a leader. There are variants on the
 original Lamport approach that will elect a leader (or some other
 variation like Mencius) to improve throughput, latency, and
 performance under contention. Cassandra implements the approach
 from the beginning of "Paxos Made Simple" (https://goo.gl/SrP0Wb)
 with no additional optimizations that I am aware of. There is no
 distinguished proposer (leader).
 

 That paper does  go on to discuss electing a distinguished
 proposer, but that was never done for C*. I believe it's not
 considered a good fit for C* philosophically.
 

 Ariel

 

 On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:

> @Ariel Weisberg EPaxos looks very interesting as it looks like it
> doesn't need any designated leader for C* but I am assuming the
> paxos that is implemented today for LWT's requires Leader election
> and If so, don't we need to have an odd number of nodes or racks
> or DC's to satisfy N = 2F + 1 constraint to tolerate F failures ?
> I understand it is not needed when not using LWT's since Cassandra
> is a master-less system.
> 

> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali 
> wrote:
>> Thanks Ariel! Yes I knew there are so many variations and
>> optimizations of Paxos. I just wanted to see if we had any plans
>> on improving the existing Paxos implementation and it is great to
>> see the work is under progress! I am going to follow that ticket
>> and read up the references pointed in it
>> 

>> 

>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg
>>  wrote:

>>> Hi,

>>> 

>>> Cassandra's implementation of Paxos doesn't implement many
>>> optimizations that would drastically improve throughput and
>>> latency. You need consensus, but it doesn't have to be
>>> exorbitantly expensive and fall over under any kind of
>>> contention.
>>> 

>>> For instance you could implement EPaxos
>>> https://issues.apache.org/jira/browse/CASSANDRA-6246, batch
>>> multiple operations into the same Paxos round, have an affinity
>>> for a specific proposer for a specific partition, implement
>>> asynchronous commit, use a more efficient implementation of the
>>> Paxos log, and maybe other things.
>>> 

>>> 

>>> Ariel

>>> 

>>> 

>>> 

>>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:

 Hi Kant,

 

 If you read the published papers about Paxos, you will most
 probably recognize that there is no way to "do it better". This
 is a conceptional thing due to the nature of distributed
 systems + the CAP theorem.
 If you want A+P in the triangle, then C is very expensive. CS
 is made for A+P mostly with tunable C. In ACID databases this
 is a completely different thing as they are mostly either not
 partition tolerant, not highly available or not scalable (in a
 distributed manner, not speaking of "monolithic super
 servers").
 

 There is no free lunch ...

 

 

 

Re: How does cassandra achieve Linearizability?

2017-02-22 Thread Kant Kodali
Hi Ariel,

Can we really expect the fix in 3.11.x as the ticket
https://issues.apache.org/jira/browse/CASSANDRA-6246 says?

Thanks,
kant

On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg  wrote:

> Hi,
>
> That would work and would help a lot with the dueling proposer issue.
>
> A lot of the leader election stuff is designed to reduce the number of
> roundtrips and not just address the dueling proposer issue. Those will have
> downtime because it's there for correctness. Just adding an affinity for a
> specific proposer is probably a free lunch.
>
> I don't think you can group keys because the Paxos proposals are per
> partition which is why we get linear scale out for Paxos. I don't believe
> it's linearizable across multiple partitions. You can use the clustering
> key and deterministically pick one of the live replicas for that clustering
> key. Sort the list of replicas by IP, hash the clustering key, use the hash
> as an index into the list of replicas.
>
> Batching is of limited usefulness because we only use Paxos for CAS I
> think? So in a batch by definition all but one will fail the CAS. This is
> something where a distinguished coordinator could help by failing the rest
> of the contending requests more inexpensively than it currently does.
>
> Ariel
> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>
>
>
> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg  wrote:
>
>
> Hi,
>
> Classic Paxos doesn't have a leader. There are variants on the original
> Lamport approach that will elect a leader (or some other variation like
> Mencius) to improve throughput, latency, and performance under contention.
> Cassandra implements the approach from the beginning of "Paxos Made Simple"
> (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware
> of. There is no distinguished proposer (leader).
>
> That paper does go on to discuss electing a distinguished proposer, but
> that was never done for C*. I believe it's not considered a good fit for C*
> philosophically.
>
> Ariel
>
> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>
> @Ariel Weisberg EPaxos looks very interesting as it looks like it doesn't
> need any designated leader for C* but I am assuming the paxos that is
> implemented today for LWT's requires Leader election and If so, don't we
> need to have an odd number of nodes or racks or DC's to satisfy N = 2F + 1
> constraint to tolerate F failures ? I understand it is not needed when not
> using LWT's since Cassandra is a master-less system.
>
> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali  wrote:
>
> Thanks Ariel! Yes I knew there are so many variations and optimizations of
> Paxos. I just wanted to see if we had any plans on improving the existing
> Paxos implementation and it is great to see the work is under progress! I
> am going to follow that ticket and read up the references pointed in it
>
>
> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg  wrote:
>
>
> Hi,
>
> Cassandra's implementation of Paxos doesn't implement many optimizations
> that would drastically improve throughput and latency. You need consensus,
> but it doesn't have to be exorbitantly expensive and fall over under any
> kind of contention.
>
> For instance you could implement EPaxos
> https://issues.apache.org/jira/browse/CASSANDRA-6246,
> batch multiple operations into the same Paxos round, have an affinity for a
> specific proposer for a specific partition, implement asynchronous commit,
> use a more efficient implementation of the Paxos log, and maybe other
> things.
>
>
> Ariel
>
>
>
> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>
> Hi Kant,
>
> If you read the published papers about Paxos, you will most probably
> recognize that there is no way to "do it better". This is a conceptional
> thing due to the nature of distributed systems + the CAP theorem.
> If you want A+P in the triangle, then C is very expensive. CS is made for
> A+P mostly with tunable C. In ACID databases this is a completely different
> thing as they are mostly either not partition tolerant, not highly
> available or not scalable (in a distributed manner, not speaking of
> "monolithic super servers").
>
> There is no free lunch ...
>
>
> 2017-02-10 11:09 GMT+01:00 Kant Kodali :
>
> "That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra."
>
> yes LWT's are expensive. Are there any plans to make this better?
>
> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali  wrote:
>
> Hi Jon,
>
> Thanks a lot for your response. I am well aware that the LWW != LWT but I
> was talking more in terms of LWW with respective to LWT's which I believe
> you answered. so thanks much!
>
>
> kant
>
>
> On Thu, Feb 9, 

Re: How does cassandra achieve Linearizability?

2017-02-16 Thread Ariel Weisberg
Hi,



That would work and would help a lot with the dueling proposer issue.



A lot of the leader election stuff is designed to reduce the number of
roundtrips and not just address the dueling proposer issue. Those will
have downtime because it's there for correctness. Just adding an
affinity for a specific proposer is probably a free lunch.


I don't think you can group keys because the Paxos proposals are per
partition which is why we get linear scale out for Paxos. I don't
believe it's linearizable across multiple partitions. You can use the
clustering key and deterministically pick one of the live replicas for
that clustering key. Sort the list of replicas by IP, hash the
clustering key, use the hash as an index into the list of replicas.
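
For illustration, a minimal client-side sketch of that proposer-affinity
idea (the helper below is hypothetical, not a Cassandra or driver API):
sort the live replicas, hash the key, and index into the sorted list so
that contending clients funnel their LWTs through the same node.

    import java.net.InetAddress;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;

    public final class ProposerAffinity {
        // Deterministically pick one live replica to coordinate an LWT so that
        // contending clients tend to propose through the same node.
        static InetAddress pickProposer(List<InetAddress> liveReplicas, byte[] keyBytes) {
            List<InetAddress> sorted = new ArrayList<>(liveReplicas);
            // Sort by IP so every client sees the replicas in the same order.
            sorted.sort(Comparator.comparing(InetAddress::getHostAddress));
            // Hash the key and map the hash onto the sorted list.
            int index = Math.floorMod(Arrays.hashCode(keyBytes), sorted.size());
            return sorted.get(index);
        }
    }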


Batching is of limited usefulness because we only use Paxos for CAS I
think? So in a batch by definition all but one will fail the CAS. This
is something where a distinguished coordinator could help by failing
the rest of the contending requests more inexpensively than it
currently does.


Ariel

On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:

> 

> 

> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg
>  wrote:

>> Hi,

>> 

>> Classic Paxos doesn't have a leader. There are variants on the
>> original Lamport approach that will elect a leader (or some other
>> variation like Mencius) to improve throughput, latency, and
>> performance under contention. Cassandra implements the approach from
>> the beginning of "Paxos Made Simple" (https://goo.gl/SrP0Wb) with no
>> additional optimizations that I am aware of. There is no
>> distinguished proposer (leader).
>> 

>> That paper does  go on to discuss electing a distinguished proposer,
>> but that was never done for C*. I believe it's not considered a good
>> fit for C* philosophically.
>> 

>> Ariel

>> 

>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:

>>> @Ariel Weisberg EPaxos looks very interesting as it looks like it
>>> doesn't need any designated leader for C* but I am assuming the
>>> paxos that is implemented today for LWT's requires Leader election
>>> and If so, don't we need to have an odd number of nodes or racks or
>>> DC's to satisfy N = 2F + 1 constraint to tolerate F failures ? I
>>> understand it is not needed when not using LWT's since Cassandra is
>>> a master-less system.
>>> 

>>> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali 
>>> wrote:
 Thanks Ariel! Yes I knew there are so many variations and
 optimizations of Paxos. I just wanted to see if we had any plans on
 improving the existing Paxos implementation and it is great to see
 the work is under progress! I am going to follow that ticket and
 read up the references pointed in it
 

 

 On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg 
 wrote:

> Hi,

> 

> Cassandra's implementation of Paxos doesn't implement many
> optimizations that would drastically improve throughput and
> latency. You need consensus, but it doesn't have to be
> exorbitantly expensive and fall over under any kind of contention.
> 

> For instance you could implement EPaxos
> https://issues.apache.org/jira/browse/CASSANDRA-6246, batch
> multiple operations into the same Paxos round, have an affinity
> for a specific proposer for a specific partition, implement
> asynchronous commit, use a more efficient implementation of the
> Paxos log, and maybe other things.
> 

> 

> Ariel

> 

> 

> 

> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:

>> Hi Kant,

>> 

>> If you read the published papers about Paxos, you will most
>> probably recognize that there is no way to "do it better". This
>> is a conceptional thing due to the nature of distributed systems
>> + the CAP theorem.
>> If you want A+P in the triangle, then C is very expensive. CS is
>> made for A+P mostly with tunable C. In ACID databases this is a
>> completely different thing as they are mostly either not
>> partition tolerant, not highly available or not scalable (in a
>> distributed manner, not speaking of "monolithic super servers").
>> 

>> There is no free lunch ...

>> 

>> 

>> 2017-02-10 11:09 GMT+01:00 Kant Kodali :

>>> "That’s the safety blanket everyone wants but is extremely
>>> expensive, especially in Cassandra."
>>> 

>>> yes LWT's are expensive. Are there any plans to make this
>>> better?
>>> 

>>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali
>>>  wrote:
 Hi Jon,

 

 Thanks a lot for your response. I am well aware that the LWW !=
 LWT but I was talking more in terms of LWW with respective to
 LWT's which I believe you answered. so thanks much!
 

 

 kant

 

 

 

Re: How does cassandra achieve Linearizability?

2017-02-16 Thread Edward Capriolo
On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg  wrote:

> Hi,
>
> Classic Paxos doesn't have a leader. There are variants on the original
> Lamport approach that will elect a leader (or some other variation like
> Mencius) to improve throughput, latency, and performance under contention.
> Cassandra implements the approach from the beginning of "Paxos Made Simple"
> (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware
> of. There is no distinguished proposer (leader).
>
> That paper does go on to discuss electing a distinguished proposer, but
> that was never done for C*. I believe it's not considered a good fit for C*
> philosophically.
>
> Ariel
>
> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>
> @Ariel Weisberg EPaxos looks very interesting as it looks like it doesn't
> need any designated leader for C* but I am assuming the paxos that is
> implemented today for LWT's requires Leader election and If so, don't we
> need to have an odd number of nodes or racks or DC's to satisfy N = 2F + 1
> constraint to tolerate F failures ? I understand it is not needed when not
> using LWT's since Cassandra is a master-less system.
>
> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali  wrote:
>
> Thanks Ariel! Yes I knew there are so many variations and optimizations of
> Paxos. I just wanted to see if we had any plans on improving the existing
> Paxos implementation and it is great to see the work is under progress! I
> am going to follow that ticket and read up the references pointed in it
>
>
> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg  wrote:
>
>
> Hi,
>
> Cassandra's implementation of Paxos doesn't implement many optimizations
> that would drastically improve throughput and latency. You need consensus,
> but it doesn't have to be exorbitantly expensive and fall over under any
> kind of contention.
>
> For instance you could implement EPaxos
> https://issues.apache.org/jira/browse/CASSANDRA-6246,
> batch multiple operations into the same Paxos round, have an affinity for a
> specific proposer for a specific partition, implement asynchronous commit,
> use a more efficient implementation of the Paxos log, and maybe other
> things.
>
>
> Ariel
>
>
>
> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>
> Hi Kant,
>
> If you read the published papers about Paxos, you will most probably
> recognize that there is no way to "do it better". This is a conceptional
> thing due to the nature of distributed systems + the CAP theorem.
> If you want A+P in the triangle, then C is very expensive. CS is made for
> A+P mostly with tunable C. In ACID databases this is a completely different
> thing as they are mostly either not partition tolerant, not highly
> available or not scalable (in a distributed manner, not speaking of
> "monolithic super servers").
>
> There is no free lunch ...
>
>
> 2017-02-10 11:09 GMT+01:00 Kant Kodali :
>
> "That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra."
>
> yes LWT's are expensive. Are there any plans to make this better?
>
> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali  wrote:
>
> Hi Jon,
>
> Thanks a lot for your response. I am well aware that the LWW != LWT but I
> was talking more in terms of LWW with respective to LWT's which I believe
> you answered. so thanks much!
>
>
> kant
>
>
> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad wrote:
>
> LWT != Last Write Wins.  They are totally different.
>
> LWTs give you (assuming you also read at SERIAL) “atomic consistency”,
> meaning you are able to perform operations atomically and in isolation.
> That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra.  The lightweight part, btw, may be a little
> optimistic, especially if a key is under contention.  With regard to the
> “last write” part you’re asking about - w/ LWT Cassandra provides the
> timestamp and manages it as part of the ballot, and it always is
> increasing.  See 
> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
> From the code:
>
>  * Returns a timestamp suitable for paxos given the timestamp of the last
> known commit (or in progress update).
>  * Paxos ensures that the timestamp it uses for commits respects the
> serial order of those commits. It does so
>  * by having each replica reject any proposal whose timestamp is not
> strictly greater than the last proposal it
>  * accepted. So in practice, which timestamp we use for a given proposal
> doesn't affect correctness but it does
>  * affect the chance of making progress (if we pick a timestamp lower than
> what has been proposed before, our
>  * new proposal will just get rejected).
>
> Effectively paxos removes the ability to use custom timestamps and
> addresses clock variance by 

Re: How does cassandra achieve Linearizability?

2017-02-16 Thread Ariel Weisberg
Hi,



Classic Paxos doesn't have a leader. There are variants on the
original Lamport approach that will elect a leader (or some other
variation like Mencius) to improve throughput, latency, and
performance under contention. Cassandra implements the approach from
the beginning of "Paxos Made Simple" (https://goo.gl/SrP0Wb) with no
additional optimizations that I am aware of. There is no distinguished
proposer (leader).


That paper does  go on to discuss electing a distinguished proposer, but
that was never done for C*. I believe it's not considered a good fit for
C* philosophically.
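
As a rough illustration of the acceptor side of that single-decree
protocol (a simplified sketch, not Cassandra's actual classes; the real
state handling lives around org.apache.cassandra.service.paxos and is
persisted, expired, etc.):

    // Simplified single-decree Paxos acceptor, per "Paxos Made Simple".
    final class PaxosAcceptor {
        private long promisedBallot = -1;   // highest ballot we have promised
        private long acceptedBallot = -1;   // ballot of the last accepted value
        private Object acceptedValue = null;

        // Phase 1 (prepare): promise not to accept anything below this ballot.
        synchronized boolean prepare(long ballot) {
            if (ballot > promisedBallot) {
                promisedBallot = ballot;
                return true;   // a real promise would also return acceptedBallot/Value
            }
            return false;      // reject: a higher ballot was already promised
        }

        // Phase 2 (accept): only accept proposals at or above the promised ballot.
        synchronized boolean accept(long ballot, Object value) {
            if (ballot >= promisedBallot) {
                promisedBallot = ballot;
                acceptedBallot = ballot;
                acceptedValue = value;
                return true;
            }
            return false;
        }
    }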


Ariel



On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:

> @Ariel Weisberg EPaxos looks very interesting as it looks like it
> doesn't need any designated leader for C* but I am assuming the paxos
> that is implemented today for LWT's requires Leader election and If
> so, don't we need to have an odd number of nodes or racks or DC's to
> satisfy N = 2F + 1 constraint to tolerate F failures ? I understand
> it is not needed when not using LWT's since Cassandra is a master-
> less system.
> 

> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali
>  wrote:
>> Thanks Ariel! Yes I knew there are so many variations and
>> optimizations of Paxos. I just wanted to see if we had any plans on
>> improving the existing Paxos implementation and it is great to see
>> the work is under progress! I am going to follow that ticket and read
>> up the references pointed in it
>> 

>> 

>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg
>>  wrote:

>>> Hi,

>>> 

>>> Cassandra's implementation of Paxos doesn't implement many
>>> optimizations that would drastically improve throughput and latency.
>>> You need consensus, but it doesn't have to be exorbitantly expensive
>>> and fall over under any kind of contention.
>>> 

>>> For instance you could implement EPaxos
>>> https://issues.apache.org/jira/browse/CASSANDRA-6246, batch
>>> multiple operations into the same Paxos round, have an affinity for
>>> a specific proposer for a specific partition, implement asynchronous
>>> commit, use a more efficient implementation of the Paxos log, and
>>> maybe other things.
>>> 

>>> 

>>> Ariel

>>> 

>>> 

>>> 

>>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:

 Hi Kant,

 

 If you read the published papers about Paxos, you will most
 probably recognize that there is no way to "do it better". This is
 a conceptional thing due to the nature of distributed systems + the
 CAP theorem.
 If you want A+P in the triangle, then C is very expensive. CS is
 made for A+P mostly with tunable C. In ACID databases this is a
 completely different thing as they are mostly either not partition
 tolerant, not highly available or not scalable (in a distributed
 manner, not speaking of "monolithic super servers").
 

 There is no free lunch ...

 

 

 2017-02-10 11:09 GMT+01:00 Kant Kodali :

> "That’s the safety blanket everyone wants but is extremely
> expensive, especially in Cassandra."
> 

> yes LWT's are expensive. Are there any plans to make this better?
> 

> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali 
> wrote:
>> Hi Jon,

>> 

>> Thanks a lot for your response. I am well aware that the LWW !=
>> LWT but I was talking more in terms of LWW with respective to
>> LWT's which I believe you answered. so thanks much!
>> 

>> 

>> kant

>> 

>> 

>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad
>>  wrote:
>>> LWT != Last Write Wins.  They are totally different.  

>>> 

>>> LWTs give you (assuming you also read at SERIAL) “atomic
>>> consistency”, meaning you are able to perform operations
>>> atomically and in isolation.  That’s the safety blanket everyone
>>> wants but is extremely expensive, especially in Cassandra.  The
>>> lightweight part, btw, may be a little optimistic, especially if
>>> a key is under contention.  With regard to the “last write” part
>>> you’re asking about - w/ LWT Cassandra provides the timestamp
>>> and manages it as part of the ballot, and it always is
>>> increasing.  See
>>> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
>>> From the code:
>>> 

>>>  * Returns a timestamp suitable for paxos given the timestamp of
>>>the last known commit (or in progress update).
>>>  * Paxos ensures that the timestamp it uses for commits respects
>>>the serial order of those commits. It does so
>>>  * by having each replica reject any proposal whose timestamp is
>>>not strictly greater than the last proposal it
>>>  * accepted. So in practice, which timestamp we use for a given
>>>proposal doesn't affect correctness but it does
>>>  * affect the chance of making progress 

Re: How does cassandra achieve Linearizability?

2017-02-16 Thread Kant Kodali
@Ariel Weisberg EPaxos looks very interesting, since it doesn't seem to
need any designated leader for C*. But I am assuming the Paxos that is
implemented today for LWTs requires leader election, and if so, don't we
need an odd number of nodes, racks, or DCs to satisfy the N = 2F + 1
constraint to tolerate F failures? I understand this is not needed when not
using LWTs, since Cassandra is a masterless system.
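
For reference, the N = 2F + 1 constraint is just majority arithmetic over
the replicas of a partition; a small illustrative snippet (plain math, not
a Cassandra API) of how many replicas can be down while a Paxos majority
is still reachable:

    public final class PaxosQuorumMath {
        // Majority quorum for a replication factor: floor(rf / 2) + 1.
        static int quorum(int rf) {
            return rf / 2 + 1;
        }

        // Replicas that can be down while a majority is still reachable.
        static int tolerableFailures(int rf) {
            return rf - quorum(rf);
        }

        public static void main(String[] args) {
            for (int rf = 3; rf <= 6; rf++)
                System.out.printf("RF=%d quorum=%d tolerates %d down%n",
                                  rf, quorum(rf), tolerableFailures(rf));
            // RF=3 quorum=2 tolerates 1 down
            // RF=4 quorum=3 tolerates 1 down (an even RF buys no extra tolerance)
            // RF=5 quorum=3 tolerates 2 down
            // RF=6 quorum=4 tolerates 2 down
        }
    }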

On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali  wrote:

> Thanks Ariel! Yes I knew there are so many variations and optimizations of
> Paxos. I just wanted to see if we had any plans on improving the existing
> Paxos implementation and it is great to see the work is under progress! I
> am going to follow that ticket and read up the references pointed in it
>
>
> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg  wrote:
>
>> Hi,
>>
>> Cassandra's implementation of Paxos doesn't implement many optimizations
>> that would drastically improve throughput and latency. You need consensus,
>> but it doesn't have to be exorbitantly expensive and fall over under any
>> kind of contention.
>>
>> For instance you could implement EPaxos
>> https://issues.apache.org/jira/browse/CASSANDRA-6246,
>> batch multiple operations into the same Paxos round, have an affinity for a
>> specific proposer for a specific partition, implement asynchronous commit,
>> use a more efficient implementation of the Paxos log, and maybe other
>> things.
>>
>> Ariel
>>
>>
>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>>
>> Hi Kant,
>>
>> If you read the published papers about Paxos, you will most probably
>> recognize that there is no way to "do it better". This is a conceptional
>> thing due to the nature of distributed systems + the CAP theorem.
>> If you want A+P in the triangle, then C is very expensive. CS is made for
>> A+P mostly with tunable C. In ACID databases this is a completely different
>> thing as they are mostly either not partition tolerant, not highly
>> available or not scalable (in a distributed manner, not speaking of
>> "monolithic super servers").
>>
>> There is no free lunch ...
>>
>>
>> 2017-02-10 11:09 GMT+01:00 Kant Kodali :
>>
>> "That’s the safety blanket everyone wants but is extremely expensive,
>> especially in Cassandra."
>>
>> yes LWT's are expensive. Are there any plans to make this better?
>>
>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali  wrote:
>>
>> Hi Jon,
>>
>> Thanks a lot for your response. I am well aware that the LWW != LWT but I
>> was talking more in terms of LWW with respective to LWT's which I believe
>> you answered. so thanks much!
>>
>>
>> kant
>>
>>
>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad wrote:
>>
>> LWT != Last Write Wins.  They are totally different.
>>
>> LWTs give you (assuming you also read at SERIAL) “atomic consistency”,
>> meaning you are able to perform operations atomically and in isolation.
>> That’s the safety blanket everyone wants but is extremely expensive,
>> especially in Cassandra.  The lightweight part, btw, may be a little
>> optimistic, especially if a key is under contention.  With regard to the
>> “last write” part you’re asking about - w/ LWT Cassandra provides the
>> timestamp and manages it as part of the ballot, and it always is
>> increasing.  See 
>> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
>> From the code:
>>
>>  * Returns a timestamp suitable for paxos given the timestamp of the last
>> known commit (or in progress update).
>>  * Paxos ensures that the timestamp it uses for commits respects the
>> serial order of those commits. It does so
>>  * by having each replica reject any proposal whose timestamp is not
>> strictly greater than the last proposal it
>>  * accepted. So in practice, which timestamp we use for a given proposal
>> doesn't affect correctness but it does
>>  * affect the chance of making progress (if we pick a timestamp lower
>> than what has been proposed before, our
>>  * new proposal will just get rejected).
>>
>> Effectively paxos removes the ability to use custom timestamps and
>> addresses clock variance by rejecting ballots with timestamps less than
>> what was last seen.  You can learn more by reading through the other
>> comments and code in that file.
>>
>> Last write wins is a free for all that guarantees you *nothing* except
>> the timestamp is used as a tiebreaker.  Here we acknowledge things like the
>> speed of light as being a real problem that isn’t going away anytime soon.
>> This problem is sometimes addressed with event sourcing rather than
>> mutating in place.
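
As a usage sketch of that "conditional write, then read at SERIAL" pattern
(using the DataStax Java driver; the keyspace and table names here are made up):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class LwtExample {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("demo_ks")) {

                // Conditional write: goes through the Paxos path (prepare/propose/commit).
                ResultSet rs = session.execute(
                        "INSERT INTO users (id, email) VALUES (42, 'a@b.c') IF NOT EXISTS");
                boolean applied = rs.wasApplied();   // false if the row already existed

                // Read at SERIAL so any in-flight Paxos round for this partition is
                // observed (or completed) before the result is returned.
                SimpleStatement read =
                        new SimpleStatement("SELECT email FROM users WHERE id = 42");
                read.setConsistencyLevel(ConsistencyLevel.SERIAL);
                System.out.println(applied + " -> " + session.execute(read).one());
            }
        }
    }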
>>
>> Hope this helps.
>>
>>
>> Jon
>>
>>
>>
>>
>> On Feb 9, 2017, at 5:21 PM, Kant Kodali  wrote:
>>
>> @Justin I read this article http://www.datastax.com/dev/bl
>> 

Re: How does cassandra achieve Linearizability?

2017-02-10 Thread Kant Kodali
Thanks Ariel! Yes, I knew there are many variations and optimizations of
Paxos. I just wanted to see if we had any plans for improving the existing
Paxos implementation, and it is great to see that the work is in progress! I
am going to follow that ticket and read up on the references pointed to in it.


On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg  wrote:

> Hi,
>
> Cassandra's implementation of Paxos doesn't implement many optimizations
> that would drastically improve throughput and latency. You need consensus,
> but it doesn't have to be exorbitantly expensive and fall over under any
> kind of contention.
>
> For instance you could implement EPaxos
> https://issues.apache.org/jira/browse/CASSANDRA-6246,
> batch multiple operations into the same Paxos round, have an affinity for a
> specific proposer for a specific partition, implement asynchronous commit,
> use a more efficient implementation of the Paxos log, and maybe other
> things.
>
> Ariel
>
>
> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>
> Hi Kant,
>
> If you read the published papers about Paxos, you will most probably
> recognize that there is no way to "do it better". This is a conceptional
> thing due to the nature of distributed systems + the CAP theorem.
> If you want A+P in the triangle, then C is very expensive. CS is made for
> A+P mostly with tunable C. In ACID databases this is a completely different
> thing as they are mostly either not partition tolerant, not highly
> available or not scalable (in a distributed manner, not speaking of
> "monolithic super servers").
>
> There is no free lunch ...
>
>
> 2017-02-10 11:09 GMT+01:00 Kant Kodali :
>
> "That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra."
>
> yes LWT's are expensive. Are there any plans to make this better?
>
> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali  wrote:
>
> Hi Jon,
>
> Thanks a lot for your response. I am well aware that the LWW != LWT but I
> was talking more in terms of LWW with respective to LWT's which I believe
> you answered. so thanks much!
>
>
> kant
>
>
> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad wrote:
>
> LWT != Last Write Wins.  They are totally different.
>
> LWTs give you (assuming you also read at SERIAL) “atomic consistency”,
> meaning you are able to perform operations atomically and in isolation.
> That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra.  The lightweight part, btw, may be a little
> optimistic, especially if a key is under contention.  With regard to the
> “last write” part you’re asking about - w/ LWT Cassandra provides the
> timestamp and manages it as part of the ballot, and it always is
> increasing.  See 
> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
> From the code:
>
>  * Returns a timestamp suitable for paxos given the timestamp of the last
> known commit (or in progress update).
>  * Paxos ensures that the timestamp it uses for commits respects the
> serial order of those commits. It does so
>  * by having each replica reject any proposal whose timestamp is not
> strictly greater than the last proposal it
>  * accepted. So in practice, which timestamp we use for a given proposal
> doesn't affect correctness but it does
>  * affect the chance of making progress (if we pick a timestamp lower than
> what has been proposed before, our
>  * new proposal will just get rejected).
>
> Effectively paxos removes the ability to use custom timestamps and
> addresses clock variance by rejecting ballots with timestamps less than
> what was last seen.  You can learn more by reading through the other
> comments and code in that file.
>
> Last write wins is a free for all that guarantees you *nothing* except the
> timestamp is used as a tiebreaker.  Here we acknowledge things like the
> speed of light as being a real problem that isn’t going away anytime soon.
> This problem is sometimes addressed with event sourcing rather than
> mutating in place.
>
> Hope this helps.
>
>
> Jon
>
>
>
>
> On Feb 9, 2017, at 5:21 PM, Kant Kodali  wrote:
>
> @Justin I read this article
> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0. And it clearly says
> Linearizable consistency can be achieved with LWT's.  so should I assume
> the Linearizability in the context of the above article is possible with
> LWT's and synchronization of clocks through ntpd ? because LWT's also
> follow Last Write Wins. isn't it? Also another question does most of the
> production clusters do setup ntpd? If so what is the time it takes to sync?
> any idea
>
> @Micheal Schuler Are you referring to  something like true time as in
> https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
> Actually I never 

Re: How does cassandra achieve Linearizability?

2017-02-10 Thread Ariel Weisberg
Hi,



Cassandra's implementation of Paxos doesn't implement many optimizations
that would drastically improve throughput and latency. You need
consensus, but it doesn't have to be exorbitantly expensive and fall
over under any kind of contention.


For instance you could implement EPaxos
https://issues.apache.org/jira/browse/CASSANDRA-6246, batch multiple
operations into the same Paxos round, have an affinity for a specific
proposer for a specific partition, implement asynchronous commit, use a
more efficient implementation of the Paxos log, and maybe other things.


Ariel





On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:

> Hi Kant,

> 

> If you read the published papers about Paxos, you will most probably
> recognize that there is no way to "do it better". This is a
> conceptional thing due to the nature of distributed systems + the CAP
> theorem.
> If you want A+P in the triangle, then C is very expensive. CS is made
> for A+P mostly with tunable C. In ACID databases this is a completely
> different thing as they are mostly either not partition tolerant, not
> highly available or not scalable (in a distributed manner, not
> speaking of "monolithic super servers").
> 

> There is no free lunch ...

> 

> 

> 2017-02-10 11:09 GMT+01:00 Kant Kodali :

>> "That’s the safety blanket everyone wants but is extremely expensive,
>> especially in Cassandra."
>> 

>> yes LWT's are expensive. Are there any plans to make this better? 

>> 

>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali
>>  wrote:
>>> Hi Jon,

>>> 

>>> Thanks a lot for your response. I am well aware that the LWW != LWT
>>> but I was talking more in terms of LWW with respective to LWT's
>>> which I believe you answered. so thanks much!
>>> 

>>> 

>>> kant

>>> 

>>> 

>>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad
>>>  wrote:
 LWT != Last Write Wins.  They are totally different.  

 

 LWTs give you (assuming you also read at SERIAL) “atomic
 consistency”, meaning you are able to perform operations atomically
 and in isolation.  That’s the safety blanket everyone wants but is
 extremely expensive, especially in Cassandra.  The lightweight
 part, btw, may be a little optimistic, especially if a key is under
 contention.  With regard to the “last write” part you’re asking
 about - w/ LWT Cassandra provides the timestamp and manages it as
 part of the ballot, and it always is increasing.  See
 org.apache.cassandra.service.ClientState#getTimestampForPaxos.
 From the code:
 

  * Returns a timestamp suitable for paxos given the timestamp of
the last known commit (or in progress update).
  * Paxos ensures that the timestamp it uses for commits respects
the serial order of those commits. It does so
  * by having each replica reject any proposal whose timestamp is
not strictly greater than the last proposal it
  * accepted. So in practice, which timestamp we use for a given
proposal doesn't affect correctness but it does
  * affect the chance of making progress (if we pick a timestamp
lower than what has been proposed before, our
  * new proposal will just get rejected).

 

 Effectively paxos removes the ability to use custom timestamps and
 addresses clock variance by rejecting ballots with timestamps less
 than what was last seen.  You can learn more by reading through the
 other comments and code in that file.
 

 Last write wins is a free for all that guarantees you *nothing*
 except the timestamp is used as a tiebreaker.  Here we acknowledge
 things like the speed of light as being a real problem that isn’t
 going away anytime soon.  This problem is sometimes addressed with
 event sourcing rather than mutating in place.
 

 Hope this helps.

 

 

 Jon

 

 

 

 

> On Feb 9, 2017, at 5:21 PM, Kant Kodali  wrote:
> 

> @Justin I read this article
> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.
> And it clearly says Linearizable consistency can be achieved with
> LWT's.  so should I assume the Linearizability in the context of
> the above article is possible with LWT's and synchronization of
> clocks through ntpd ? because LWT's also follow Last Write Wins.
> isn't it? Also another question does most of the production
> clusters do setup ntpd? If so what is the time it takes to sync?
> any idea
> 

> @Micheal Schuler Are you referring to  something like true time as
> in
> https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
> Actually I never heard of setting up GPS modules and how that can
> be helpful. Let me research on that but good point.
> 

> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler
> 

Re: How does cassandra achieve Linearizability?

2017-02-10 Thread Benjamin Roth
Hi Kant,

If you read the published papers about Paxos, you will most probably
recognize that there is no way to "do it better". This is a conceptual
limitation due to the nature of distributed systems + the CAP theorem.
If you want A+P in the triangle, then C is very expensive. C* is built for
A+P, mostly with tunable C. In ACID databases this is a completely different
matter, as they are mostly either not partition tolerant, not highly
available, or not scalable (in a distributed manner, not speaking of
"monolithic super servers").

There is no free lunch ...


2017-02-10 11:09 GMT+01:00 Kant Kodali :

> "That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra."
>
> yes LWT's are expensive. Are there any plans to make this better?
>
> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali  wrote:
>
>> Hi Jon,
>>
>> Thanks a lot for your response. I am well aware that the LWW != LWT but I
>> was talking more in terms of LWW with respective to LWT's which I believe
>> you answered. so thanks much!
>>
>> kant
>>
>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad wrote:
>>
>>> LWT != Last Write Wins.  They are totally different.
>>>
>>> LWTs give you (assuming you also read at SERIAL) “atomic consistency”,
>>> meaning you are able to perform operations atomically and in isolation.
>>> That’s the safety blanket everyone wants but is extremely expensive,
>>> especially in Cassandra.  The lightweight part, btw, may be a little
>>> optimistic, especially if a key is under contention.  With regard to the
>>> “last write” part you’re asking about - w/ LWT Cassandra provides the
>>> timestamp and manages it as part of the ballot, and it always is
>>> increasing.  See org.apache.cassandra.service.ClientState#getTimestampForPaxos.
>>> From the code:
>>>
>>>  * Returns a timestamp suitable for paxos given the timestamp of the
>>> last known commit (or in progress update).
>>>  * Paxos ensures that the timestamp it uses for commits respects the
>>> serial order of those commits. It does so
>>>  * by having each replica reject any proposal whose timestamp is not
>>> strictly greater than the last proposal it
>>>  * accepted. So in practice, which timestamp we use for a given proposal
>>> doesn't affect correctness but it does
>>>  * affect the chance of making progress (if we pick a timestamp lower
>>> than what has been proposed before, our
>>>  * new proposal will just get rejected).
>>>
>>> Effectively paxos removes the ability to use custom timestamps and
>>> addresses clock variance by rejecting ballots with timestamps less than
>>> what was last seen.  You can learn more by reading through the other
>>> comments and code in that file.
>>>
>>> Last write wins is a free for all that guarantees you *nothing* except
>>> the timestamp is used as a tiebreaker.  Here we acknowledge things like the
>>> speed of light as being a real problem that isn’t going away anytime soon.
>>> This problem is sometimes addressed with event sourcing rather than
>>> mutating in place.
>>>
>>> Hope this helps.
>>>
>>> Jon
>>>
>>>
>>> On Feb 9, 2017, at 5:21 PM, Kant Kodali  wrote:
>>>
>>> @Justin I read this article http://www.datastax.com/dev/bl
>>> og/lightweight-transactions-in-cassandra-2-0. And it clearly says
>>> Linearizable consistency can be achieved with LWT's.  so should I assume
>>> the Linearizability in the context of the above article is possible
>>> with LWT's and synchronization of clocks through ntpd ? because LWT's also
>>> follow Last Write Wins. isn't it? Also another question does most of the
>>> production clusters do setup ntpd? If so what is the time it takes to sync?
>>> any idea
>>>
>>> @Michael Shuler Are you referring to something like true time as in
>>> https://static.googleusercontent.com/media/research.google.c
>>> om/en//archive/spanner-osdi2012.pdf?  Actually I never heard of setting
>>> up GPS modules and how that can be helpful. Let me research on that but
>>> good point.
>>>
>>> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler 
>>> wrote:
>>>
 If you require the best precision you can get, setting up a pair of
 stratum 1 ntpd masters in each data center location with GPS modules
 is not terribly complex. Low latency and jitter on servers you manage.
 140ms is a long way away network-wise, and I would suggest that was a
 poor choice of upstream (probably stratum 2 or 3) source.

 As Jonathan mentioned, there's no guarantee from Cassandra, but if you
 need as close as you can get, you'll probably need to do it yourself.

 (I run several stratum 2 ntpd servers for pool.ntp.org)

 --
 Kind regards,
 Michael

 On 02/09/2017 06:47 PM, Kant Kodali wrote:
 > Hi Justin,
 >
 > There are bunch of issues w.r.t to synchronization of clocks when we
 > used ntpd. Also the time it took to sync the clocks 

Re: How does cassandra achieve Linearizability?

2017-02-10 Thread Kant Kodali
"That’s the safety blanket everyone wants but is extremely expensive,
especially in Cassandra."

yes LWT's are expensive. Are there any plans to make this better?

On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali  wrote:

> Hi Jon,
>
> Thanks a lot for your response. I am well aware that the LWW != LWT but I
> was talking more in terms of LWW with respect to LWT's which I believe
> you answered. so thanks much!
>
> kant
>
> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad 
> wrote:
>
>> LWT != Last Write Wins.  They are totally different.
>>
>> LWTs give you (assuming you also read at SERIAL) “atomic consistency”,
>> meaning you are able to perform operations atomically and in isolation.
>> That’s the safety blanket everyone wants but is extremely expensive,
>> especially in Cassandra.  The lightweight part, btw, may be a little
>> optimistic, especially if a key is under contention.  With regard to the
>> “last write” part you’re asking about - w/ LWT Cassandra provides the
>> timestamp and manages it as part of the ballot, and it always is
>> increasing.  See 
>> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
>> From the code:
>>
>>  * Returns a timestamp suitable for paxos given the timestamp of the last
>> known commit (or in progress update).
>>  * Paxos ensures that the timestamp it uses for commits respects the
>> serial order of those commits. It does so
>>  * by having each replica reject any proposal whose timestamp is not
>> strictly greater than the last proposal it
>>  * accepted. So in practice, which timestamp we use for a given proposal
>> doesn't affect correctness but it does
>>  * affect the chance of making progress (if we pick a timestamp lower
>> than what has been proposed before, our
>>  * new proposal will just get rejected).
>>
>> Effectively paxos removes the ability to use custom timestamps and
>> addresses clock variance by rejecting ballots with timestamps less than
>> what was last seen.  You can learn more by reading through the other
>> comments and code in that file.
>>
>> Last write wins is a free for all that guarantees you *nothing* except
>> the timestamp is used as a tiebreaker.  Here we acknowledge things like the
>> speed of light as being a real problem that isn’t going away anytime soon.
>> This problem is sometimes addressed with event sourcing rather than
>> mutating in place.
>>
>> Hope this helps.
>>
>> Jon
>>
>>
>> On Feb 9, 2017, at 5:21 PM, Kant Kodali  wrote:
>>
>> @Justin I read this article http://www.datastax.com/dev/bl
>> og/lightweight-transactions-in-cassandra-2-0. And it clearly says
>> Linearizable consistency can be achieved with LWT's.  so should I assume
>> the Linearizability in the context of the above article is possible with
>> LWT's and synchronization of clocks through ntpd ? because LWT's also
>> follow Last Write Wins. isn't it? Also another question does most of the
>> production clusters do setup ntpd? If so what is the time it takes to sync?
>> any idea
>>
>> @Michael Shuler Are you referring to something like true time as in
>> https://static.googleusercontent.com/media/research.google.
>> com/en//archive/spanner-osdi2012.pdf?  Actually I never heard of setting
>> up GPS modules and how that can be helpful. Let me research on that but
>> good point.
>>
>> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler 
>> wrote:
>>
>>> If you require the best precision you can get, setting up a pair of
>>> stratum 1 ntpd masters in each data center location with GPS modules
>>> is not terribly complex. Low latency and jitter on servers you manage.
>>> 140ms is a long way away network-wise, and I would suggest that was a
>>> poor choice of upstream (probably stratum 2 or 3) source.
>>>
>>> As Jonathan mentioned, there's no guarantee from Cassandra, but if you
>>> need as close as you can get, you'll probably need to do it yourself.
>>>
>>> (I run several stratum 2 ntpd servers for pool.ntp.org)
>>>
>>> --
>>> Kind regards,
>>> Michael
>>>
>>> On 02/09/2017 06:47 PM, Kant Kodali wrote:
>>> > Hi Justin,
>>> >
>>> > There are bunch of issues w.r.t to synchronization of clocks when we
>>> > used ntpd. Also the time it took to sync the clocks was approx 140ms
>>> > (don't quote me on it though because it is reported by our devops :)
>>> >
>>> > we have multiple clients (for example bunch of micro services are
>>> > reading from Cassandra) I am not sure how one can achieve
>>> > Linearizability by setting timestamps on the clients ? since there is
>>> no
>>> > total ordering across multiple clients.
>>> >
>>> > Thanks!
>>> >
>>> >
>>> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron >> > > wrote:
>>> >
>>> > Hi Kant,
>>> >
>>> > Clock synchronization is important - you should ensure that ntpd is
>>> > properly configured on all nodes. If your particular use case is
>>> > especially 

Re: How does cassandra achieve Linearizability?

2017-02-10 Thread Kant Kodali
Hi Jon,

Thanks a lot for your response. I am well aware that LWW != LWT, but I
was talking more in terms of LWW with respect to LWTs, which I believe
you answered. So thanks much!

kant

On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad 
wrote:

> LWT != Last Write Wins.  They are totally different.
>
> LWTs give you (assuming you also read at SERIAL) “atomic consistency”,
> meaning you are able to perform operations atomically and in isolation.
> That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra.  The lightweight part, btw, may be a little
> optimistic, especially if a key is under contention.  With regard to the
> “last write” part you’re asking about - w/ LWT Cassandra provides the
> timestamp and manages it as part of the ballot, and it always is
> increasing.  See 
> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
> From the code:
>
>  * Returns a timestamp suitable for paxos given the timestamp of the last
> known commit (or in progress update).
>  * Paxos ensures that the timestamp it uses for commits respects the
> serial order of those commits. It does so
>  * by having each replica reject any proposal whose timestamp is not
> strictly greater than the last proposal it
>  * accepted. So in practice, which timestamp we use for a given proposal
> doesn't affect correctness but it does
>  * affect the chance of making progress (if we pick a timestamp lower than
> what has been proposed before, our
>  * new proposal will just get rejected).
>
> Effectively paxos removes the ability to use custom timestamps and
> addresses clock variance by rejecting ballots with timestamps less than
> what was last seen.  You can learn more by reading through the other
> comments and code in that file.
>
> Last write wins is a free for all that guarantees you *nothing* except the
> timestamp is used as a tiebreaker.  Here we acknowledge things like the
> speed of light as being a real problem that isn’t going away anytime soon.
> This problem is sometimes addressed with event sourcing rather than
> mutating in place.
>
> Hope this helps.
>
> Jon
>
>
> On Feb 9, 2017, at 5:21 PM, Kant Kodali  wrote:
>
> @Justin I read this article http://www.datastax.com/dev/
> blog/lightweight-transactions-in-cassandra-2-0. And it clearly says
> Linearizable consistency can be achieved with LWT's.  so should I assume
> the Linearizability in the context of the above article is possible with
> LWT's and synchronization of clocks through ntpd ? because LWT's also
> follow Last Write Wins. isn't it? Also another question does most of the
> production clusters do setup ntpd? If so what is the time it takes to sync?
> any idea
>
> @Michael Shuler Are you referring to something like true time as in
> https://static.googleusercontent.com/media/research.google.com/en//
> archive/spanner-osdi2012.pdf?  Actually I never heard of setting up GPS
> modules and how that can be helpful. Let me research on that but good point.
>
> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler 
> wrote:
>
>> If you require the best precision you can get, setting up a pair of
>> stratum 1 ntpd masters in each data center location with GPS modules
>> is not terribly complex. Low latency and jitter on servers you manage.
>> 140ms is a long way away network-wise, and I would suggest that was a
>> poor choice of upstream (probably stratum 2 or 3) source.
>>
>> As Jonathan mentioned, there's no guarantee from Cassandra, but if you
>> need as close as you can get, you'll probably need to do it yourself.
>>
>> (I run several stratum 2 ntpd servers for pool.ntp.org)
>>
>> --
>> Kind regards,
>> Michael
>>
>> On 02/09/2017 06:47 PM, Kant Kodali wrote:
>> > Hi Justin,
>> >
>> > There are bunch of issues w.r.t to synchronization of clocks when we
>> > used ntpd. Also the time it took to sync the clocks was approx 140ms
>> > (don't quote me on it though because it is reported by our devops :)
>> >
>> > we have multiple clients (for example bunch of micro services are
>> > reading from Cassandra) I am not sure how one can achieve
>> > Linearizability by setting timestamps on the clients ? since there is no
>> > total ordering across multiple clients.
>> >
>> > Thanks!
>> >
>> >
>> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron > > > wrote:
>> >
>> > Hi Kant,
>> >
>> > Clock synchronization is important - you should ensure that ntpd is
>> > properly configured on all nodes. If your particular use case is
>> > especially sensitive to out-of-order mutations it is possible to set
>> > timestamps on the client side using the
>> > drivers. https://docs.datastax.com/en/d
>> eveloper/java-driver/3.1/manual/query_timestamps/
>> > > anual/query_timestamps/>
>> >
>> > We use our own NTP cluster to reduce clock 

Re: How does cassandra achieve Linearizability?

2017-02-09 Thread Jon Haddad
LWT != Last Write Wins.  They are totally different.  

LWTs give you (assuming you also read at SERIAL) “atomic consistency”, meaning 
you are able to perform operations atomically and in isolation.  That’s the 
safety blanket everyone wants but is extremely expensive, especially in 
Cassandra.  The lightweight part, btw, may be a little optimistic, especially 
if a key is under contention.  With regard to the “last write” part you’re 
asking about - w/ LWT Cassandra provides the timestamp and manages it as part 
of the ballot, and it always is increasing.  See 
org.apache.cassandra.service.ClientState#getTimestampForPaxos.  From the code:

 * Returns a timestamp suitable for paxos given the timestamp of the last known commit (or in progress update).
 * Paxos ensures that the timestamp it uses for commits respects the serial order of those commits. It does so
 * by having each replica reject any proposal whose timestamp is not strictly greater than the last proposal it
 * accepted. So in practice, which timestamp we use for a given proposal doesn't affect correctness but it does
 * affect the chance of making progress (if we pick a timestamp lower than what has been proposed before, our
 * new proposal will just get rejected).

Effectively paxos removes the ability to use custom timestamps and addresses 
clock variance by rejecting ballots with timestamps less than what was last 
seen.  You can learn more by reading through the other comments and code in 
that file. 

Last write wins is a free for all that guarantees you *nothing* except the 
timestamp is used as a tiebreaker.  Here we acknowledge things like the speed 
of light as being a real problem that isn’t going away anytime soon.  This 
problem is sometimes addressed with event sourcing rather than mutating in 
place.

Hope this helps.

Jon


> On Feb 9, 2017, at 5:21 PM, Kant Kodali  wrote:
> 
> @Justin I read this article 
> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0 
> . 
> And it clearly says Linearizable consistency can be achieved with LWT's.  so 
> should I assume the Linearizability in the context of the above article is 
> possible with LWT's and synchronization of clocks through ntpd ? because 
> LWT's also follow Last Write Wins. isn't it? Also another question does most 
> of the production clusters do setup ntpd? If so what is the time it takes to 
> sync? any idea
> 
> @Michael Shuler Are you referring to something like true time as in
> https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
>  
> ?
>   Actually I never heard of setting up GPS modules and how that can be 
> helpful. Let me research on that but good point.
> 
> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler  > wrote:
> If you require the best precision you can get, setting up a pair of
> stratum 1 ntpd masters in each data center location with GPS modules
> is not terribly complex. Low latency and jitter on servers you manage.
> 140ms is a long way away network-wise, and I would suggest that was a
> poor choice of upstream (probably stratum 2 or 3) source.
> 
> As Jonathan mentioned, there's no guarantee from Cassandra, but if you
> need as close as you can get, you'll probably need to do it yourself.
> 
> (I run several stratum 2 ntpd servers for pool.ntp.org )
> 
> --
> Kind regards,
> Michael
> 
> On 02/09/2017 06:47 PM, Kant Kodali wrote:
> > Hi Justin,
> >
> > There are bunch of issues w.r.t to synchronization of clocks when we
> > used ntpd. Also the time it took to sync the clocks was approx 140ms
> > (don't quote me on it though because it is reported by our devops :)
> >
> > we have multiple clients (for example bunch of micro services are
> > reading from Cassandra) I am not sure how one can achieve
> > Linearizability by setting timestamps on the clients ? since there is no
> > total ordering across multiple clients.
> >
> > Thanks!
> >
> >
> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron  > 
> > >> wrote:
> >
> > Hi Kant,
> >
> > Clock synchronization is important - you should ensure that ntpd is
> > properly configured on all nodes. If your particular use case is
> > especially sensitive to out-of-order mutations it is possible to set
> > timestamps on the client side using the
> > drivers. 
> > https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
> >  
> > 
> > 
> >  >  
> 

Re: How does cassandra achieve Linearizability?

2017-02-09 Thread Michael Shuler
On 02/09/2017 07:21 PM, Kant Kodali wrote:
> @Justin I read this article
> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.
> And it clearly says Linearizable consistency can be achieved with LWT's.
>  so should I assume the Linearizability in the context of the above
> article is possible with LWT's and synchronization of clocks through
> ntpd ? because LWT's also follow Last Write Wins. isn't it? Also another
> question does most of the production clusters do setup ntpd? If so what
> is the time it takes to sync? any idea

I'll let the others talk more intimately about LWT, but as for NTP, the
client machines do take some time to incrementally settle time
adjustments to meet up with the upstreams - they don't just jump time.

> @Michael Shuler Are you referring to something like true time as in
> https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
>  
> Actually I never heard of setting up GPS modules and how that can be
> helpful. Let me research on that but good point.

Nah, I'm talking much simpler. For instance, you could do this with a
Raspberry Pi:
http://www.satsignal.eu/ntp/Raspberry-Pi-NTP.html

-- 
Michael

> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler  > wrote:
> 
> If you require the best precision you can get, setting up a pair of
> stratum 1 ntpd masters in each data center location with GPS modules
> is not terribly complex. Low latency and jitter on servers you manage.
> 140ms is a long way away network-wise, and I would suggest that was a
> poor choice of upstream (probably stratum 2 or 3) source.
> 
> As Jonathan mentioned, there's no guarantee from Cassandra, but if you
> need as close as you can get, you'll probably need to do it yourself.
> 
> (I run several stratum 2 ntpd servers for pool.ntp.org
> )
> 
> --
> Kind regards,
> Michael
> 
> On 02/09/2017 06:47 PM, Kant Kodali wrote:
> > Hi Justin,
> >
> > There are bunch of issues w.r.t to synchronization of clocks when we
> > used ntpd. Also the time it took to sync the clocks was approx 140ms
> > (don't quote me on it though because it is reported by our devops :)
> >
> > we have multiple clients (for example bunch of micro services are
> > reading from Cassandra) I am not sure how one can achieve
> > Linearizability by setting timestamps on the clients ? since there is no
> > total ordering across multiple clients.
> >
> > Thanks!
> >
> >
> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron  
> > >> wrote:
> >
> > Hi Kant,
> >
> > Clock synchronization is important - you should ensure that ntpd is
> > properly configured on all nodes. If your particular use case is
> > especially sensitive to out-of-order mutations it is possible to set
> > timestamps on the client side using the
> > drivers. 
> https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
> 
> 
> > 
>  
> >
> >
> > We use our own NTP cluster to reduce clock drift as much as
> > possible, but public NTP servers are good enough for most
> > uses. 
> https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
> 
> 
> > 
>  
> >
> >
> > Cheers,
> > Justin
> >
> > On Thu, 9 Feb 2017 at 16:09 Kant Kodali  
> > >> wrote:
> >
> > How does Cassandra achieve Linearizability with “Last write
> > wins” (conflict resolution methods based on time-of-day clocks) 
> ?
> >
> > Relying on synchronized clocks is almost certainly
> > non-linearizable, because clock timestamps cannot be guaranteed
> > to be consistent with actual event ordering due to clock skew.
> > isn't it?
> >
> > Thanks!
> >
> > --
> >
> > Justin Cameron
> >
> > Senior Software Engineer | Instaclustr
> >
> >
> >
> >
> > This email has been sent on behalf of Instaclustr Pty Ltd
> > (Australia) and Instaclustr Inc 

Re: How does cassandra achieve Linearizability?

2017-02-09 Thread Kant Kodali
@Justin I read this article
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.
And it clearly says Linearizable consistency can be achieved with LWTs.
So should I assume the Linearizability in the context of the above article
is possible with LWTs plus synchronization of clocks through ntpd? Because
LWTs also follow Last Write Wins, don't they? Also, another question: do
most production clusters set up ntpd? If so, how long does it take to
sync? Any idea?

@Michael Shuler Are you referring to something like TrueTime, as in
https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
I have actually never heard of setting up GPS modules or how that can be
helpful. Let me research that, but good point.

On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler 
wrote:

> If you require the best precision you can get, setting up a pair of
> stratum 1 ntpd masters in each data center location with GPS modules
> is not terribly complex. Low latency and jitter on servers you manage.
> 140ms is a long way away network-wise, and I would suggest that was a
> poor choice of upstream (probably stratum 2 or 3) source.
>
> As Jonathan mentioned, there's no guarantee from Cassandra, but if you
> need as close as you can get, you'll probably need to do it yourself.
>
> (I run several stratum 2 ntpd servers for pool.ntp.org)
>
> --
> Kind regards,
> Michael
>
> On 02/09/2017 06:47 PM, Kant Kodali wrote:
> > Hi Justin,
> >
> > There are bunch of issues w.r.t to synchronization of clocks when we
> > used ntpd. Also the time it took to sync the clocks was approx 140ms
> > (don't quote me on it though because it is reported by our devops :)
> >
> > we have multiple clients (for example bunch of micro services are
> > reading from Cassandra) I am not sure how one can achieve
> > Linearizability by setting timestamps on the clients ? since there is no
> > total ordering across multiple clients.
> >
> > Thanks!
> >
> >
> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron  > > wrote:
> >
> > Hi Kant,
> >
> > Clock synchronization is important - you should ensure that ntpd is
> > properly configured on all nodes. If your particular use case is
> > especially sensitive to out-of-order mutations it is possible to set
> > timestamps on the client side using the
> > drivers. https://docs.datastax.com/en/developer/java-driver/3.1/
> manual/query_timestamps/
> >  manual/query_timestamps/>
> >
> > We use our own NTP cluster to reduce clock drift as much as
> > possible, but public NTP servers are good enough for most
> > uses. https://www.instaclustr.com/blog/2015/11/05/apache-
> cassandra-synchronization/
> >  cassandra-synchronization/>
> >
> > Cheers,
> > Justin
> >
> > On Thu, 9 Feb 2017 at 16:09 Kant Kodali  > > wrote:
> >
> > How does Cassandra achieve Linearizability with “Last write
> > wins” (conflict resolution methods based on time-of-day clocks) ?
> >
> > Relying on synchronized clocks is almost certainly
> > non-linearizable, because clock timestamps cannot be guaranteed
> > to be consistent with actual event ordering due to clock skew.
> > isn't it?
> >
> > Thanks!
> >
> > --
> >
> > Justin Cameron
> >
> > Senior Software Engineer | Instaclustr
> >
> >
> >
> >
> > This email has been sent on behalf of Instaclustr Pty Ltd
> > (Australia) and Instaclustr Inc (USA).
> >
> > This email and any attachments may contain confidential and legally
> > privileged information.  If you are not the intended recipient, do
> > not copy or disclose its content, but please reply to this email
> > immediately and highlight the error to the sender and then
> > immediately delete the message.
> >
> >
>
>


Re: How does cassandra achieve Linearizability?

2017-02-09 Thread Michael Shuler
If you require the best precision you can get, setting up a pair of
stratum 1 ntpd masters in each data center location with GPS modules
is not terribly complex, and gives you low latency and jitter on servers
you manage. 140ms is a long way away network-wise, and I would suggest
that was a poor choice of upstream (probably stratum 2 or 3) source.

As Jonathan mentioned, there's no guarantee from Cassandra, but if you
need as close as you can get, you'll probably need to do it yourself.

(I run several stratum 2 ntpd servers for pool.ntp.org)

-- 
Kind regards,
Michael

On 02/09/2017 06:47 PM, Kant Kodali wrote:
> Hi Justin,
> 
> There are bunch of issues w.r.t to synchronization of clocks when we
> used ntpd. Also the time it took to sync the clocks was approx 140ms
> (don't quote me on it though because it is reported by our devops :) 
> 
> we have multiple clients (for example bunch of micro services are
> reading from Cassandra) I am not sure how one can achieve
> Linearizability by setting timestamps on the clients ? since there is no
> total ordering across multiple clients.
> 
> Thanks!
> 
> 
> On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron  > wrote:
> 
> Hi Kant,
> 
> Clock synchronization is important - you should ensure that ntpd is
> properly configured on all nodes. If your particular use case is
> especially sensitive to out-of-order mutations it is possible to set
> timestamps on the client side using the
> drivers. 
> https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
> 
> 
> 
> We use our own NTP cluster to reduce clock drift as much as
> possible, but public NTP servers are good enough for most
> uses. 
> https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
> 
> 
> 
> Cheers,
> Justin
> 
> On Thu, 9 Feb 2017 at 16:09 Kant Kodali  > wrote:
> 
> How does Cassandra achieve Linearizability with “Last write
> wins” (conflict resolution methods based on time-of-day clocks) ?
> 
> Relying on synchronized clocks is almost certainly
> non-linearizable, because clock timestamps cannot be guaranteed
> to be consistent with actual event ordering due to clock skew.
> isn't it?
> 
> Thanks!
> 
> -- 
> 
> Justin Cameron
> 
> Senior Software Engineer | Instaclustr
> 
> 
> 
> 
> This email has been sent on behalf of Instaclustr Pty Ltd
> (Australia) and Instaclustr Inc (USA).
> 
> This email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do
> not copy or disclose its content, but please reply to this email
> immediately and highlight the error to the sender and then
> immediately delete the message.
> 
> 



Re: How does cassandra achieve Linearizability?

2017-02-09 Thread Justin Cameron
I think the answer to that question will depend on your specific use case
and requirements.

If you're only doing a small number of updates but need to be sure they are
applied in order, you may be able to use lightweight transactions (keep in
mind there's a performance hit here, so it's not an answer for high-volume
mutations).
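
For example, with the DataStax Java driver something like this would exercise that path. This is only a hedged sketch; the contact point, keyspace, table and values are placeholders, and it assumes a driver 3.x classpath:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class LwtSketch
{
    public static void main(String[] args)
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo"))
        {
            // The IF clause is what triggers Paxos; a plain UPDATE stays last-write-wins.
            ResultSet rs = session.execute(
                "UPDATE accounts SET balance = 90 WHERE id = 42 IF balance = 100");
            Row row = rs.one();
            // [applied] is false when the condition did not hold; the row then also
            // carries the current value that made the CAS fail.
            System.out.println("applied: " + row.getBool("[applied]"));

            // Reads that must observe committed Paxos state should use SERIAL.
            Statement read = new SimpleStatement("SELECT balance FROM accounts WHERE id = 42")
                    .setConsistencyLevel(ConsistencyLevel.SERIAL);
            System.out.println(session.execute(read).one().getLong("balance"));
        }
    }
}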

For high-volume updates you could look at using an append-only time-series
style data model, using a default TTL to drop old data.
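
As a rough sketch of that kind of table (all names and the 30-day TTL below are just illustrative, and it reuses the Session from the sketch above):

// Sketch: an append-only, time-bucketed table where every row carries a
// default TTL, so old readings age out instead of being updated in place.
String ddl =
    "CREATE TABLE IF NOT EXISTS demo.sensor_readings (" +
    "  sensor_id uuid," +
    "  day date," +                 // day bucket keeps partitions bounded
    "  ts timestamp," +
    "  value double," +
    "  PRIMARY KEY ((sensor_id, day), ts)" +
    ") WITH CLUSTERING ORDER BY (ts DESC)" +
    " AND default_time_to_live = 2592000";   // 30 days, in seconds
session.execute(ddl);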

If your data isn't time-series in nature and has a high volume of updates,
then you really just need to make sure either your clients or Cassandra
nodes (preferably both) are in sync.

Justin

On Thu, 9 Feb 2017 at 16:47 Kant Kodali  wrote:

> Hi Justin,
>
> There are bunch of issues w.r.t to synchronization of clocks when we used
> ntpd. Also the time it took to sync the clocks was approx 140ms (don't
> quote me on it though because it is reported by our devops :)
>
> we have multiple clients (for example bunch of micro services are reading
> from Cassandra) I am not sure how one can achieve Linearizability by
> setting timestamps on the clients ? since there is no total ordering across
> multiple clients.
>
> Thanks!
>
>
> On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron 
> wrote:
>
> Hi Kant,
>
> Clock synchronization is important - you should ensure that ntpd is
> properly configured on all nodes. If your particular use case is especially
> sensitive to out-of-order mutations it is possible to set timestamps on the
> client side using the drivers.
> https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
>
> We use our own NTP cluster to reduce clock drift as much as possible, but
> public NTP servers are good enough for most uses.
> https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>
> Cheers,
> Justin
>
> On Thu, 9 Feb 2017 at 16:09 Kant Kodali  wrote:
>
> How does Cassandra achieve Linearizability with “Last write wins”
> (conflict resolution methods based on time-of-day clocks) ?
>
> Relying on synchronized clocks is almost certainly non-linearizable,
> because clock timestamps cannot be guaranteed to be consistent with actual
> event ordering due to clock skew. isn't it?
>
> Thanks!
>
> --
>
> Justin Cameron
>
> Senior Software Engineer | Instaclustr
>
>
>
>
> This email has been sent on behalf of Instaclustr Pty Ltd (Australia) and
> Instaclustr Inc (USA).
>
> This email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>
>
> --

Justin Cameron

Senior Software Engineer | Instaclustr




This email has been sent on behalf of Instaclustr Pty Ltd (Australia) and
Instaclustr Inc (USA).

This email and any attachments may contain confidential and legally
privileged information.  If you are not the intended recipient, do not copy
or disclose its content, but please reply to this email immediately and
highlight the error to the sender and then immediately delete the message.


Re: How does cassandra achieve Linearizability?

2017-02-09 Thread Jonathan Haddad
It doesn't, nor does it claim to.

On Thu, Feb 9, 2017 at 4:09 PM Kant Kodali  wrote:

> How does Cassandra achieve Linearizability with “Last write wins”
> (conflict resolution methods based on time-of-day clocks) ?
>
> Relying on synchronized clocks is almost certainly non-linearizable,
> because clock timestamps cannot be guaranteed to be consistent with actual
> event ordering due to clock skew. isn't it?
>
> Thanks!
>


Re: How does cassandra achieve Linearizability?

2017-02-09 Thread Kant Kodali
Hi Justin,

There are a bunch of issues w.r.t. synchronization of clocks when we used
ntpd. Also, the time it took to sync the clocks was approx 140ms (don't
quote me on that though, because it was reported by our devops :)

We have multiple clients (for example, a bunch of microservices reading
from Cassandra). I am not sure how one can achieve Linearizability by
setting timestamps on the clients, since there is no total ordering across
multiple clients.

Thanks!


On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron 
wrote:

> Hi Kant,
>
> Clock synchronization is important - you should ensure that ntpd is
> properly configured on all nodes. If your particular use case is especially
> sensitive to out-of-order mutations it is possible to set timestamps on the
> client side using the drivers. https://docs.datastax.com/en/developer/
> java-driver/3.1/manual/query_timestamps/
>
> We use our own NTP cluster to reduce clock drift as much as possible, but
> public NTP servers are good enough for most uses. https://www.instaclustr.
> com/blog/2015/11/05/apache-cassandra-synchronization/
>
> Cheers,
> Justin
>
> On Thu, 9 Feb 2017 at 16:09 Kant Kodali  wrote:
>
>> How does Cassandra achieve Linearizability with “Last write wins”
>> (conflict resolution methods based on time-of-day clocks) ?
>>
>> Relying on synchronized clocks is almost certainly non-linearizable,
>> because clock timestamps cannot be guaranteed to be consistent with actual
>> event ordering due to clock skew. isn't it?
>>
>> Thanks!
>>
> --
>
> Justin Cameron
>
> Senior Software Engineer | Instaclustr
>
>
>
>
> This email has been sent on behalf of Instaclustr Pty Ltd (Australia) and
> Instaclustr Inc (USA).
>
> This email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>
>


Re: How does cassandra achieve Linearizability?

2017-02-09 Thread Justin Cameron
Hi Kant,

Clock synchronization is important - you should ensure that ntpd is
properly configured on all nodes. If your particular use case is especially
sensitive to out-of-order mutations it is possible to set timestamps on the
client side using the drivers.
https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
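
A minimal sketch of what that looks like with the Java driver (assuming driver 3.x; the contact point and schema below are placeholders):

import com.datastax.driver.core.AtomicMonotonicTimestampGenerator;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class ClientTimestampSketch
{
    public static void main(String[] args)
    {
        // Every statement sent through this Cluster gets a client-generated,
        // monotonically increasing microsecond timestamp.
        try (Cluster cluster = Cluster.builder()
                                      .addContactPoint("127.0.0.1")
                                      .withTimestampGenerator(new AtomicMonotonicTimestampGenerator())
                                      .build();
             Session session = cluster.connect("demo"))
        {
            // A single write can also carry an explicit timestamp (microseconds
            // since the epoch), which overrides the generator for that statement.
            Statement write = new SimpleStatement("UPDATE accounts SET balance = 90 WHERE id = 42")
                    .setDefaultTimestamp(System.currentTimeMillis() * 1000);
            session.execute(write);
        }
    }
}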

We use our own NTP cluster to reduce clock drift as much as possible, but
public NTP servers are good enough for most uses.
https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/

Cheers,
Justin

On Thu, 9 Feb 2017 at 16:09 Kant Kodali  wrote:

> How does Cassandra achieve Linearizability with “Last write wins”
> (conflict resolution methods based on time-of-day clocks) ?
>
> Relying on synchronized clocks is almost certainly non-linearizable,
> because clock timestamps cannot be guaranteed to be consistent with actual
> event ordering due to clock skew. isn't it?
>
> Thanks!
>
-- 

Justin Cameron

Senior Software Engineer | Instaclustr




This email has been sent on behalf of Instaclustr Pty Ltd (Australia) and
Instaclustr Inc (USA).

This email and any attachments may contain confidential and legally
privileged information.  If you are not the intended recipient, do not copy
or disclose its content, but please reply to this email immediately and
highlight the error to the sender and then immediately delete the message.