Hi,


Cassandra's Paxos implementation lacks many optimizations that would
drastically improve throughput and latency. You need consensus, but it
doesn't have to be exorbitantly expensive, and it shouldn't fall over
under any kind of contention.


For instance, you could implement EPaxos
https://issues.apache.org/jira/browse/CASSANDRA-6246[1], batch multiple
operations into the same Paxos round, have an affinity for a specific
proposer for a specific partition, implement asynchronous commit, use a
more efficient implementation of the Paxos log, and possibly more.


Ariel





On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:

> Hi Kant,

> 

> If you read the published papers about Paxos, you will most probably
> recognize that there is no way to "do it better". This is a
> conceptual consequence of the nature of distributed systems + the CAP
> theorem.
> If you want A+P in the triangle, then C is very expensive. C* is made
> for A+P mostly, with tunable C. ACID databases are a completely
> different thing, as they are mostly either not partition-tolerant, not
> highly available, or not scalable (in a distributed manner, not
> speaking of "monolithic super servers").
> 

> There is no free lunch ...

> 

> 

> 2017-02-10 11:09 GMT+01:00 Kant Kodali <k...@peernova.com>:

>> "That’s the safety blanket everyone wants but is extremely expensive,
>> especially in Cassandra."
>> 

>> yes LWT's are expensive. Are there any plans to make this better? 

>> 

>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali
>> <k...@peernova.com> wrote:
>>> Hi Jon,

>>> 

>>> Thanks a lot for your response. I am well aware that LWW != LWT,
>>> but I was talking more in terms of LWW with respect to LWTs,
>>> which I believe you answered. So thanks much!
>>> 

>>> 

>>> kant

>>> 

>>> 

>>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad
>>> <jonathan.had...@gmail.com> wrote:
>>>> LWT != Last Write Wins.  They are totally different.  

>>>> 

>>>> LWTs give you (assuming you also read at SERIAL) “atomic
>>>> consistency”, meaning you are able to perform operations atomically
>>>> and in isolation.  That’s the safety blanket everyone wants but is
>>>> extremely expensive, especially in Cassandra.  The lightweight
>>>> part, btw, may be a little optimistic, especially if a key is under
>>>> contention.  With regard to the “last write” part you’re asking
>>>> about - w/ LWT Cassandra provides the timestamp and manages it as
>>>> part of the ballot, and it always is increasing.  See
>>>> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
>>>> From the code:
>>>> 

>>>>  * Returns a timestamp suitable for paxos given the timestamp of the
>>>>  * last known commit (or in progress update).
>>>>  * Paxos ensures that the timestamp it uses for commits respects the
>>>>  * serial order of those commits. It does so by having each replica
>>>>  * reject any proposal whose timestamp is not strictly greater than
>>>>  * the last proposal it accepted. So in practice, which timestamp we
>>>>  * use for a given proposal doesn't affect correctness but it does
>>>>  * affect the chance of making progress (if we pick a timestamp lower
>>>>  * than what has been proposed before, our new proposal will just get
>>>>  * rejected).

>>>> 

>>>> Effectively, Paxos removes the ability to use custom timestamps and
>>>> addresses clock variance by rejecting ballots with timestamps less
>>>> than what was last seen.  You can learn more by reading through the
>>>> other comments and code in that file.
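The rule that comment describes can be sketched in a few lines. This is a
simplified illustration of the idea, not the actual
getTimestampForPaxos logic:

```java
// Sketch of the ballot-timestamp rule: the proposer never goes backwards,
// and each replica accepts only proposals strictly newer than the last
// one it accepted. Simplified; not the real Cassandra implementation.
public class PaxosTimestampSketch {
    // Proposer side: never go backwards, even if the wall clock does.
    static long nextBallotMicros(long wallClockMicros, long lastSeenMicros) {
        return Math.max(wallClockMicros, lastSeenMicros + 1);
    }

    // Replica side: accept only proposals strictly newer than the last accepted.
    static boolean accepts(long lastAcceptedMicros, long proposalMicros) {
        return proposalMicros > lastAcceptedMicros;
    }

    public static void main(String[] args) {
        long last = 1_000_000L;
        // Clock skew: the wall clock is behind the last seen commit.
        long ts = nextBallotMicros(900_000L, last);
        assert ts == 1_000_001L;          // still strictly increasing
        assert accepts(last, ts);         // so replicas can accept it
        assert !accepts(last, 999_999L);  // a stale timestamp is rejected
    }
}
```

This is why correctness doesn't depend on the wall clock: a skewed clock
only reduces the chance of making progress, as the quoted comment says.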
>>>> 

>>>> Last write wins is a free for all that guarantees you *nothing*
>>>> except the timestamp is used as a tiebreaker.  Here we acknowledge
>>>> things like the speed of light as being a real problem that isn’t
>>>> going away anytime soon.  This problem is sometimes addressed with
>>>> event sourcing rather than mutating in place.
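For contrast, here is a minimal last-write-wins reconciliation sketch
(simplified; Cassandra does something similar, breaking exact-timestamp
ties deterministically by comparing values so all replicas converge the
same way):

```java
// Minimal last-write-wins sketch: the highest timestamp wins; on a
// timestamp tie, break it deterministically by comparing values, so
// every replica resolves the conflict identically. Illustration only.
public class LastWriteWins {
    static String resolve(String v1, long ts1, String v2, long ts2) {
        if (ts1 != ts2) return ts1 > ts2 ? v1 : v2;
        // Tie: pick the lexically larger value so all replicas agree.
        return v1.compareTo(v2) >= 0 ? v1 : v2;
    }

    public static void main(String[] args) {
        // The later timestamp wins, regardless of arrival order.
        assert resolve("old", 100L, "new", 200L).equals("new");
        // Ties resolve deterministically, not by arrival order.
        assert resolve("b", 100L, "a", 100L).equals("b");
    }
}
```

Note there is no coordination here at all, which is exactly why it
guarantees nothing beyond the tiebreak.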
>>>> 

>>>> Hope this helps.

>>>> 

>>>> 

>>>> Jon

>>>> 

>>>> 

>>>> 

>>>> 

>>>>> On Feb 9, 2017, at 5:21 PM, Kant Kodali <k...@peernova.com> wrote:
>>>>> 

>>>>> @Justin I read this article
>>>>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.
>>>>> It clearly says linearizable consistency can be achieved with
>>>>> LWTs. So should I assume that linearizability in the context of
>>>>> that article is possible with LWTs plus synchronization of clocks
>>>>> through ntpd? Because LWTs also follow last-write-wins, don't
>>>>> they? Another question: do most production clusters set up ntpd?
>>>>> If so, how long does it take to sync? Any idea?
>>>>> 

>>>>> @Michael Shuler Are you referring to something like TrueTime, as
>>>>> in
>>>>> https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
>>>>> Actually, I had never heard of setting up GPS modules and how that
>>>>> can be helpful. Let me research that, but good point.
>>>>> 

>>>>> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler
>>>>> <mich...@pbandjelly.org> wrote:
>>>>>> If you require the best precision you can get, setting up a pair
>>>>>> of stratum 1 ntpd masters in each data center location with GPS
>>>>>> modules is not terribly complex. Low latency and jitter on
>>>>>> servers you manage. 140ms is a long way away network-wise, and I
>>>>>> would suggest that was a poor choice of upstream (probably
>>>>>> stratum 2 or 3) source.

>>>>>> 

>>>>>>  As Jonathan mentioned, there's no guarantee from Cassandra, but
>>>>>>  if you need as close as you can get, you'll probably need to do
>>>>>>  it yourself.
>>>>>> 

>>>>>>  (I run several stratum 2 ntpd servers for pool.ntp.org[2])

>>>>>>
>>>>>>  --
>>>>>>  Kind regards, Michael
>>>>>>
>>>>>>  On 02/09/2017 06:47 PM, Kant Kodali wrote:
>>>>>>  > Hi Justin,
>>>>>>  >
>>>>>>  > There are a bunch of issues w.r.t. synchronization of clocks
>>>>>>  > when we used ntpd. Also, the time it took to sync the clocks
>>>>>>  > was approx 140ms (don't quote me on it though, because it was
>>>>>>  > reported by our devops :)
>>>>>>  >
>>>>>>  > We have multiple clients (for example, a bunch of microservices
>>>>>>  > reading from Cassandra). I am not sure how one can achieve
>>>>>>  > linearizability by setting timestamps on the clients, since
>>>>>>  > there is no total ordering across multiple clients.
>>>>>>  >
>>>>>>  > Thanks!
>>>>>>  >
>>>>>>  >
>>>>>>  > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron
>>>>>>  > <jus...@instaclustr.com
>>>>>> > <mailto:jus...@instaclustr.com>> wrote:
>>>>>>  >
>>>>>>  >     Hi Kant,
>>>>>>  >
>>>>>>  >     Clock synchronization is important - you should ensure
>>>>>>  >     that ntpd is properly configured on all nodes. If your
>>>>>>  >     particular use case is especially sensitive to out-of-
>>>>>>  >     order mutations it is possible to set timestamps on the
>>>>>>  >     client side using the drivers.
>>>>>>  >     
>>>>>> https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
>>>>>>  >     
>>>>>> <https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/>
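On the client-side timestamp option: CQL itself supports USING TIMESTAMP,
so a client can pin the write time explicitly in microseconds since the
epoch. A minimal sketch that just builds such a statement (table and
column names are made up):

```java
// Hypothetical sketch: build a CQL statement that pins the write
// timestamp on the client side via USING TIMESTAMP (microseconds since
// the epoch). Table/column names are invented for illustration.
public class ClientTimestamp {
    static String insertWithTimestamp(String key, String value, long micros) {
        return "INSERT INTO ks.kv (k, v) VALUES ('" + key + "', '" + value + "')"
             + " USING TIMESTAMP " + micros;
    }

    public static void main(String[] args) {
        String cql = insertWithTimestamp("user:42", "hello", 1486682675000000L);
        // The timestamp is carried by the statement, not the server clock.
        assert cql.endsWith("USING TIMESTAMP 1486682675000000");
        System.out.println(cql);
    }
}
```

The driver docs linked above cover the equivalent driver-level mechanism
(per-statement default timestamps and timestamp generators).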
>>>>>>  >
>>>>>>  >     We use our own NTP cluster to reduce clock drift as much
>>>>>>  >     as possible, but public NTP servers are good enough for
>>>>>>  >     most uses.
>>>>>>  >     
>>>>>> https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>>>>>>  >     
>>>>>> <https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/>
>>>>>>  >
>>>>>>  >     Cheers, Justin
>>>>>>  >
>>>>>>  >     On Thu, 9 Feb 2017 at 16:09 Kant Kodali <k...@peernova.com
>>>>>> >     <mailto:k...@peernova.com>> wrote:
>>>>>>  >
>>>>>>  >         How does Cassandra achieve Linearizability with “Last
>>>>>>  >         write wins” (conflict resolution methods based on time-of-
>>>>>>  >         day clocks) ?
>>>>>>  >
>>>>>>  >         Relying on synchronized clocks is almost certainly
>>>>>>  >         non-linearizable, because clock timestamps cannot be
>>>>>>  >         guaranteed to be consistent with actual event ordering
>>>>>>  >         due to clock skew. Isn't it?
>>>>>>  >
>>>>>>  >         Thanks!
>>>>>>  >
>>>>>>  >     --
>>>>>>  >
>>>>>>  >     Justin Cameron
>>>>>>  >
>>>>>>  >     Senior Software Engineer | Instaclustr
>>>>>>  >
>>>>>>  >
>>>>>>  >
>>>>>>  >

>>>>>>  >

>>>>>>  >

>>>>>> 

>>>>> 

>>>> 

>>> 

>> 

> 

> 

> 

> -- 

> Benjamin Roth

> Prokurist

> 

> Jaumo GmbH · www.jaumo.com

> Wehrstraße 46 · 73035 Göppingen · Germany

> Phone +49 7161 304880-6 · Fax +49 7161 304880-1

> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer




Links:

  1. 
https://issues.apache.org/jira/browse/CASSANDRA-6246?jql=text%20~%20%22epaxos%22
  2. http://pool.ntp.org/
