Hi Ariel,

Can we really expect the fix in 3.11.x as the ticket
https://issues.apache.org/jira/browse/CASSANDRA-6246 says?

Thanks,
kant

On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:

> Hi,
>
> That would work and would help a lot with the dueling proposer issue.
>
> A lot of the leader election stuff is designed to reduce the number of
> roundtrips and not just address the dueling proposer issue. Those will have
> downtime because it's there for correctness. Just adding an affinity for a
> specific proposer is probably a free lunch.
>
> I don't think you can group keys because the Paxos proposals are per
> partition which is why we get linear scale out for Paxos. I don't believe
> it's linearizable across multiple partitions. You can use the clustering
> key and deterministically pick one of the live replicas for that clustering
> key. Sort the list of replicas by IP, hash the clustering key, use the hash
> as an index into the list of replicas.
>
> Batching is of limited usefulness because we only use Paxos for CAS I
> think? So in a batch by definition all but one will fail the CAS. This is
> something where a distinguished coordinator could help by failing the rest
> of the contending requests more inexpensively than it currently does.
>
> Ariel
>
> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>
> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> Hi,
>
> Classic Paxos doesn't have a leader. There are variants on the original
> Lamport approach that will elect a leader (or some other variation like
> Mencius) to improve throughput, latency, and performance under contention.
> Cassandra implements the approach from the beginning of "Paxos Made Simple"
> (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware
> of. There is no distinguished proposer (leader).
>
> That paper does go on to discuss electing a distinguished proposer, but
> that was never done for C*. I believe it's not considered a good fit for C*
> philosophically.
>
> Ariel
>
> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>
> @Ariel Weisberg EPaxos looks very interesting as it looks like it doesn't
> need any designated leader for C* but I am assuming the paxos that is
> implemented today for LWT's requires Leader election and If so, don't we
> need to have an odd number of nodes or racks or DC's to satisfy N = 2F + 1
> constraint to tolerate F failures ? I understand it is not needed when not
> using LWT's since Cassandra is a master-less system.
>
> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali <k...@peernova.com> wrote:
>
> Thanks Ariel! Yes I knew there are so many variations and optimizations of
> Paxos. I just wanted to see if we had any plans on improving the existing
> Paxos implementation and it is great to see the work is under progress! I
> am going to follow that ticket and read up the references pointed in it
>
> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> Hi,
>
> Cassandra's implementation of Paxos doesn't implement many optimizations
> that would drastically improve throughput and latency. You need consensus,
> but it doesn't have to be exorbitantly expensive and fall over under any
> kind of contention.
>
> For instance you could implement EPaxos
> https://issues.apache.org/jira/browse/CASSANDRA-6246,
> batch multiple operations into the same Paxos round, have an affinity for a
> specific proposer for a specific partition, implement asynchronous commit,
> use a more efficient implementation of the Paxos log, and maybe other
> things.
>
> Ariel
>
> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>
> Hi Kant,
>
> If you read the published papers about Paxos, you will most probably
> recognize that there is no way to "do it better". This is a conceptional
> thing due to the nature of distributed systems + the CAP theorem.
> If you want A+P in the triangle, then C is very expensive. CS is made for
> A+P mostly with tunable C. In ACID databases this is a completely different
> thing as they are mostly either not partition tolerant, not highly
> available or not scalable (in a distributed manner, not speaking of
> "monolithic super servers").
>
> There is no free lunch ...
>
> 2017-02-10 11:09 GMT+01:00 Kant Kodali <k...@peernova.com>:
>
> "That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra."
>
> yes LWT's are expensive. Are there any plans to make this better?
>
> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali <k...@peernova.com> wrote:
>
> Hi Jon,
>
> Thanks a lot for your response. I am well aware that the LWW != LWT but I
> was talking more in terms of LWW with respective to LWT's which I believe
> you answered. so thanks much!
>
> kant
>
> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad <jonathan.had...@gmail.com>
> wrote:
>
> LWT != Last Write Wins. They are totally different.
>
> LWTs give you (assuming you also read at SERIAL) “atomic consistency”,
> meaning you are able to perform operations atomically and in isolation.
> That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra. The lightweight part, btw, may be a little
> optimistic, especially if a key is under contention. With regard to the
> “last write” part you’re asking about - w/ LWT Cassandra provides the
> timestamp and manages it as part of the ballot, and it always is
> increasing. See
> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
> From the code:
>
> * Returns a timestamp suitable for paxos given the timestamp of the last known commit (or in progress update).
> * Paxos ensures that the timestamp it uses for commits respects the serial order of those commits. It does so
> * by having each replica reject any proposal whose timestamp is not strictly greater than the last proposal it
> * accepted.
> * So in practice, which timestamp we use for a given proposal doesn't affect correctness but it does
> * affect the chance of making progress (if we pick a timestamp lower than what has been proposed before, our
> * new proposal will just get rejected).
>
> Effectively paxos removes the ability to use custom timestamps and
> addresses clock variance by rejecting ballots with timestamps less than
> what was last seen. You can learn more by reading through the other
> comments and code in that file.
>
> Last write wins is a free for all that guarantees you *nothing* except the
> timestamp is used as a tiebreaker. Here we acknowledge things like the
> speed of light as being a real problem that isn’t going away anytime soon.
> This problem is sometimes addressed with event sourcing rather than
> mutating in place.
>
> Hope this helps.
>
> Jon
>
> On Feb 9, 2017, at 5:21 PM, Kant Kodali <k...@peernova.com> wrote:
>
> @Justin I read this article
> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.
> And it clearly says
> Linearizable consistency can be achieved with LWT's. so should I assume
> the Linearizability in the context of the above article is possible with
> LWT's and synchronization of clocks through ntpd ? because LWT's also
> follow Last Write Wins. isn't it? Also another question does most of the
> production clusters do setup ntpd? If so what is the time it takes to sync?
> any idea
>
> @Micheal Schuler Are you referring to something like true time as in
> https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
> Actually I never heard of setting
> up GPS modules and how that can be helpful. Let me research on that but
> good point.
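
The rule Jon quotes from the getTimestampForPaxos comment (a replica rejects any proposal whose timestamp is not strictly greater than the last one it accepted) can be illustrated with a toy model. This is only a sketch and not Cassandra's code: the `Replica` class, its `accept` method, and `next_paxos_timestamp` are hypothetical names, and the prepare/promise phase and quorum handling of real Paxos are left out.

```python
class Replica:
    """Toy replica that tracks only the last accepted proposal timestamp."""

    def __init__(self):
        # Hypothetical starting value: nothing has been accepted yet.
        self.last_accepted_ts = -1

    def accept(self, proposal_ts):
        """Accept a proposal only if its timestamp is strictly greater."""
        if proposal_ts <= self.last_accepted_ts:
            # Stale or equal timestamp: rejected, the proposer has to retry.
            return False
        self.last_accepted_ts = proposal_ts
        return True


def next_paxos_timestamp(clock_micros, last_known_ts):
    """Pick a timestamp that cannot lose to the last known proposal.

    Assumption for illustration: the proposer knows the timestamp of the
    last commit or in-progress update and never goes below it, even if
    its local clock is behind.
    """
    return max(clock_micros, last_known_ts + 1)
```

With this model a proposer whose clock is behind still makes progress, because it bumps its timestamp past what it has already seen; clock skew affects liveness here, not the order in which proposals are accepted.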
>
> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler <mich...@pbandjelly.org>
> wrote:
>
> If you require the best precision you can get, setting up a pair of
> stratum 1 ntpd masters in each data center location with a GPS modules
> is not terribly complex. Low latency and jitter on servers you manage.
> 140ms is a long way away network-wise, and I would suggest that was a
> poor choice of upstream (probably stratum 2 or 3) source.
>
> As Jonathan mentioned, there's no guarantee from Cassandra, but if you
> need as close as you can get, you'll probably need to do it yourself.
>
> (I run several stratum 2 ntpd servers for pool.ntp.org)
>
> --
> Kind regards,
> Michael
>
> On 02/09/2017 06:47 PM, Kant Kodali wrote:
> > Hi Justin,
> >
> > There are bunch of issues w.r.t to synchronization of clocks when we
> > used ntpd. Also the time it took to sync the clocks was approx 140ms
> > (don't quote me on it though because it is reported by our devops :)
> >
> > we have multiple clients (for example bunch of micro services are
> > reading from Cassandra) I am not sure how one can achieve
> > Linearizability by setting timestamps on the clients ? since there is no
> > total ordering across multiple clients.
> >
> > Thanks!
> >
> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron <jus...@instaclustr.com> wrote:
> >
> > Hi Kant,
> >
> > Clock synchronization is important - you should ensure that ntpd is
> > properly configured on all nodes. If your particular use case is
> > especially sensitive to out-of-order mutations it is possible to set
> > timestamps on the client side using the drivers.
> > https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
> >
> > We use our own NTP cluster to reduce clock drift as much as
> > possible, but public NTP servers are good enough for most
> > uses.
> > https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
> >
> > Cheers,
> > Justin
> >
> > On Thu, 9 Feb 2017 at 16:09 Kant Kodali <k...@peernova.com> wrote:
> >
> > How does Cassandra achieve Linearizability with “Last write
> > wins” (conflict resolution methods based on time-of-day clocks) ?
> >
> > Relying on synchronized clocks are almost certainly
> > non-linearizable, because clock timestamps cannot be guaranteed
> > to be consistent with actual event ordering due to clock skew.
> > isn't it?
> >
> > Thanks!
> >
> > --
> > Justin Cameron
> > Senior Software Engineer | Instaclustr
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>
> One thing that always bothered me: Intelligent clients and dynamic snitch
> are designed to attempt to route requests to the same node to attempt to
> take advantage of cache pinning etc. You would think under these conditions
> one could naturally elect a "leader" for a "group" of keys that could
> persist for a few hundred milliseconds and batch up the round trips for a
> number of operations. Maybe that is what the distinguished coordinator is
> in some regards.
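
For reference, the proposer affinity Ariel describes in the thread (sort the live replicas by IP, hash the key, use the hash as an index into the list) could look roughly like the sketch below. It is an illustration under assumptions and not anything that exists in Cassandra: `pick_proposer`, its parameters, and the choice of SHA-256 are hypothetical, and replicas are assumed to be identified by plain IP address strings.

```python
import hashlib
import ipaddress


def pick_proposer(live_replicas, key):
    """Deterministically pick one live replica for a key.

    live_replicas: iterable of IP address strings (assumed format).
    key: the serialized key as bytes (assumed to be what gets hashed).
    """
    if not live_replicas:
        raise ValueError("no live replicas to choose from")
    # Sort numerically by IP so every coordinator derives the same order
    # no matter how its local replica list happens to be arranged.
    ordered = sorted(live_replicas, key=lambda ip: int(ipaddress.ip_address(ip)))
    # Use a stable hash; Python's built-in hash() is salted per process,
    # so two coordinators would not agree on it.
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]
```

Any coordinator that sees the same set of live replicas picks the same node for a given key, which is what reduces dueling proposers. When the set of live replicas changes, the choice can change too, so this only provides affinity; correctness still has to come from Paxos itself.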