Hi Angus and everyone,

I would like to reply to a couple of points:
- The behavior of overlapping transactions depends on the
transaction isolation level, even in the case of a single server,
for any database. This was pointed out by others earlier as well.

- The deadlock error from Galera can be confusing, but the point is
that the application can actually treat this as a deadlock (or apply
any kind of retry logic it would apply to a failed transaction). I
don't know whether it would be less confusing from the developer's
point of view if it said "brute force abort" instead. Transactions
can fail in any database; in the initial example, the transaction
will fail with a duplicate key error. The result is pretty much the
same from the application's perspective: the transaction was not
successful (it failed as a block), and the application should handle
the failure. There can be many more reasons for a transaction to fail
regardless of the database engine. Some of these failures are
persistent (for example, the disk is full underneath the database),
and some are intermittent in nature, like the case above. A good
retry mechanism can handle the intermittent failures, depending on
the application logic.
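
The retry logic described above can be sketched in Python. This is a
minimal illustration, not tied to any particular MySQL driver; the
DeadlockError class and run pattern are hypothetical stand-ins for
whatever exception your driver raises on a deadlock (e.g. MySQL error
1213):

```python
import random
import time

class DeadlockError(Exception):
    """Stand-in for a driver's deadlock/conflict error (hypothetical)."""

def with_retries(txn, max_attempts=5, base_delay=0.05):
    """Run a transaction callable, retrying on intermittent deadlock errors.

    Persistent failures (any other exception) propagate immediately;
    only the intermittent deadlock case is retried as a whole block.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DeadlockError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter before retrying the block.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Example: a transaction that fails twice with a deadlock, then succeeds.
attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise DeadlockError("deadlock found when trying to get lock")
    return "committed"

result = with_retries(flaky_txn)
```

The important property is that the whole transaction block is re-run,
not individual statements, which matches how the failure surfaces from
Galera.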

- As many others have said before me, consistent reads can be achieved
with wsrep_causal_reads set on in the session. I can shed some light
on how this works. Nodes in Galera participate in group
communication, and a global order of the transactions is established
as part of this. Since the global order of the transactions is known,
a session with wsrep_causal_reads on will put a "marker" in the local
replication queue. Because transaction ordering is global, the session
is simply blocked until all the other transactions before that marker
are processed in the replication queue. So, setting
wsrep_causal_reads imposes additional latency only for the given
select we are using it on (it literally just waits for the queue to be
processed up to the current transaction). Because of this, manual
checking of the global transaction IDs is not necessary.
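
A toy model of that marker mechanism, to make the idea concrete (this
is an illustration of the concept, not Galera's actual implementation;
all names here are made up):

```python
from collections import deque

class ReplicationQueue:
    """Toy model of a node's local replication queue of globally
    ordered write sets."""

    def __init__(self):
        self.queue = deque()   # pending write sets, in global order
        self.applied = []      # write sets already applied locally

    def enqueue(self, writeset):
        self.queue.append(writeset)

    def apply_next(self):
        self.applied.append(self.queue.popleft())

    def causal_read(self, read_fn):
        """wsrep_causal_reads-style read: note the marker position,
        drain everything queued before it, then perform the read."""
        marker = len(self.queue)   # the "marker" in the local queue
        while marker > 0:          # block until the queue is processed
            self.apply_next()      # up to the marker
            marker -= 1
        return read_fn()

node = ReplicationQueue()
node.enqueue("txn-1")
node.enqueue("txn-2")
# The causal read waits for txn-1 and txn-2 to be applied first.
value = node.causal_read(lambda: node.applied[-1])
```

The latency cost is exactly the time to drain the queue up to the
marker, which is why it only affects the session that asked for it.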

- On synchronous replication: Galera only transmits the data
synchronously; it doesn't do synchronous apply. A transaction is sent
in parallel to the rest of the cluster nodes (to be accurate, it's
only sent to the nodes that are in the same group segment, but it
waits until all the group segments get the data). Once the other nodes
have received it, the transaction commits locally; the others will
apply it later. The cluster can do this because of certification, and
because certification is deterministic (the result of the
certification will be the same on all nodes; otherwise the nodes have
a different state, for example one of them was written locally). The
replication uses write sets, which are practically row-based MySQL
binary log events plus some metadata. The metadata is good for two
things: you can look at two write sets and tell whether they conflict,
and you can decide whether a write set is applicable to a database.
Because this is checked at certification time, the apply part can be
parallel (thanks to the certification, it's guaranteed that the
transactions are not conflicting). When it comes to consistency and
replication speed, there are no miracles, only tradeoffs. Two-phase
commit is relatively slow, and distributed locking is relatively slow;
this is a lot faster, but the application should handle transaction
failures (which it should probably handle anyway).
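
The deterministic certification check can be illustrated roughly like
this (a heavy simplification; real write sets carry row keys extracted
from the binlog events, and the names below are invented for the
sketch):

```python
class WriteSet:
    """Simplified write set: a global sequence number, the set of row
    keys the transaction touched, and the row-based events."""

    def __init__(self, seqno, keys, events):
        self.seqno = seqno
        self.keys = frozenset(keys)
        self.events = events

def conflicts(ws_a, ws_b):
    """Two write sets conflict if they touched any of the same rows."""
    return bool(ws_a.keys & ws_b.keys)

def certify(incoming, recent):
    """Deterministic: the same inputs give the same verdict on every
    node, so each node certifies locally with no extra round trips."""
    return all(not conflicts(incoming, ws) for ws in recent)

a = WriteSet(1, {"t1:pk=1"}, ["row event ..."])
b = WriteSet(2, {"t1:pk=2"}, ["row event ..."])
c = WriteSet(3, {"t1:pk=1"}, ["row event ..."])

certify(b, [a])   # True: no overlap, so b can be applied in parallel
certify(c, [a])   # False: c touches the same row as a
```

Because non-conflicting write sets are identified up front, the apply
step can safely run them in parallel, which is where the speed
advantage over two-phase commit comes from.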

Here is the xtradb cluster documentation (Percona Server with galera):

Here is the multi-master replication part of the documentation:

On Fri, Feb 6, 2015 at 3:36 AM, Angus Lees <g...@inodes.org> wrote:
> On Fri Feb 06 2015 at 12:59:13 PM Gregory Haynes <g...@greghaynes.net>
> wrote:
>> Excerpts from Joshua Harlow's message of 2015-02-06 01:26:25 +0000:
>> > Angus Lees wrote:
>> > > On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum <cl...@fewbar.com
>> > > <mailto:cl...@fewbar.com>> wrote:
>> > >     I'd also like to see consideration given to systems that handle
>> > >     distributed consistency in a more active manner. etcd and
>> > > Zookeeper are
>> > >     both such systems, and might serve as efficient guards for
>> > > critical
>> > >     sections without raising latency.
>> > >
>> > >
>> > > +1 for moving to such systems.  Then we can have a repeat of the above
>> > > conversation without the added complications of SQL semantics ;)
>> > >
>> >
>> > So just an fyi:
>> >
>> > http://docs.openstack.org/developer/tooz/ exists.
>> >
>> > Specifically:
>> >
>> >
>> > http://docs.openstack.org/developer/tooz/developers.html#tooz.coordination.CoordinationDriver.get_lock
>> >
>> > It has a locking api that it provides (that plugs into the various
>> > backends); there is also a WIP https://review.openstack.org/#/c/151463/
>> > driver that is being worked for etc.d.
>> >
>> An interesting note about the etcd implementation is that you can
>> select per-request whether you want to wait for quorum on a read or not.
>> This means that in theory you could obtain higher throughput for most
>> operations which do not require this and then only gain quorum for
>> operations which require it (e.g. locks).
> Along those lines and in an effort to be a bit less doom-and-gloom, I spent
> my lunch break trying to find non-marketing documentation on the Galera
> replication protocol and how it is exposed. (It was surprisingly difficult
> to find such information *)
> It's easy to get the transaction ID of the last commit
> (wsrep_last_committed), but I can't find a way to wait until at least a
> particular transaction ID has been synced.  If we can find that latter
> functionality, then we can expose that sequencer all the way through (HTTP
> header?) and then any follow-on commands can mention the sequencer of the
> previous write command that they really need to see the effects of.
> In practice, this should lead to zero additional wait time, since the Galera
> replication has almost certainly already caught up by the time the second
> command comes in - and we can just read from the local server with no
> additional delay.
> See the various *Index variables in the etcd API, for how the same idea gets
> used there.
>  - Gus
> (*) In case you're also curious, the only doc I found with any details was
> http://galeracluster.com/documentation-webpages/certificationbasedreplication.html
> and its sibling pages.
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Peter Boros, Principal Architect, Percona
Telephone: +1 888 401 3401 ext 546
Emergency: +1 888 401 3401 ext 911
Skype: percona.pboros
