On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote:
On Wed, Feb 04, 2015 at 04:30:32PM +0000, Matthew Booth wrote:
I've spent a few hours today reading about Galera, a clustering solution
for MySQL. Galera provides multi-master 'virtually synchronous'
replication between multiple mysql nodes. i.e. I can create a cluster of
3 mysql dbs and read and write from any of them with certain consistency
guarantees.

I am no expert[1], but this is a TL;DR of a couple of things which I
didn't know, but feel I should have done. The semantics are important to
application design, which is why we should all be aware of them.


* Commit will fail if there is a replication conflict

foo is a table with a single field, which is its primary key.

A: start transaction;
B: start transaction;
A: insert into foo values(1);
B: insert into foo values(1); <-- 'regular' DB would block here, and
                                   report an error on A's commit
A: commit; <-- success
B: commit; <-- KABOOM

Confusingly, Galera will report a 'deadlock' to node B, despite this not
being a deadlock by any definition I'm familiar with.

It is a failure to certify the writeset, which bubbles up as an InnoDB deadlock error. See my article here:

http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/

Which explains this.

Yes ! and if I can add more information and I hope I do not make
mistake I think it's a know issue which comes from MySQL, that is why
we have a decorator to do a retry and so handle this case here:

   
http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177

It's not an issue with MySQL. It's an issue with any database code that is highly contentious.

Almost all highly distributed or concurrent applications need to handle deadlock issues, and the most common way to handle deadlock issues on database records is using a retry technique. There's nothing new about that with Galera.

The issue with our use of the @_retry_on_deadlock decorator is *not* that the retry decorator is not needed, but rather it is used too frequently. The compare-and-swap technique I describe in the article above dramatically* reduces the number of deadlocks that occur (and need to be handled by the @_retry_on_deadlock decorator) and dramatically reduces the contention over critical database sections.

Best,
-jay

* My colleague Pavel Kholkin is putting together the results of a benchmark run that compares the compare-and-swap method with the raw @_retry_on_deadlock decorator method. Spoiler: the compare-and-swap method cuts the runtime of the benchmark by almost *half*.

Essentially, anywhere that a regular DB would block, Galera will not
block transactions on different nodes. Instead, it will cause one of the
transactions to fail on commit. This is still ACID, but the semantics
are quite different.

The impact of this is that code which makes correct use of locking may
still fail with a 'deadlock'. The solution to this is to either fail the
entire operation, or to re-execute the transaction and all its
associated code in the expectation that it won't fail next time.

As I understand it, these can be eliminated by sending all writes to a
single node, although that obviously makes less efficient use of your
cluster.


* Write followed by read on a different node can return stale data

During a commit, Galera replicates a transaction out to all other db
nodes. Due to its design, Galera knows these transactions will be
successfully committed to the remote node eventually[2], but it doesn't
commit them straight away. The remote node will check these outstanding
replication transactions for write conflicts on commit, but not for
read. This means that you can do:

A: start transaction;
A: insert into foo values(1)
A: commit;
B: select * from foo; <-- May not contain the value we inserted above[3]

This means that even for 'synchronous' slaves, if a client makes an RPC
call which writes a row to write master A, then another RPC call which
expects to read that row from synchronous slave node B, there's no
default guarantee that it'll be there.

Galera exposes a session variable which will fix this: wsrep_sync_wait
(or wsrep_causal_reads on older mysql). However, this isn't the default.
It presumably has a performance cost, but I don't know what it is, or
how it scales with various workloads.


Because these are semantic issues, they aren't things which can be
easily guarded with an if statement. We can't say:

if galera:
   try:
     commit
   except:
     rewind time

If we are to support this DB at all, we have to structure code in the
first place to allow for its semantics.

Matt

[1] No, really: I just read a bunch of docs and blogs today. If anybody
who is an expert would like to validate/correct that would be great.

[2]
http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/

[3]
http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/
--
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to