Hello,

In 0.9 CouchDB removed the transactional bulk docs feature in favour
of simplifying sharding/replication.

The priorities behind this decision as I understood them are:

    1. ensure that applications developed on a single server don't
suffer from a degradation of guarantees if deployed using sharding

    2. avoid the complexity involved in supporting transactional
semantics across shards and replicas


I apologize if this proposal has already been dismissed before. I did
search the mailing list archives, but mostly found pointers to an IRC
discussion about why this should not be done. I blame Jan for
encouraging me to post ;-)



So anyway, I think that we can have both features without needing to
implement something like Paxos, and without silently breaking apps
when they move from a single-machine setup to a sharded one.


The basic idea is to keep the conflicts-are-data approach, keeping the
current user visible replication and conflict resolution, but to allow
the user to ask for stricter conflict checking.

The API I propose is for the bulk docs operation to have an optional
'atomic' flag. When this flag is set, CouchDB would atomically verify
that all documents were committed without conflict (with respect to
the supplied _rev), and if any one document conflicts, mark all of
them as conflicting.
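
To make this concrete, here is a rough sketch of what such a request
might look like from a client. The 'atomic' field and the
all-or-nothing conflict handling are hypothetical, as is the example
database; only _bulk_docs itself exists today:

    import json
    import urllib.request

    # Hypothetical: POST a transaction to _bulk_docs with the proposed
    # 'atomic' flag. Each document carries the _rev it expects to replace.
    payload = json.dumps({
        "atomic": True,  # proposed flag, not part of CouchDB 0.9
        "docs": [
            {"_id": "account-a", "_rev": "1-abc", "balance": 90},
            {"_id": "account-b", "_rev": "4-def", "balance": 110},
        ],
    }).encode("utf-8")

    req = urllib.request.Request(
        "http://127.0.0.1:5984/bank/_bulk_docs",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

    # Under the proposal, if any one document's _rev is stale the whole
    # batch is stored as conflicts and none of the writes become winning
    # revisions; the client then resolves or retries as usual.
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))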

Transaction recovery, conflict resolution etc. remain the
responsibility of the application, but the flag provides an atomic
guarantee: a transaction that tries to write inconsistent data to the
database will fail as a whole. That is a guarantee that cannot be made
from a client library (there are race conditions).



Now the hard parts:


1. Replication

The way I understand it, replication currently works on a per-document
basis. If 'atomic' was specified in a bulk operation, I propose that
all the revisions created in that bulk operation be kept linked. When
these linked revisions are replicated, the same conflict checking must
be applied (the replication of the document itself is executed as a
bulk operation with atomic=true, replicating all associated documents
as well).
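
As a sketch of what that could mean for a replicator, where the lookup
of the linked group and the atomic flag are both hypothetical and the
two callables stand in for HTTP requests to the source and target
databases:

    # Hypothetical replication step: revisions written together with
    # atomic=true travel between replicas as one unit.
    def replicate_linked_group(fetch_linked_group, target_bulk_docs,
                               doc_id, rev):
        # Pull every revision that was written in the same atomic bulk
        # operation as (doc_id, rev), then push them as one atomic batch.
        group = fetch_linked_group(doc_id, rev)
        # On the target this either applies cleanly or marks the whole
        # group as conflicting; it is never applied partially.
        return target_bulk_docs(docs=group, atomic=True)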

The caveat is that even if you always use bulk docs with the atomic
flag, if you switch replicas you could lose the D out of ACID:
documents which are marked as non-conflicting in your replica might be
conflicting in the replica you switch to, in which case transactions
that have already been committed appear to be rolled back from the
application's point of view.

This problem obviously already exists in the current implementation,
but when 'atomic' is specified it could potentially happen a lot more
often.


2. Data sharding

This one is trickier. There are two solutions, both of which I think
are acceptable, and either or both of which could be used:


The easy way is to ignore this problem. Well, not really: the client
must ensure that all the documents affected by a single transaction
are in the same shard, by using a partitioning scheme that allows
this.

If a bulk_docs operation with atomic set to true would affect multiple
shards, that is an error (the data could still be written as a
conflict, of course).
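
A client-side guard along these lines would be enough to enforce that
rule before issuing the request. The shard count and the
hash-of-_id partitioning scheme are assumptions for illustration, not
anything CouchDB provides:

    import hashlib

    NUM_SHARDS = 4  # assumed cluster layout, purely for illustration

    def shard_of(doc_id):
        # Assume documents are partitioned by a hash of their _id.
        digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    def check_atomic_batch(docs):
        # Refuse to send an atomic bulk_docs call whose documents span
        # shards; the caller can still fall back to a non-atomic write.
        shards = {shard_of(doc["_id"]) for doc in docs}
        if len(shards) > 1:
            raise ValueError("atomic batch spans shards %s" % sorted(shards))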

If you want to enable the 'atomic' flag you'll need to be careful
about how you use sharding. You can still shard your data, but every
atomic transaction must be confined to a single shard. I think this is
a flexible and pragmatic solution.

This means that if you choose to opt in to fully atomic bulk doc
operations your app might not be deployable unmodified to a sharded
setup, but it's never unsafe (no data inconsistencies).

In my opinion this satisfies the requirement for no degradation of
guarantees. It might not Just Work, but at the end of the day you
can't have your cake and eat it too.




The second way is harder but potentially still interesting. I've
included it mostly for the sake of discussion.

The core idea is to provide low level primitives on top of which a
client or proxy can implement a multi phase commit protocol.

The number of nodes involved in the transaction depends on the data
in the transaction (it doesn't need to coordinate all the nodes in the
cluster).

Basically this would break bulk doc calls down into several steps.
First all the data is inserted into the backend, but it is marked as
conflicting so that it's not accidentally visible.

This operation returns an identifier for the bulk doc operation
(essentially a ticket for a prepared transaction).

Once the data is available on all the shards it must be made live
atomically. A two-phase commit starts by acquiring locks on all the
transaction tickets and trying to apply them (the 'promise' phase),
and then finalizing that application atomically (the 'accept' phase).

To keep things simple the two phases should be scoped to a single
keep-alive connection. If the connection drops, the locks should be
released.
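
The coordinator side of that protocol might look roughly like the
sketch below. Every name here is invented, and the
prepare/promise/accept/abort callables stand in for the per-shard
primitives described above rather than any existing CouchDB API:

    # Hypothetical coordinator for the proposed multi-shard atomic commit.
    def atomic_bulk_docs(docs_by_shard, prepare, promise, accept, abort):
        # Step 1: write the data to each shard as conflicting (hidden)
        # revisions and collect a transaction ticket per shard.
        tickets = {shard: prepare(shard, docs)
                   for shard, docs in docs_by_shard.items()}
        try:
            # Step 2 ('promise'): lock every ticket and check that the
            # revisions can be made live without conflict.
            for shard, ticket in tickets.items():
                if not promise(shard, ticket):
                    raise RuntimeError("shard %r refused" % (shard,))
            # Step 3 ('accept'): flip all tickets live.
            for shard, ticket in tickets.items():
                accept(shard, ticket)
        except Exception:
            # On any failure (or a dropped connection) release the locks;
            # the data remains visible only as conflicts.
            for shard, ticket in tickets.items():
                abort(shard, ticket)
            raise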

Obviously Paxos ensues, but here's the catch:

 - The synchronization can be implemented first as a 3rd party
component; it doesn't need to affect CouchDB's core

 - The atomic primitives are also useful for writing safe conflict
resolution tools that handle conflicts that span multiple documents.

So even if no one ends up implementing real Multi-Paxos, CouchDB still
benefits from having reusable synchronization primitives. (If this is
interesting to anyone, see below [1].)




I'd like to stress that this is all possible to do on top of the
current 0.9 semantics. The default behavior in 0.9 does not change at
all. You have to opt in to this more complex behavior.

The problem with 0.9 is that there is no way to ensure atomicity and
isolation from a client library; it must be done on the server, so by
removing the ability to do this at all, CouchDB is essentially no
longer transactional. It's protected from internal data corruption,
and it's protected from data loss (unlike, say, MyISAM, which will
happily overwrite your correct data), but it's still a potentially
lossy model since conflicting revisions are not correlated. This means
that you cannot have a transactional graph model; it's one or the
other.



Finally, you should know that I have no personal stake in this. I
don't rely on CouchDB (yet), but I think it's somewhat relevant for a
project I work on, and that the requirements for this project are not
that far fetched. I'm the author of an object graph storage engine for
Perl called KiokuDB. It serializes every object to a document unless
told otherwise, but the data is still a highly interconnected graph. As
a user of this system I care a lot about transactions, but not at all
about sharding (this might not hold for all the users of KiokuDB).
Basically I already have what I need from KiokuDB; there are numerous
backends for this system that satisfy me (BerkeleyDB, a transactional
plain file backend, DBI (PostgreSQL, SQLite, MySQL)), and a number
that don't fit my personal needs due to lack of transactions
(SimpleDB, MongoDB, and CouchDB since 0.9).

If things stay this way, then CouchDB is simply not intended for
users like me (though I'll certainly still maintain
KiokuDB::Backend::CouchDB).

However, I do *want* to use CouchDB. I think that under many scenarios
it has clear advantages compared to the other backends (mostly the
fact that it's so easy, but the views support is also nice). I think
it'd be a shame if the thing preventing me turned out to be a fix that
was low-hanging fruit to which no one objected.


Regards,
Yuval



[1] In Paxos terms the CouchDB shards would play the Acceptor role and
the client (be it the actual client or a sharding proxy, whatever
delegates and consolidates the views) performs the Learner role. Only
the Learner is considered authoritative with respect to the final
status of a transaction.

Hard crashes of a shard during the 'accept' phase may produce
inconsistent results if more than one Learner is used to proxy write
operations. Focusing on this scenario is a *HUGE* pessimization. High
availability of the Learner role can still be achieved using
BerkeleyDB style master failover (voting).

This transactional sharding proxy could of course also guarantee
redundancy of shards.

My point is that Paxos can be implemented as a 3rd party component if
anyone actually wants/needs it, by providing comparatively simple
primitives.
