Re: reiterating transactions vs. replication

2009-05-25 Thread Randall Leeds
On Mon, May 25, 2009 at 01:44, Scott Shumaker  wrote:

> If that's the case, it does explain why there are some decisions in
> CouchDB I find strange, like the fact that authentication runs during
> replication.  You're using replication for a completely different
> purpose than I need it for - I need it for redundancy and read
> scaling, you're using it to synchronize disparate data sources.  Very
> different problems that call for very different solutions.


I called out this distinction between usages in my message above.
If the replication functionality were "directed" or overseen by the
sharding/partitioning code in the data center deployment, I think we can
totally prevent replication conflicts *caused by* the partitioning code.
However, replicating to/from the data center could still create conflicts in
the normal way.


Re: reiterating transactions vs. replication

2009-05-25 Thread Yuval Kogman
2009/5/25 Antony Blakey :

> There is another solution to this problem that I posted to this list about
> 16 March 2009. In short:

Personally, at that point I would reassess my usage of CouchDB
altogether; it just doesn't seem appropriate anymore. Whether or not
CouchDB should be changed to fit this requirement is obviously the
topic of discussion, but in the short term I would probably give up
and try something else. This really reminds me of nested set or
polymorphism hacks in SQL.


Re: reiterating transactions vs. replication

2009-05-25 Thread Yuval Kogman
2009/5/25 Chris Anderson :

> Right, but at least in those cases I can diff the document to figure
> out what the conflict is. The bulk transactions you describe could
> become conflicting, and before I could save the doc I'm working on, I'd
> have to figure out which other doc was causing the conflict.

The idea is that you can ask for a conflict earlier if that's what's
going to help. If you need more context then arguably there should be
a way to ask for it to not be thrown away.

If you have a conflict in the aforementioned sorted collections and/or
graph nodes you need to know the state of other documents to merge
correctly.

I still think this is out of scope anyway. For a proper merge you'd
want the common ancestor documents as well, not just the symmetric
difference. This is best done by implementing a versioning model on
top of couchdb. But to implement this model consistently you arguably
still need atomic primitives.

> I'm not sure
> why you wouldn't just call the larger unit of data a single document, if
> that's how you want to use it.

So basically instead of using multiple documents all of my data would
go in one document? Why didn't I think of that ;-)


Re: reiterating transactions vs. replication

2009-05-24 Thread Chris Anderson
On Sun, May 24, 2009 at 10:44 PM, Scott Shumaker  wrote:
> Chris - you said:
> "Distributed bulk transactions would make for chaotic behavior, as
> someone's mostly unrelated change on a remote node could eventually
> replicate to me (months later) and knock an entire line of work that
> I've done into a conflict state."
>
> Months later?  If you're really out of sync with a remote node for
> several months, you should expect to get lots of conflicts if people
> are editing your shared documents, with or without distributed bulk
> transactions.

Right, but at least in those cases I can diff the document to figure
out what the conflict is. The bulk transactions you describe could
become conflicting, and before I could save the doc I'm working on, I'd
have to figure out which other doc was causing the conflict. I'm not sure
why you wouldn't just call the larger unit of data a single document, if
that's how you want to use it.

>
> And if you really have a system designed to be offline for several
> months, you're solving very different problems (albeit interesting
> ones) than traditional data 'replication' as most people think about
> it (when one entity controls the servers, even if they are
> geographically distributed) - and getting into DVCS-like territory.

The idea is to support offline access to distributed data across all
devices. Some devices are very small (like a phone) and others are
very large (like a datacenter). It is our aim to provide a unified API
across all of them.


>
> If that's the case, it does explain why there are some decisions in
> CouchDB I find strange, like the fact that authentication runs during
> replication.

Auth during replication has to happen because, as far as CouchDB is
concerned, replication is just another client. Having the same API for
usage and replication is part of what keeps CouchDB simple.

> You're using replication for a completely different
> purpose than I need it for - I need it for redundancy and read
> scaling, you're using it to synchronize disparate data sources.  Very
> different problems that call for very different solutions.
>
>
> On Sun, May 24, 2009 at 10:31 PM, Scott Shumaker  wrote:
>> Inter-document dependencies come up pretty quickly when you start
>> trying to represent complex data structures in CouchDB.  There are
>> still a few cases we've encountered where there isn't a great way to
>> avoid needing transactions.  A few examples:
>>
>> 1)
>> 'A' maintains an ordered list of 'B' elements, where the order is
>> user-defined - imagine allowing a user to re-order photos in a
>> slideshow.  You want to store the photos separately from the
>> slideshow, because they can be used in multiple slideshows.  Whenever
>> you add or remove a photo, you need to update the order list as well.
>>
>> I've seen some people suggest some sort of gross solution where you
>> try to store floating point order id's inside the B elements and
>> change that to wedge an item in between other items (averaging the
>> two new siblings' orders), but this is incredibly brittle and breaks
>> down with enough re-ordering.
>>
>> Another unpleasant approach is to create separate 'order objects' in
>> couchdb (representing the order of an item within a folder), storing
>> an internal version number (timestamp) inside the 'order object' - so
>> you never change the order node, you just create a new node.  Then,
>> you only use the 'latest' version of this order node (either on
>> the client side or with a reduce).  To make this work, you need to
>> ensure that your 'internal version numbers' are monotonically
>> increasing.  This isn't a problem for some applications, and can be
>> solved in general with a specialized 'number server'.
>>
>> 2)
>> Representing graph/linked-list datastructures.
>>
>> If you delete a node from a linked list, you need to update two nodes
>> - the previous node and the node itself.  You can try the second
>> suggestion in the previous item to make this work (effectively, store
>> the link relationship as separate objects and generate new link
>> objects with incrementing version numbers)
>>
>> I'm sure there are other cases - these two just have been a thorn in
>> our side.  But for a lot of non-trivial data applications,
>> transactions still end up being very useful.
>>
>> On Fri, May 22, 2009 at 2:45 PM, Nathan Stott  wrote:
>>> As a user, when I chose couchdb for my most recent project, I chose it
>>> because I didn't care about transactions.  I would've used RDBMS if that
>>> were important.
>>> I chose it because couch solved the problems I needed solved very well.
>>>
>>> I don't think transactions should be a big dev focus.
>>>
>>> On Fri, May 22, 2009 at 4:30 PM, Chris Anderson  wrote:
>>>
 On Thu, May 21, 2009 at 8:30 PM, Yuval Kogman 
 wrote:
 > 2009/5/21 Adam Kocoloski :
 >> Hi Yuval, thanks for this well-written proposal.  I don't really want to
 >> rehash all the discussion from back in February (see the thread
 be

Re: reiterating transactions vs. replication

2009-05-24 Thread Antony Blakey


On 25/05/2009, at 3:01 PM, Scott Shumaker wrote:


'A' maintains an ordered list of 'B' elements, where the order is
user-defined - imagine allowing a user to re-order photos in a
slideshow.  You want to store the photos separately from the
slideshow, because they can be used in multiple slideshows.  Whenever
you add or remove a photo, you need to update the order list as well.

I've seen some people suggest some sort of gross solution where you
try to store floating point order id's inside the B elements and
change that to wedge an item in between other items (averaging the
two new siblings' orders), but this is incredibly brittle and breaks
down with enough re-ordering.


There is another solution to this problem that I posted to this list  
about 16 March 2009. In short:


The general solution is to treat (in the abstract) the ordering of items
as the in-order traversal of a binary tree. The brief form of the
algorithm is to record in each item its path from the root, two bits per
step, e.g. 10 = termination, 01 = left, 11 = right. You then map that bit
sequence, padded with 0, to an encoded form that preserves the
ordering. Avoiding unbounded key length requires a balanced tree, which
requires transactional support. It has the benefit of touching a low
number of documents per update (in an amortized sense).


By using 01 = left, 10 = termination, 11 = right, the length of the  
bit string becomes implicit (self-terminating) i.e. every pair  
contains a 1, and thus a function to compute an intermediate value  
given two *byte* sequences is easy. In practice you need to know the  
two adjacent values in order to avoid collisions, but you don't need  
to write to those documents.


This isn't brittle and won't break down, but as I say the insertion  
keys are unbounded.




I maintain a couchdb fork with transactional bulk doc commits, but  
they are really only useful to avoid requiring a merge interface for  
local operations. CouchDB replication still generates conflicts. If,
however, you use a single write master, then transactional bulk doc
commits can eliminate all conflicts.


Antony Blakey
-
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

It's amazing that the one side of the conversation that survived is "I  
don't know art, but I know what I like". The reply from the artist was  
"Madam, so does a cow".

  -- Carl Kirkendall




Re: reiterating transactions vs. replication

2009-05-24 Thread Paul Davis
On Mon, May 25, 2009 at 1:44 AM, Scott Shumaker  wrote:
> Chris - you said:
> "Distributed bulk transactions would make for chaotic behavior, as
> someone's mostly unrelated change on a remote node could eventually
> replicate to me (months later) and knock an entire line of work that
> I've done into a conflict state."
>
> Months later?  If you're really out of sync with a remote node for
> several months, you should expect to get lots of conflicts if people
> are editing your shared documents, with or without distributed bulk
> transactions.
>

I'm still mulling over Chris's concerns. My gut-brain tells me there's
a way around this but the idea hasn't floated up to the same brain
that controls actual thoughts.

> And if you really have a system designed to be offline for several
> months, you're solving very different problems (albeit interesting
> ones) than traditional data 'replication' as most people think about
> it (when one entity controls the servers, even if they are
> geographically distributed) - and getting into DVCS-like territory.
>

Heh. I think even in DVCS territory there'd still probably be a murder
if a dev didn't push for a couple months. Which is part of the idea
floating around. I.e., perhaps having a method for rejecting replication
that is too far out of whack (think of git's "You can't push because
it's not a fast-forward"). (This is all extreme hand waving, so no one
take that too seriously.)

> If that's the case, it does explain why there are some decisions in
> CouchDB I find strange, like the fact that authentication runs during
> replication.  You're using replication for a completely different
> purpose than I need it for - I need it for redundancy and read
> scaling, you're using it to synchronize disparate data sources.  Very
> different problems that call for very different solutions.
>


CouchDB has always had a focus on being able to be run in a completely
decentralized manner. As such there are features in CouchDB that
support this model. That said, it's also a goal of supporting as wide
a range of models as possible that fit into the general scheme of
things. For instance, if replication validation is something you want
to have configurable then you can create a ticket and dev@ thread
discussing the issue. I don't detect any issues in having that
disable-able but we should discuss that on a dedicated thread for
future-me's sanity.

HTH,
Paul Davis


> On Sun, May 24, 2009 at 10:31 PM, Scott Shumaker  wrote:
>> Inter-document dependencies come up pretty quickly when you start
>> trying to represent complex data structures in CouchDB.  There are
>> still a few cases we've encountered where there isn't a great way to
>> avoid needing transactions.  A few examples:
>>
>> 1)
>> 'A' maintains an ordered list of 'B' elements, where the order is
>> user-defined - imagine allowing a user to re-order photos in a
>> slideshow.  You want to store the photos separately from the
>> slideshow, because they can be used in multiple slideshows.  Whenever
>> you add or remove a photo, you need to update the order list as well.
>>
>> I've seen some people suggest some sort of gross solution where you
>> try to store floating point order id's inside the B elements and
>> change that to wedge an item in between other items (averaging the
>> two new siblings' orders), but this is incredibly brittle and breaks
>> down with enough re-ordering.
>>
>> Another unpleasant approach is to create separate 'order objects' in
>> couchdb (representing the order of an item within a folder), storing
>> an internal version number (timestamp) inside the 'order object' - so
>> you never change the order node, you just create a new node.  Then,
>> you only use the 'latest' version of this order node (either on
>> the client side or with a reduce).  To make this work, you need to
>> ensure that your 'internal version numbers' are monotonically
>> increasing.  This isn't a problem for some applications, and can be
>> solved in general with a specialized 'number server'.
>>
>> 2)
>> Representing graph/linked-list datastructures.
>>
>> If you delete a node from a linked list, you need to update two nodes
>> - the previous node and the node itself.  You can try the second
>> suggestion in the previous item to make this work (effectively, store
>> the link relationship as separate objects and generate new link
>> objects with incrementing version numbers)
>>
>> I'm sure there are other cases - these two just have been a thorn in
>> our side.  But for a lot of non-trivial data applications,
>> transactions still end up being very useful.
>>
>> On Fri, May 22, 2009 at 2:45 PM, Nathan Stott  wrote:
>>> As a user, when I chose couchdb for my most recent project, I chose it
>>> because I didn't care about transactions.  I would've used RDBMS if that
>>> were important.
>>> I chose it because couch solved the problems I needed solved very well.
>>>
>>> I don't think transactions should be a big dev focus.
>>>
>>> On 

Re: reiterating transactions vs. replication

2009-05-24 Thread Scott Shumaker
Chris - you said:
"Distributed bulk transactions would make for chaotic behavior, as
someone's mostly unrelated change on a remote node could eventually
replicate to me (months later) and knock an entire line of work that
I've done into a conflict state."

Months later?  If you're really out of sync with a remote node for
several months, you should expect to get lots of conflicts if people
are editing your shared documents, with or without distributed bulk
transactions.

And if you really have a system designed to be offline for several
months, you're solving very different problems (albeit interesting
ones) than traditional data 'replication' as most people think about
it (when one entity controls the servers, even if they are
geographically distributed) - and getting into DVCS-like territory.

If that's the case, it does explain why there are some decisions in
CouchDB I find strange, like the fact that authentication runs during
replication.  You're using replication for a completely different
purpose than I need it for - I need it for redundancy and read
scaling, you're using it to synchronize disparate data sources.  Very
different problems that call for very different solutions.


On Sun, May 24, 2009 at 10:31 PM, Scott Shumaker  wrote:
> Inter-document dependencies come up pretty quickly when you start
> trying to represent complex data structures in CouchDB.  There are
> still a few cases we've encountered where there isn't a great way to
> avoid needing transactions.  A few examples:
>
> 1)
> 'A' maintains an ordered list of 'B' elements, where the order is
> user-defined - imagine allowing a user to re-order photos in a
> slideshow.  You want to store the photos separately from the
> slideshow, because they can be used in multiple slideshows.  Whenever
> you add or remove a photo, you need to update the order list as well.
>
> I've seen some people suggest some sort of gross solution where you
> try to store floating point order id's inside the B elements and
> change that to wedge an item in between other items (averaging the
> two new siblings' orders), but this is incredibly brittle and breaks
> down with enough re-ordering.
>
> Another unpleasant approach is to create separate 'order objects' in
> couchdb (representing the order of an item within a folder), storing
> an internal version number (timestamp) inside the 'order object' - so
> you never change the order node, you just create a new node.  Then,
> you only use the 'latest' version of this order node (either on
> the client side or with a reduce).  To make this work, you need to
> ensure that your 'internal version numbers' are monotonically
> increasing.  This isn't a problem for some applications, and can be
> solved in general with a specialized 'number server'.
>
> 2)
> Representing graph/linked-list datastructures.
>
> If you delete a node from a linked list, you need to update two nodes
> - the previous node and the node itself.  You can try the second
> suggestion in the previous item to make this work (effectively, store
> the link relationship as separate objects and generate new link
> objects with incrementing version numbers)
>
> I'm sure there are other cases - these two just have been a thorn in
> our side.  But for a lot of non-trivial data applications,
> transactions still end up being very useful.
>
> On Fri, May 22, 2009 at 2:45 PM, Nathan Stott  wrote:
>> As a user, when I chose couchdb for my most recent project, I chose it
>> because I didn't care about transactions.  I would've used RDBMS if that
>> were important.
>> I chose it because couch solved the problems I needed solved very well.
>>
>> I don't think transactions should be a big dev focus.
>>
>> On Fri, May 22, 2009 at 4:30 PM, Chris Anderson  wrote:
>>
>>> On Thu, May 21, 2009 at 8:30 PM, Yuval Kogman 
>>> wrote:
>>> > 2009/5/21 Adam Kocoloski :
>>> >> Hi Yuval, thanks for this well-written proposal.  I don't really want to
>>> >> rehash all the discussion from back in February (see the thread
>>> beginning at
>>> >>
>>> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84f66023-030a-4669-b75c-3dcc92d71...@yahoo.com%3e
>>>  for
>>> >> a particularly detailed discussion), but I do want to comment on one
>>> aspect.
>>> >>
>>> >> Updating the replicator to be smart about atomic bulk transactions is
>>> doable
>>> >> (although a major undertaking), but when you throw DB compaction and
>>> >> revision stemming into the mix things get really hairy.  Recall that
>>> CouchDB
>>> >> revisions are used for concurrency control, not for maintaining history.
>>> >>  Consider the following sequence of events:
>>> >>
>>> >> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
>>> >> 2) Update foo -> foo/2
>>> >> Compact the DB (foo/1 is deleted)
>>> >> Start replicating to a mirror
>>> >> Replication crashes before it reaches foo/2
>>> >
>>> > By crash you mean an error due to a conflict between foo/2 and foo/1'
>>> > (the mirror's ver

Re: reiterating transactions vs. replication

2009-05-24 Thread Scott Shumaker
Inter-document dependencies come up pretty quickly when you start
trying to represent complex data structures in CouchDB.  There are
still a few cases we've encountered where there isn't a great way to
avoid needing transactions.  A few examples:

1)
'A' maintains an ordered list of 'B' elements, where the order is
user-defined - imagine allowing a user to re-order photos in a
slideshow.  You want to store the photos separately from the
slideshow, because they can be used in multiple slideshows.  Whenever
you add or remove a photo, you need to update the order list as well.

I've seen some people suggest some sort of gross solution where you
try to store floating point order id's inside the B elements and
change that to wedge an item in between other items (averaging the
two new siblings' orders), but this is incredibly brittle and breaks
down with enough re-ordering.
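To make the brittleness concrete, here is a small Python sketch (not from the thread) of the averaging scheme; repeatedly wedging items into the same gap exhausts double precision after roughly fifty insertions:

    # Each B element carries an `order` float; wedging an item between two
    # neighbours averages their values.
    def wedge(left_order, right_order):
        return (left_order + right_order) / 2.0

    lo, hi = 1.0, 2.0
    steps = 0
    while True:
        mid = wedge(lo, hi)
        if mid == lo or mid == hi:  # precision exhausted: the "new" slot collides
            break
        hi = mid                    # keep inserting into the same gap
        steps += 1
    print(steps)  # ~52: repeated insertion into one gap breaks the ordering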

Another unpleasant approach is to create separate 'order objects' in
couchdb (representing the order of an item within a folder), storing
an internal version number (timestamp) inside the 'order object' - so
you never change the order node, you just create a new node.  Then,
you only use the 'latest' version of this order node (either on
the client side or with a reduce).  To make this work, you need to
ensure that your 'internal version numbers' are monotonically
increasing.  This isn't a problem for some applications, and can be
solved in general with a specialized 'number server'.
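A rough Python sketch of the 'order object' shape and the client-side "latest version wins" selection described above (all field names are hypothetical):

    # Order docs are never modified; a reorder writes a new doc with a higher
    # version, and readers keep only the newest doc per photo. The `version`
    # field must be monotonically increasing (e.g. from a "number server").
    order_docs = [
        {"slideshow": "s1", "photo": "p1", "position": 0, "version": 7},
        {"slideshow": "s1", "photo": "p2", "position": 1, "version": 7},
        {"slideshow": "s1", "photo": "p1", "position": 1, "version": 9},
        {"slideshow": "s1", "photo": "p2", "position": 0, "version": 9},
    ]

    def current_order(docs, slideshow):
        """Client-side equivalent of a 'latest version wins' reduce."""
        latest = {}
        for d in docs:
            if d["slideshow"] != slideshow:
                continue
            prev = latest.get(d["photo"])
            if prev is None or d["version"] > prev["version"]:
                latest[d["photo"]] = d
        return [d["photo"] for d in sorted(latest.values(), key=lambda d: d["position"])]

    print(current_order(order_docs, "s1"))  # ['p2', 'p1']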

2)
Representing graph/linked-list datastructures.

If you delete a node from a linked list, you need to update two nodes
- the previous node and the node itself.  You can try the second
suggestion in the previous item to make this work (effectively, store
the link relationship as separate objects and generate new link
objects with incrementing version numbers)
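And a tiny sketch of the linked-list case (document shapes are hypothetical): deleting B requires two separate writes, and without a transaction a failure between them leaves a dangling link.

    docs = {
        "A": {"_id": "A", "next": "B"},
        "B": {"_id": "B", "next": "C"},
        "C": {"_id": "C", "next": None},
    }

    def delete_node(store, node_id):
        prev = next(d for d in store.values() if d.get("next") == node_id)
        prev["next"] = store[node_id]["next"]  # write #1: repoint the predecessor
        del store[node_id]                     # write #2: remove the node itself

    delete_node(docs, "B")
    print(docs["A"]["next"])  # 'C'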

I'm sure there are other cases - these two just have been a thorn in
our side.  But for a lot of non-trivial data applications,
transactions still end up being very useful.

On Fri, May 22, 2009 at 2:45 PM, Nathan Stott  wrote:
> As a user, when I chose couchdb for my most recent project, I chose it
> because I didn't care about transactions.  I would've used RDBMS if that
> were important.
> I chose it because couch solved the problems I needed solved very well.
>
> I don't think transactions should be a big dev focus.
>
> On Fri, May 22, 2009 at 4:30 PM, Chris Anderson  wrote:
>
>> On Thu, May 21, 2009 at 8:30 PM, Yuval Kogman 
>> wrote:
>> > 2009/5/21 Adam Kocoloski :
>> >> Hi Yuval, thanks for this well-written proposal.  I don't really want to
>> >> rehash all the discussion from back in February (see the thread
>> beginning at
>> >>
>> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84f66023-030a-4669-b75c-3dcc92d71...@yahoo.com%3e
>>  for
>> >> a particularly detailed discussion), but I do want to comment on one
>> aspect.
>> >>
>> >> Updating the replicator to be smart about atomic bulk transactions is
>> doable
>> >> (although a major undertaking), but when you throw DB compaction and
>> >> revision stemming into the mix things get really hairy.  Recall that
>> CouchDB
>> >> revisions are used for concurrency control, not for maintaining history.
>> >>  Consider the following sequence of events:
>> >>
>> >> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
>> >> 2) Update foo -> foo/2
>> >> Compact the DB (foo/1 is deleted)
>> >> Start replicating to a mirror
>> >> Replication crashes before it reaches foo/2
>> >
>> > By crash you mean an error due to a conflict between foo/2 and foo/1'
>> > (the mirror's version of foo), right?
>> >
>> >> In your proposal, we should expect foo/1 to exist on the mirror, right?
>>  I
>> >> think this means we'd need to modify the compaction algorithm to keep
>> >> revisions of documents if a) the revision was part of an atomic
>> _bulk_docs,
>> >> and b) any of the documents in that transaction are still at the
>> revision
>> >> generated by the transaction.  Same thing goes for revision stemming --
>> we
>> >> can never drop revisions if they were part of an atomic upload and at
>> least
>> >> one of the document revs in the upload is still current.
>> >
>> > Yep. Personally I see this as a tradeoff, not a limitation per se. If
>> > you specify 'atomic' then you must pay more in terms of data size,
>> > performance, etc.
>>
>> The problem as I see it is that someone else's bulk transaction will
>> have to sit around in my database until I edit all the docs in it.
>> Hopefully I won't get any distributed conflicts on other old versions
>> of docs in the group, because that would make edits that I've done
>> locally to other documents in the bulk group somehow less valid.
>>
>> Distributed bulk transactions would make for chaotic behavior, as
>> someone's mostly unrelated change on a remote node could eventually
>> replicate to me (months later) and knock an entire line of work that
>> I've done into a conflict state.
>>
>> If

Re: reiterating transactions vs. replication

2009-05-22 Thread Nathan Stott
As a user, when I chose couchdb for my most recent project, I chose it
because I didn't care about transactions.  I would've used RDBMS if that
were important.
I chose it because couch solved the problems I needed solved very well.

I don't think transactions should be a big dev focus.

On Fri, May 22, 2009 at 4:30 PM, Chris Anderson  wrote:

> On Thu, May 21, 2009 at 8:30 PM, Yuval Kogman 
> wrote:
> > 2009/5/21 Adam Kocoloski :
> >> Hi Yuval, thanks for this well-written proposal.  I don't really want to
> >> rehash all the discussion from back in February (see the thread
> beginning at
> >>
> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84f66023-030a-4669-b75c-3dcc92d71...@yahoo.com%3e
>  for
> >> a particularly detailed discussion), but I do want to comment on one
> aspect.
> >>
> >> Updating the replicator to be smart about atomic bulk transactions is
> doable
> >> (although a major undertaking), but when you throw DB compaction and
> >> revision stemming into the mix things get really hairy.  Recall that
> CouchDB
> >> revisions are used for concurrency control, not for maintaining history.
> >>  Consider the following sequence of events:
> >>
> >> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
> >> 2) Update foo -> foo/2
> >> Compact the DB (foo/1 is deleted)
> >> Start replicating to a mirror
> >> Replication crashes before it reaches foo/2
> >
> > By crash you mean an error due to a conflict between foo/2 and foo/1'
> > (the mirror's version of foo), right?
> >
> >> In your proposal, we should expect foo/1 to exist on the mirror, right?
>  I
> >> think this means we'd need to modify the compaction algorithm to keep
> >> revisions of documents if a) the revision was part of an atomic
> _bulk_docs,
> >> and b) any of the documents in that transaction are still at the
> revision
> >> generated by the transaction.  Same thing goes for revision stemming --
> we
> >> can never drop revisions if they were part of an atomic upload and at
> least
> >> one of the document revs in the upload is still current.
> >
> > Yep. Personally I see this as a tradeoff, not a limitation per se. If
> > you specify 'atomic' then you must pay more in terms of data size,
> > performance, etc.
>
> The problem as I see it is that someone else's bulk transaction will
> have to sit around in my database until I edit all the docs in it.
> Hopefully I won't get any distributed conflicts on other old versions
> of docs in the group, because that would make edits that I've done
> locally to other documents in the bulk group somehow less valid.
>
> Distributed bulk transactions would make for chaotic behavior, as
> someone's mostly unrelated change on a remote node could eventually
> replicate to me (months later) and knock an entire line of work that
> I've done into a conflict state.
>
> If you want atomicity, put it in a single document.
>
> Chris
>
> >
> > In 0.8 you would have theoretically had to pay by default, but didn't
> > because replication broke transactions.
> >
> > The basic algorithm is still the same, but the garbage collected unit
> > is changed (instead of garbage collecting document revisions it
> > garbage collects revision sets, with the current case being a set with
> > one member. The rules still apply (if this object is wholly shadowed
> > by non conflicting changes then it can be disposed of)). IIRC the
> > algorithm is a copying garbage collector, so this is pretty easy to do
> > (you walk a DAG instead of a linked list).
> >
> > Under the proposed model you'd choose which operations are
> > transactional and will have to pay for those.
> >
> >
> > Anyway, thanks for your link as well, I was reading through a rather
> > boring thread and didn't see this one, so I guess I did miss out. It
> > seemed to imply the discussion was done only on IRC.
> >
> > Anyway, here goes...
> >
> > The fundamental problem is that any consistent data model needs at the
> > very least to have atomic primitives and ordered message passing (with
> > transactional message handlers) at the per-partition level, or
> > atomicity and consistency is restricted to a single document.
> >
> > What concerns me is Damien's post
> > (
> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872b8-152c-42a6-9324-dd52534d9...@apache.org%3e
> ):
> >
> >> No, CouchDB replication doesn't support replicating the transactions.
> >> Never has, never will. That's more like transaction log replication
> >> that's in traditional dbs, a different beast.
> >>
> >> For the new bulk transaction model, I'm only proposing supporting
> >> eventual consistency. All changes are safe to disk, but the db may not
> >> be in a consistent state right away.
> >
> > From what I know this assumption is wrong. Eventual consistency still
> > needs atomic primitives, it's not about whether or not you have
> > transactions, it's about what data they affect (eventual consistency
> > involves breaking them down).
> >
> > 

Re: reiterating transactions vs. replication

2009-05-22 Thread Chris Anderson
On Thu, May 21, 2009 at 8:30 PM, Yuval Kogman  wrote:
> 2009/5/21 Adam Kocoloski :
>> Hi Yuval, thanks for this well-written proposal.  I don't really want to
>> rehash all the discussion from back in February (see the thread beginning at
>> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84f66023-030a-4669-b75c-3dcc92d71...@yahoo.com%3e for
>> a particularly detailed discussion), but I do want to comment on one aspect.
>>
>> Updating the replicator to be smart about atomic bulk transactions is doable
>> (although a major undertaking), but when you throw DB compaction and
>> revision stemming into the mix things get really hairy.  Recall that CouchDB
>> revisions are used for concurrency control, not for maintaining history.
>>  Consider the following sequence of events:
>>
>> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
>> 2) Update foo -> foo/2
>> Compact the DB (foo/1 is deleted)
>> Start replicating to a mirror
>> Replication crashes before it reaches foo/2
>
> By crash you mean an error due to a conflict between foo/2 and foo/1'
> (the mirror's version of foo), right?
>
>> In your proposal, we should expect foo/1 to exist on the mirror, right?  I
>> think this means we'd need to modify the compaction algorithm to keep
>> revisions of documents if a) the revision was part of an atomic _bulk_docs,
>> and b) any of the documents in that transaction are still at the revision
>> generated by the transaction.  Same thing goes for revision stemming -- we
>> can never drop revisions if they were part of an atomic upload and at least
>> one of the document revs in the upload is still current.
>
> Yep. Personally I see this as a tradeoff, not a limitation per se. If
> you specify 'atomic' then you must pay more in terms of data size,
> performance, etc.

The problem as I see it is that someone else's bulk transaction will
have to sit around in my database until I edit all the docs in it.
Hopefully I won't get any distributed conflicts on other old versions
of docs in the group, because that would make edits that I've done
locally to other documents in the bulk group somehow less valid.

Distributed bulk transactions would make for chaotic behavior, as
someone's mostly unrelated change on a remote node could eventually
replicate to me (months later) and knock an entire line of work that
I've done into a conflict state.

If you want atomicity, put it in a single document.

Chris

>
> In 0.8 you would have theoretically had to pay by default, but didn't
> because replication broke transactions.
>
> The basic algorithm is still the same, but the garbage collected unit
> is changed (instead of garbage collecting document revisions it
> garbage collects revision sets, with the current case being a set with
> one member. The rules still apply (if this object is wholly shadowed
> by non conflicting changes then it can be disposed of)). IIRC the
> algorithm is a copying garbage collector, so this is pretty easy to do
> (you walk a DAG instead of a linked list).
>
> Under the proposed model you'd choose which operations are
> transactional and will have to pay for those.
>
>
> Anyway, thanks for your link as well, I was reading through a rather
> boring thread and didn't see this one, so I guess I did miss out. It
> seemed to imply the discussion was done only on IRC.
>
> Anyway, here goes...
>
> The fundamental problem is that any consistent data model needs at the
> very least to have atomic primitives and ordered message passing (with
> transactional message handlers) at the per-partition level, or
> atomicity and consistency is restricted to a single document.
>
> What concerns me is Damien's post
> (http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872b8-152c-42a6-9324-dd52534d9...@apache.org%3e):
>
>> No, CouchDB replication doesn't support replicating the transactions.
>> Never has, never will. That's more like transaction log replication
>> that's in traditional dbs, a different beast.
>>
>> For the new bulk transaction model, I'm only proposing supporting
>> eventual consistency. All changes are safe to disk, but the db may not
>> be in a consistent state right away.
>
> From what I know this assumption is wrong. Eventual consistency still
> needs atomic primitives, it's not about whether or not you have
> transactions, it's about what data they affect (eventual consistency
> involves breaking them down).
>
> Anyway, "never will" sounds pretty binding, but for the sake of argument:
>
> By using only insertions and idempotent updates for the bulk of the
> data changes, and a message queue whose handlers use atomic updates to
> integrate this data, one can implement a truly atomic distributed
> model, or an eventually consistent one; but without this, updates need to be
> restricted to exactly one document.
>
> Eventual consistency is still possible using either locks or by
> breaking down what would have been large distributed transactions into
> smaller ones, but 

Re: reiterating transactions vs. replication

2009-05-22 Thread Randall Leeds
On Fri, May 22, 2009 at 12:27, Randall Leeds wrote:
>
>
> Since I do like the component model, I'm planning to set up a github
> project to play with some consensus protocols and overlay networks in
> Erlang. Hopefully once I start doing that I'll start to see the places that
> CouchDB can hook into it and get a nice, clean, flexible API. I see the
> problem broken into several tiers.
>
> Transactional Bulk Docs (this is the wishlist and challenge, but has to
> rest on the below)
> Sharding/Replication (_seq consensus / possibly consistent hashing or other
> distributed, deterministic data structure mapping BTree nodes to servers
> [2])
> Communication (either Erlang or a tcp with pluggable overlay-network for
> routing)
>

A revised break-down should be something like:

Transactional Bulk-Docs
Single-Doc Multi-Replica Transactions
Replication / Sharding
Network

Example:

Transactional Bulk-Docs (Server pre-prepares itself as leader for a special
bulk round)
Single-Doc Multi-Replica Transactions (Simple consensus. Special leader for
bulk case. Pre-determined leader normally.)
Replication / Sharding (Any sort of load-balancing, slicing, or static
configuration)
Network (Chord and derivatives (Scalaris uses Chord #), Tapestry, Pastry,
etc)

I think with the right configurations and components, transactional bulk-docs
are just a special case of single-doc transactions. For example, if the
single-doc layer optimizes for fewer communication rounds by pre-selecting
leaders on a rotating basis, a bulk transaction just involves revoking all
nodes for a sequence-number consensus round and using an extra round trip to
"take over" the leader position. Then all nodes holding replicas of all
documents involved would have to participate in this new round (or at least
a majority of replicas). Having 'atomic=false' could skip this expense and
make a best-effort serial execution of the updates and fail on conflict.

Just trying to keep the conversation rolling. But I understand we have to
hit the code soon if this really stands to go somewhere.


Re: reiterating transactions vs. replication

2009-05-22 Thread Randall Leeds
On Fri, May 22, 2009 at 00:34, Paul Davis wrote:

> >>
> >> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
> >> 2) Update foo -> foo/2
> >> Compact the DB (foo/1 is deleted)
> >> Start replicating to a mirror
> >> Replication crashes before it reaches foo/2
> >
> > By crash you mean an error due to a conflict between foo/2 and foo/1'
> > (the mirror's version of foo), right?
> >
>
> Pretty sure he means the network link fails (or code fails, etc).


What about "a node has been compromised" or "someone is spoofing messages
from one of the nodes". These questions lead me to thinking about a plug-in
component model for this work. I can imagine very different requirements
with very different overhead even to just maintain the "normal" couchdb
guarantees (putting aside any transactional _bulk_docs).


> >> In your proposal, we should expect foo/1 to exist on the mirror, right?
>  I
> >> think this means we'd need to modify the compaction algorithm to keep
> >> revisions of documents if a) the revision was part of an atomic
> _bulk_docs,
> >> and b) any of the documents in that transaction are still at the
> revision
> >> generated by the transaction.  Same thing goes for revision stemming --
> we
> >> can never drop revisions if they were part of an atomic upload and at
> least
> >> one of the document revs in the upload is still current.
> >
>
> In general, there were two main ideas that I saw when reading
> literature on this subject:
>
> 1. Keep some sort of edit history
> 2. If a replication event occurs, and the target node is too out of
> date, then trigger a full database copy.
>
> At one point I suggested something like:
>
> 1. Keep some sort of edit history.
>We already do this. And with revision stemming we already have a
> configurable "How much history is kept" option. The fancy twist is
> that we don't remove old revisions from the update sequence btree
> until the revision stemming removes the revision info data. These
> revisions are then replicated as part of normal replication.
>
> 2. In the case of replicating from a node that's too far out of whack,
> instead of a full database copy, we just fall back to our current
> replication scheme in that we lose all transaction guarantees (or the
> guarantees that we can no longer guarantee, this is all quite hand
> wavy).
>
> For point 1, I can see one of a few methods to deal with transactions.
> Either we don't commit any of the docs in the transaction until they
> all make it across the wire, or we just mark them as a conflict (with
> maybe a 'in_transaction' modifier or some such). Keeping track of
> revisions is pretty cake because all the documents would be sequential
> in the update sequence btree. And it should also be easy to tell when
> a transaction is so old that we no longer have all the data necessary
> to make it work.


To be clear, not all documents are sequential in the update tree unless we
run some consensus protocol to decide the ordering or come up with some
other vector clock type solution. I don't know much about the latter, but
they've come up in discussions before on this list. I've thought about this
a lot lately and I really like the techniques of Mencius [1] which runs a
simplified Paxos that commits after only two communication rounds.

There's a huge win if we want to allow re-ordering of commits. This is
probably the case unless the application assumes some dependency between
documents (frequently noted as a Bad Idea). Many servers can commit writes
in one communication round. For example, a server can accept some other
server's proposal for sequence number i and commit i+1 (assuming it is the
leader for round i+1) even before it learns the result of the consensus for i
as long as i+1 and i touch different documents.
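The condition described above boils down to a commutativity check; a trivial sketch (mine, not from Mencius or CouchDB):

    # A later consensus round may be committed before an earlier,
    # still-undecided one only if the two rounds touch disjoint sets of
    # documents, so their effects commute.
    def can_commit_early(later_round_docs, undecided_round_docs):
        return set(later_round_docs).isdisjoint(undecided_round_docs)

    print(can_commit_early({"doc_a"}, {"doc_b"}))           # True: independent
    print(can_commit_early({"doc_a"}, {"doc_a", "doc_c"}))  # False: must wait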

For point 2, I think we should make a distinction between inter-node
replication and replication into and out of the clustered deployment in
order to discuss it well. Inter-node migration of shards might rely on
replication, but if this is the case it should probably be triggered and
managed "from above". In other words, it might involve passing around a
"view id" which increments on any shard migration as well and having nodes
propose "view changes" to the update sequence consensus when shards migrate.
When a view change proposal is accepted, replication starts. Only when
replication ends does the rest of the group "learn" the new mapping. If the
nodes cooperate on changing to a new view with a new shard-to-node mapping I
don't think there should ever be conflicts caused by replication.

Some thought needs to go into the other scenario (replicating in/out with
the cluster viewed as a single CouchDB instance), but something tells me if
we get globally ordered seq numbers it's trivial.


> >
> > However, in another post Damien said:
> >
> >> Which is why in general you want to avoid inter-document dependencies,
> >> or be relaxed in how you deal with them.
>

See my point above.


> Though

Re: reiterating transactions vs. replication

2009-05-21 Thread Paul Davis
>>
>> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
>> 2) Update foo -> foo/2
>> Compact the DB (foo/1 is deleted)
>> Start replicating to a mirror
>> Replication crashes before it reaches foo/2
>
> By crash you mean an error due to a conflict between foo/2 and foo/1'
> (the mirror's version of foo), right?
>

Pretty sure he means the network link fails (or code fails, etc).

>> In your proposal, we should expect foo/1 to exist on the mirror, right?  I
>> think this means we'd need to modify the compaction algorithm to keep
>> revisions of documents if a) the revision was part of an atomic _bulk_docs,
>> and b) any of the documents in that transaction are still at the revision
>> generated by the transaction.  Same thing goes for revision stemming -- we
>> can never drop revisions if they were part of an atomic upload and at least
>> one of the document revs in the upload is still current.
>

In general, there were two main ideas that I saw when reading
literature on this subject:

1. Keep some sort of edit history
2. If a replication event occurs, and the target node is too out of
date, then trigger a full database copy.

At one point I suggested something like:

1. Keep some sort of edit history.
We already do this. And with revision stemming we already have a
configurable "How much history is kept" option. The fancy twist is
that we don't remove old revisions from the update sequence btree
until the revision stemming removes the revision info data. These
revisions are then replicated as part of normal replication.

2. In the case of replicating from a node that's too far out of whack,
instead of a full database copy, we just fall back to our current
replication scheme in that we lose all transaction guarantees (or the
guarantees that we can no longer guarantee, this is all quite hand
wavy).

For point 1, I can see one of a few methods to deal with transactions.
Either we don't commit any of the docs in the transaction until they
all make it across the wire, or we just mark them as a conflict (with
maybe a 'in_transaction' modifier or some such). Keeping track of
revisions is pretty cake because all the documents would be sequential
in the update sequence btree. And it should also be easy to tell when
a transaction is so old that we no longer have all the data necessary
to make it work.

As Yuval describes, the underlying idea would be that you only pay the
cost if you so choose.

On the flip side, this adds a decent amount of complexity to the
replicator and book keeping to other parts of the database.

> Yep. Personally I see this as a tradeoff, not a limitation per se. If
> you specify 'atomic' then you must pay more in terms of data size,
> performance, etc.
>

[snip]

> What concerns me is Damien's post
> (http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872b8-152c-42a6-9324-dd52534d9...@apache.org%3e):
>
>> No, CouchDB replication doesn't support replicating the transactions.
>> Never has, never will. That's more like transaction log replication
>> that's in traditional dbs, a different beast.
>>
>> For the new bulk transaction model, I'm only proposing supporting
>> eventual consistency. All changes are safe to disk, but the db may not
>> be in a consistent state right away.
>
> From what I know this assumption is wrong. Eventual consistency still
> needs atomic primitives, it's not about whether or not you have
> transactions, it's about what data they affect (eventual consistency
> involves breaking them down).
>

I'm not sure I follow this part. What aspect of eventual consistency
requires atomicity guarantees? CouchDB eventual consistency is like
making dinner plans with a large group of friends. Sometimes different
parts of the network might have a different idea of which restaurant
everyone's meeting at, but assuming everyone remembered to charge
their phones eventually everyone will get to the right place.

> Anyway, "never will" sounds pretty binding, but for the sake of argument:
>

I think he was referring to the heavy log replication stuff that
RDBMSs tend towards. From what I've read, these types of approaches
require runtime characteristics that don't fit with the rest of
CouchDB.

If we found a transaction model that worked without hampering the core
design goals of CouchDB then I'm pretty sure everyone would be
extremely enthused about it.

[snip]

>
> However, in another post Damien said:
>
>> Which is why in general you want to avoid inter-document dependencies,
>> or be relaxed in how you deal with them.
>
> So I think I best shut up after this without some decision maker
> telling me not to, if my use case is not covered by the intended
> design then that's that, but I do think this thread sort of covers
> this:
>

Damien's advice is the best idea for most scenarios. It may end up
causing a bit more planning up front for what happens if you have
conflicts and how to take care of such things, but as it turns out, once
you have it working, then y

Re: reiterating transactions vs. replication

2009-05-21 Thread Yuval Kogman
2009/5/22 Yuval Kogman :

> From what I know this assumption is wrong. Eventual consistency still
> needs atomic primitives, it's not about whether or not you have
> transactions, it's about what data they affect (eventual consistency
> involves breaking them down).

Found this link in the archive as well:

http://queue.acm.org/detail.cfm?id=1394128

I think it explains why better than I do.


Re: reiterating transactions vs. replication

2009-05-21 Thread Yuval Kogman
2009/5/21 Adam Kocoloski :
> Hi Yuval, thanks for this well-written proposal.  I don't really want to
> rehash all the discussion from back in February (see the thread beginning at
> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84f66023-030a-4669-b75c-3dcc92d71...@yahoo.com%3e for
> a particularly detailed discussion), but I do want to comment on one aspect.
>
> Updating the replicator to be smart about atomic bulk transactions is doable
> (although a major undertaking), but when you throw DB compaction and
> revision stemming into the mix things get really hairy.  Recall that CouchDB
> revisions are used for concurrency control, not for maintaining history.
>  Consider the following sequence of events:
>
> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
> 2) Update foo -> foo/2
> Compact the DB (foo/1 is deleted)
> Start replicating to a mirror
> Replication crashes before it reaches foo/2

By crash you mean an error due to a conflict between foo/2 and foo/1'
(the mirror's version of foo), right?

> In your proposal, we should expect foo/1 to exist on the mirror, right?  I
> think this means we'd need to modify the compaction algorithm to keep
> revisions of documents if a) the revision was part of an atomic _bulk_docs,
> and b) any of the documents in that transaction are still at the revision
> generated by the transaction.  Same thing goes for revision stemming -- we
> can never drop revisions if they were part of an atomic upload and at least
> one of the document revs in the upload is still current.

Yep. Personally I see this as a tradeoff, not a limitation per se. If
you specify 'atomic' then you must pay more in terms of data size,
performance, etc.

In 0.8 you would have theoretically had to pay by default, but didn't
because replication broke transactions.

The basic algorithm is still the same, but the garbage collected unit
is changed (instead of garbage collecting document revisions it
garbage collects revision sets, with the current case being a set with
one member. The rules still apply (if this object is wholly shadowed
by non conflicting changes then it can be disposed of)). IIRC the
algorithm is a copying garbage collector, so this is pretty easy to do
(you walk a DAG instead of a linked list).

Under the proposed model you'd choose which operations are
transactional and will have to pay for those.


Anyway, thanks for your link as well, I was reading through a rather
boring thread and didn't see this one, so I guess I did miss out. It
seemed to imply the discussion was done only on IRC.

Anyway, here goes...

The fundamental problem is that any consistent data model needs at the
very least to have atomic primitives and ordered message passing (with
transactional message handlers) at the per-partition level, or
atomicity and consistency is restricted to a single document.

What concerns me is Damien's post
(http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872b8-152c-42a6-9324-dd52534d9...@apache.org%3e):

> No, CouchDB replication doesn't support replicating the transactions.
> Never has, never will. That's more like transaction log replication
> that's in traditional dbs, a different beast.
>
> For the new bulk transaction model, I'm only proposing supporting
> eventual consistency. All changes are safe to disk, but the db may not
> be in a consistent state right away.

From what I know this assumption is wrong. Eventual consistency still
needs atomic primitives, it's not about whether or not you have
transactions, it's about what data they affect (eventual consistency
involves breaking them down).

Anyway, "never will" sounds pretty binding, but for the sake of argument:

By using only insertions and idempotent updates for the bulk of the
data changes, and a message queue whose handlers use atomic updates to
integrate this data, one can implement a truly atomic distributed
model, or an eventually consistent one; but without this, updates need to be
restricted to exactly one document.

Eventual consistency is still possible using either locks or by
breaking down what would have been large distributed transactions into
smaller ones, but the key is that the code that will make things
actually consistent must still have ACID guarantees (and be dispatched
in order).

The 0.9 CouchDB model is effectively MyISAM without the data loss, but
just because the data is around doesn't mean it's possible to know
what to do with it (loss of context), or even fix it safely (the
conflict resolution code is susceptible to conflicts too).

Unfortunately, for eventual consistency to actually work, the breaking
down of operations must be done at the application level; the database
can't decide which data can be deferred and which data cannot.

All immutable data and all new data can obviously be added to the
database outside of a transaction, but eventually a transaction
linking this data must be part of an atomic mutation.

The only way to support this without atomi

Re: reiterating transactions vs. replication

2009-05-21 Thread Adam Kocoloski
Hi Yuval, thanks for this well-written proposal.  I don't really want  
to rehash all the discussion from back in February (see the thread  
beginning at http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84f66023-030a-4669-b75c-3dcc92d71...@yahoo.com%3e 
 for a particularly detailed discussion), but I do want to comment on  
one aspect.


Updating the replicator to be smart about atomic bulk transactions is  
doable (although a major undertaking), but when you throw DB  
compaction and revision stemming into the mix things get really  
hairy.  Recall that CouchDB revisions are used for concurrency  
control, not for maintaining history.  Consider the following sequence  
of events:


1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
2) Update foo -> foo/2
Compact the DB (foo/1 is deleted)
Start replicating to a mirror
Replication crashes before it reaches foo/2

In your proposal, we should expect foo/1 to exist on the mirror,  
right?  I think this means we'd need to modify the compaction  
algorithm to keep revisions of documents if a) the revision was part  
of an atomic _bulk_docs, and b) any of the documents in that  
transaction are still at the revision generated by the transaction.   
Same thing goes for revision stemming -- we can never drop revisions  
if they were part of an atomic upload and at least one of the document  
revs in the upload is still current.


Do you agree?  Best, Adam

On May 21, 2009, at 7:00 AM, Yuval Kogman wrote:


Hello,

In 0.9 CouchDB removed the transactional bulk docs feature in favour
of simplifying sharding/replication.

The priorities behind this decision as I understood them are:

    1. ensure that applications developed on a single server don't
suffer from a degradation of guarantees if deployed using sharding

   2. avoid the issues involving transactional


I apologize if this proposal has already been dismissed before. I did
search the mailing list archives, but mostly found a discussion on why
this stuff should not be done on IRC. I blame Jan for encouraging me
to post ;-)



So anyway, I think that we can have both features without needing to
implement something like Paxos, and without silently breaking apps
when they move to a sharding setup from a single machine setup.


The basic idea is to keep the conflicts-are-data approach, keeping the
current user visible replication and conflict resolution, but to allow
the user to ask for stricter conflict checking.

The API I propose is for the bulk docs operation to have an optional
'atomic' flag. When this flag is set CouchDB would atomically verify
that all documents were committed without conflict (with respect to
the supplied _rev), and if any one document conflicts, mark all of
them as conflicting.

Transaction recovery, conflict resolution etc. is still the
responsibility of the application, but this provides an atomic guarantee
that an inconsistent transaction will fail as a whole if it tries to
write inconsistent data to the database, a guarantee that cannot be
made using a client library (there are race conditions).



Now the hard parts:


1. Replication

The way I understand it replication currently works on a per document
approach. If 'atomic' was specified in a bulk operation I propose that
all the revisions created in that bulk operation be kept linked. If
these linked revisions are being replicated, the same conflict
resolution must be applied (the replication of the document itself is
executed as a bulk operation with atomic=true, replicating all
associated documents as well).

The caveat is that even if you always use bulk docs with the atomic
flag, if you switch replicas you could lose the D out of ACID:
documents which are marked as non conflicting in your replica might be
conflicting in the replica you switch to, in which case transactions
that have already been committed appear to be rolled back from the
application's point of view.

This problem obviously already exists in the current implementation,
but when 'atomic' is specified it could potentially happen a lot more
often.


2. Data sharding

This one is trickier. There are two solutions, both of which I think are
acceptable, and either or both of which could be used:


The easy way is to ignore this problem. Well not really: The client
must ensure that all the documents affected by a single transaction
are in the same shard, by using a partitioning scheme that allows
this.

If a bulk_docs operation with atomic set to true would affect multiple
shards, that is an error (the data could still be written as a
conflict, of course).

If you want to enable the 'atomic' flag you'll need to be careful
about how you use sharding. You can still use it for some of the
transactions, but not all the time. I think this is a flexible and
pragmatic solution.

This means that if you choose to opt in to fully atomic bulk doc
operations your app might not be deployable unmodified to a sharded
setup, but it's never unsafe (no data inconsistencies).

I

reiterating transactions vs. replication

2009-05-21 Thread Yuval Kogman
Hello,

In 0.9 CouchDB removed the transactional bulk docs feature in favour
of simplifying sharding/replication.

The priorities behind this decision as I understood them are:

1. ensure that applications developed on a single server don't
suffer from a degradation of guarantees if deployed using sharding

2. avoid the issues involving transactional


I apologize if this proposal has already been dismissed before. I did
search the mailing list archives, but mostly found a discussion on why
this stuff should not be done on IRC. I blame Jan for encouraging me
to post ;-)



So anyway, I think that we can have both features without needing to
implement something like Paxos, and without silently breaking apps
when they move to a sharding setup from a single machine setup.


The basic idea is to keep the conflicts-are-data approach, keeping the
current user visible replication and conflict resolution, but to allow
the user to ask for stricter conflict checking.

The API I propose is for the bulk docs operation to have an optional
'atomic' flag. When this flag is set CouchDB would atomically verify
that all documents were committed without conflict (with respect to
the supplied _rev), and if any one document conflicts, mark all of
them as conflicting.

Transaction recovery, conflict resolution etc. is still the
responsibility of the application, but this provides an atomic guarantee
that an inconsistent transaction will fail as a whole if it tries to
write inconsistent data to the database, a guarantee that cannot be
made using a client library (there are race conditions).
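To illustrate, here is a sketch of what such a request might look like on the wire. Note that the 'atomic' flag is exactly the proposal under discussion (not an existing CouchDB option), and the database name and document bodies are made up.

    import json
    import urllib.request

    payload = {
        "atomic": True,  # proposed flag: all docs commit conflict-free, or all are marked conflicting
        "docs": [
            {"_id": "slideshow-1", "_rev": "1-abc123", "photo_order": ["p2", "p1"]},
            {"_id": "photo-p3", "caption": "newly added photo"},
        ],
    }
    req = urllib.request.Request(
        "http://localhost:5984/mydb/_bulk_docs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Under the proposal, either every doc commits without conflict (with
    # respect to its supplied _rev) or the whole group is marked conflicting;
    # conflict resolution itself stays the application's job.
    with urllib.request.urlopen(req) as resp:
        print(resp.status, json.loads(resp.read()))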



Now the hard parts:


1. Replication

The way I understand it replication currently works on a per document
approach. If 'atomic' was specified in a bulk operation I propose that
all the revisions created in that bulk operation be kept linked. If
these linked revisions are being replicated, the same conflict
resolution must be applied (the replication of the document itself is
executed as a bulk operation with atomic=true, replicating all
associated documents as well).

The caveat is that even if you always use bulk docs with the atomic
flag, if you switch replicas you could lose the D out of ACID:
documents which are marked as non conflicting in your replica might be
conflicting in the replica you switch to, in which case transactions
that have already been committed appear to be rolled back from the
application's point of view.

This problem obviously already exists in the current implementation,
but when 'atomic' is specified it could potentially happen a lot more
often.


2. Data sharding

This one is trickier. There are two solutions, both of which I think are
acceptable, and either or both of which could be used:


The easy way is to ignore this problem. Well not really: The client
must ensure that all the documents affected by a single transaction
are in the same shard, by using a partitioning scheme that allows
this.

If a bulk_docs operation with atomic set to true would affect multiple
shards, that is an error (the data could still be written as a
conflict, of course).

If you want to enable the 'atomic' flag you'll need to be careful
about how you use sharding. You can still use it for some of the
transactions, but not all the time. I think this is a flexible and
pragmatic solution.

This means that if you choose to opt in to fully atomic bulk doc
operations your app might not be deployable unmodified to a sharded
setup, but it's never unsafe (no data inconsistencies).

In my opinion this satisfies the requirement for no degradation of
guarantees. It might not Just Work, but you can't have your cake and
eat it too at the end of the day.




The second way is harder but potentially still interesting. I've
included it mostly for the sake of discussion.

The core idea is to provide low level primitives on top of which a
client or proxy can implement a multi phase commit protocol.

The number of nodes involved in the transaction depends on the data
in the transaction (it doesn't need to coordinate all the nodes in the
cluster).

Basically this would break down bulk doc calls into several steps.
First all the data is inserted to the backend, but it's set as
conflicting so that it's not accidentally visible.

This operation returns an identifier for the bulk doc operation
(essentially a ticket for a prepared transaction).

Once the data is available on all the shards it must be made live
atomically. A two-phase commit starts by acquiring locks on all
the transaction tickets and trying to apply them (the 'promise'
phase), and then finalizing that application atomically (the 'accept'
phase).

To keep things simple, the two phases should be scoped to a single
keep-alive connection. If the connection drops, the locks should be
released.
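As a purely illustrative outline (every method name below is hypothetical; none of these primitives exist in CouchDB), a coordinator for the scheme above might look like:

    def atomic_bulk_docs(docs_by_shard):
        # Step 1: write each shard's docs as hidden/conflicting revisions and
        # collect one "prepared transaction" ticket per shard.
        tickets = {shard: shard.prepare(docs) for shard, docs in docs_by_shard.items()}
        try:
            # "Promise" phase: lock every ticket over a single keep-alive
            # connection; if the connection drops, the locks are released.
            for shard, ticket in tickets.items():
                shard.lock(ticket)
            # "Accept" phase: atomically flip all the prepared revisions live.
            for shard, ticket in tickets.items():
                shard.commit(ticket)
        except Exception:
            # On any failure the prepared revisions stay hidden as conflicts
            # and the tickets are abandoned; the application retries or resolves.
            for shard, ticket in tickets.items():
                shard.abort(ticket)
            raise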

Obviously Paxos ensues, but here's the catch:

 - The synchronization can be implemented first as a third-party
component; it doesn't need to affect CouchDB's core

 - The atomic primitives are also useful for writin