Ronny,

Replication will flag a document as conflicting. CouchDB will choose a
document to be the current version and mark that there are conflicts.
Then it's your responsibility to resolve the conflict as a user.
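For concreteness, the winner CouchDB picks is deterministic, so every replica selects the same current version and the losers stay retrievable until someone resolves them. A minimal sketch of such a selection rule (the exact tie-break is an assumption here, not a quote of CouchDB's internals; in practice you'd list the losers with `GET /db/docid?conflicts=true` and delete them via `DELETE /db/docid?rev=...`):

```python
def pick_winner(revs):
    """Pick a deterministic winner among conflicting _rev strings.

    Assumption: the revision with the longer edit history wins, and ties
    are broken by comparing the revision suffix, so every replica agrees
    without coordination.
    """
    def key(rev):
        seq, _, suffix = rev.partition("-")
        return (int(seq), suffix)
    return max(revs, key=key)

# Example: two replicas produced conflicting third revisions.
print(pick_winner(["3-aaa", "3-zzz", "2-bbb"]))  # -> 3-zzz
```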
Paul

On Fri, Sep 19, 2008 at 3:15 AM, Ronny Hanssen <[EMAIL PROTECTED]> wrote:
> Thanks Paul.
> I am starting to accept the fact that internal rev-control is not to be
> used. And I feel I am barely starting to understand why it shouldn't be
> used. I thought that compacting was the only reason (and compacting is
> manual and controlled - at least for now...), so that's why I am being
> stubborn about it. :)
>
> I understood how the process worked, but my problem was how the A
> revisions are merged. How do I handle the conflict during replication?
> I have never actually performed a replication before. Will it do
> internal conflict management where the latest edit wins? Or is the
> process calling a conflict handler I set up beforehand? Or is the item
> flagged so that I can pick up conflicting docs after replication? This
> is one of the reasons I am holding back on the idea of replication (for
> any other purpose than backup).
>
> Thanks again for a descriptive answer, it's most appreciated.
>
> Regards,
> Ronny
>
> 2008/9/18 Paul Davis <[EMAIL PROTECTED]>
>
>> So two things here. First, the update steps, similar to before:
>>
>> 1. Get the current version of document id X
>> 2. Clone doc X, making doc Y
>> 3. Make doc Y a history doc:
>>    a. Y._id = new_uuid()
>>    b. Y.is_current_revision = false
>>    c. Delete Y._rev
>>    d. X.previous_version = Y._id
>> 4. Edit doc X as desired
>> 5. In a single HTTP request, send both documents to _bulk_docs
>>
>> So, given a document A, we have a current history of A.previous = C._id.
>> Getting A, clone A to get B and edit as per step 3.
>> Now we have A.previous = B, B.previous = C, or A -> B -> C.
>>
>> To make this permanent, we post both A and B to _bulk_docs. If A was
>> edited simultaneously, our A will get rejected, as will B. So nothing
>> changed; you'd resolve this as per any other normal situation.
>>
>> This will work in the face of replication.
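The numbered steps above can be sketched in Python. The document shapes and field names (`is_current_revision`, `previous_version`) come straight from the message; the actual `_bulk_docs` POST is only indicated in comments:

```python
import uuid

def make_history_pair(current, edits):
    """Steps 1-5 from the message: clone current doc X into history
    doc Y, relink, edit X, and return both for one _bulk_docs request."""
    y = dict(current)                  # 2. clone doc X making doc Y
    y["_id"] = uuid.uuid4().hex        # 3a. fresh id for the history doc
    y["is_current_revision"] = False   # 3b. mark Y as history
    y.pop("_rev", None)                # 3c. delete Y._rev (it's a new doc)
    x = dict(current)
    x["previous_version"] = y["_id"]   # 3d. link head -> history
    x.update(edits)                    # 4. edit doc X as desired
    return x, y   # 5. POST {"docs": [x, y]} to /db/_bulk_docs

x, y = make_history_pair({"_id": "X", "_rev": "1-abc", "body": "old"},
                         {"body": "new"})
# If X was edited concurrently, the whole _bulk_docs request fails:
# 0 docs written on conflict, 2 on success.
```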
>> Just the same as with any other replication, we may have to resolve
>> conflicts, but our histories should never conflict. What this system
>> does introduce is this:
>>
>> Given that someone did the A->B->C above, say someone else
>> simultaneously did A->D->C and we replicate.
>>
>> B, C, and D will not conflict.
>>
>> The two versions of A will. We resolve this as we would for any case.
>> Then we indicate that A now has *two* previous histories:
>> A -> (B or D) -> C.
>>
>> Full stop.
>>
>> Using the built-in revisioning for app-dependent revisioning is a Bad
>> Idea ™. It's not meant for that and shouldn't be relied on. I am not
>> saying "Don't use the built-in rev-control for rev-control." I'm
>> saying "Don't use the built-in collision detection system, which is
>> not at all meant for rev-control, for rev-control." I know, shades of
>> gray and all.
>>
>> The single-node case can't be handled by the internal revision
>> control. You may think it can. It may look like it can. But it just
>> can't. You'll be whistling along and then wham! Something will happen
>> and you'll be up shit creek. (Something will happen = accidental
>> compaction, need for replication, changes to CouchDB internals
>> invalidating this approach, a meteor hits your datacenter, you get
>> the idea.)
>>
>> We can't use the internal _rev system for multi-node stuff because
>> old revisions are never replicated. Not even attempted to be
>> replicated. CouchDB ideology says that there is one version of each
>> document: the most recent revision. Yes, it is possible to obtain
>> previous revisions, making it look like revision control, but that's
>> an effect of the implementation and hence should not be relied upon.
>> (Caveats apply; using it for things like undo etc. is probably kosher
>> as long as you handle the possibly missing document.)
>>
>> HTH,
>> Paul
>>
>> On Thu, Sep 18, 2008 at 11:13 AM, Ronny Hanssen <[EMAIL PROTECTED]>
>> wrote:
>> > Ok, I get it...
>> > I understand _bulk_docs is atomic, but I missed that you actually
>> > preserved the *original* doc._id (doh). I thought that by clone you
>> > meant a new doc in CouchDB, with its own id. And I just couldn't
>> > understand why you did that :). This now makes more sense to me.
>> > Sorry.
>> >
>> >> As to replication, what you'd need is a flag that says whether a
>> >> particular node is the head node. Then your history docs should
>> >> never clash. If you get conflicts on the head node, you resolve
>> >> them and store all conflicting previous revisions. In this manner
>> >> your linked list becomes a linked directed acyclic graph. (Yay
>> >> college) This does mean that at any given point in the history you
>> >> could possibly have multiple versions of the same doc, but
>> >> replication works.
>> >
>> > Ok, but how is that flag supposed to be set? At the time of
>> > inserting with _bulk_docs the system needs to update the current
>> > doc, which means that any node racing during an update will flag
>> > its copy as current and actual. Which means that replication in
>> > race conditions will conflict(?).
>> >
>> > I am just asking because the single-node case could be handled by
>> > the internal CouchDB revision control. So, using the elaborate
>> > scheme you propose isn't really helping for that scenario. My
>> > impression was that we cannot use the internal CouchDB rev-control
>> > due to the difficulties in handling conflicts with multiple nodes
>> > involved (because conflicts could/would occur), and that this would
>> > be better handled by hand-coded rev-control.
>> >
>> > It seems to me that there are no solutions on how to do this by
>> > hand coding either. So it seems we are saying "don't use the
>> > built-in rev-control for rev-control of data" to avoid people
>> > blaming CouchDB when the built-in "revision control" conflicts.
>> >
>> > Thanks for your patience, guys.
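The merged history discussed above (A -> (B or D) -> C) can be modelled by letting `previous_version` hold a list of ids. A hedged sketch of that DAG walk (`previous_version` is the field name from the thread; the walk helper and doc ids are invented for illustration):

```python
# After resolving the conflicting As, record *both* histories on the
# winner; the linked list becomes a directed acyclic graph.
a = {"_id": "A", "previous_version": ["B-id", "D-id"]}

def walk(doc, fetch):
    """Return every history path from a doc back to the tail."""
    prev = doc.get("previous_version") or []
    if isinstance(prev, str):      # single-parent docs keep a plain id
        prev = [prev]
    paths = []
    for pid in prev:
        for path in walk(fetch(pid), fetch) or [[]]:
            paths.append([pid] + path)
    return paths

store = {"B-id": {"_id": "B-id", "previous_version": "C-id"},
         "D-id": {"_id": "D-id", "previous_version": "C-id"},
         "C-id": {"_id": "C-id"}}
print(walk(a, store.__getitem__))  # -> [['B-id', 'C-id'], ['D-id', 'C-id']]
```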
>> >
>> > ~Ronny
>> >
>> > 2008/9/18 Paul Davis <[EMAIL PROTECTED]>
>> >
>> >> Ronny,
>> >>
>> >> There are two points that I think you're missing.
>> >>
>> >> 1. _bulk_docs is atomic. As in, if one doc fails, they all fail.
>> >> 2. I was trying to make sure that the latest _id of a doc is
>> >>    constant.
>> >>
>> >> Think of this as a linked list. You grab the head document (most
>> >> current revision) and clone it. Then we change the uuid of the
>> >> second doc and make our pointer links fit into the list. Then,
>> >> after making the necessary changes, we edit the head node as
>> >> desired. Now we post *both* docs (in the same HTTP request!) to
>> >> _bulk_docs. This ensures that if someone else edited this
>> >> particular doc, the revisions will be different and the second
>> >> edit will fail. Thus, on success 2 docs are inserted; on failure,
>> >> 0 docs.
>> >>
>> >> As to replication, what you'd need is a flag that says whether a
>> >> particular node is the head node. Then your history docs should
>> >> never clash. If you get conflicts on the head node, you resolve
>> >> them and store all conflicting previous revisions. In this manner
>> >> your linked list becomes a linked directed acyclic graph. (Yay
>> >> college) This does mean that at any given point in the history you
>> >> could possibly have multiple versions of the same doc, but
>> >> replication works.
>> >>
>> >> For views, you'd just want to have a flag that says "not the most
>> >> recent version". Then in your view you would know whether to emit
>> >> key/value pairs for it. This could be something like "no
>> >> next-version pointer" or some such. Actually, this couldn't be a
>> >> next pointer without two initial gets, because you'd need to get
>> >> both the head node and the next node. A boolean flag indicating
>> >> head-node status would be sufficient, though.
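The head-node flag described above maps directly onto a view: emit rows only for documents still marked current. A sketch of such a design document (the design-doc and view names are invented for illustration; the map function itself would be JavaScript, shown here as a string inside a Python dict):

```python
# Hypothetical design doc: only current-revision docs appear in the view.
design_doc = {
    "_id": "_design/versioned",
    "views": {
        "current": {
            "map": (
                "function(doc) {"
                "  if (doc.is_current_revision) emit(doc._id, null);"
                "}"
            )
        }
    }
}

# The same filter, simulated in Python for a handful of docs:
docs = [
    {"_id": "A", "is_current_revision": True},
    {"_id": "A-hist-1", "is_current_revision": False},
]
current_ids = [d["_id"] for d in docs if d.get("is_current_revision")]
print(current_ids)  # -> ['A']
```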
>> >> And then you could have a history view if you ever need to walk
>> >> from tail to head.
>> >>
>> >> HTH,
>> >> Paul
>> >>
>> >> On Wed, Sep 17, 2008 at 9:35 PM, Ronny Hanssen <[EMAIL PROTECTED]>
>> >> wrote:
>> >> > Hm.
>> >> >
>> >> > In Paul's case I am not 100% sure what is going on. Here's a use
>> >> > case for two concurrent edits:
>> >> > * First, two users get the original.
>> >> > * Both make a copy, which they save.
>> >> >   This means that there are two fresh docs in CouchDB (even on a
>> >> >   single node).
>> >> > * Save the original using a new doc._id (which the copy is to
>> >> >   persist in copy.previous_version).
>> >> >   This means that the two new docs know where to find their
>> >> >   previous versions.
>> >> > The problem I have with this scheme is that every change of a
>> >> > document means that it needs to store not only the new version,
>> >> > but also its old version (in addition to the original). The fact
>> >> > that two racing updates will generate 4(!) new docs in addition
>> >> > to the original document is worrying. I guess Paul also wants the
>> >> > original to be marked as deleted in the _bulk_docs? But in any
>> >> > case the previous versions are now two new docs, and they look
>> >> > exactly the same, except for the doc._id, naturally...
>> >> >
>> >> > Wouldn't this be enough, Paul?
>> >> > 1. old = get_doc()
>> >> > 2. update = clone(old);
>> >> > 3. update.previous_version = old._id;
>> >> > 4. post via _bulk_docs
>> >> >
>> >> > This way there won't be multiple old docs around.
>> >> >
>> >> > Jan's way ensures that for a view there is always only one
>> >> > current version of a doc, since it is using the built-in
>> >> > rev-control. Competing updates on the same node may fail, which
>> >> > is what CouchDB is designed to handle. If on different nodes,
>> >> > then the rev-control history might come "out of synch" via
>> >> > concurrent updates.
>> >> > How does CouchDB handle this? Which update wins? On a single
>> >> > node this is intercepted when saving the doc. For multiple nodes
>> >> > they might both get a response saying "save complete". So these
>> >> > then need merging. How is that done? Jan further secures the
>> >> > previous version by storing it as a new doc, allowing it to be
>> >> > persisted beyond compaction. I guess Jan's sample would benefit
>> >> > nicely from _bulk_docs too. I like this method due to the fact
>> >> > that it allows only one current doc. But I worry about how
>> >> > revision control handles conflicts, Jan?
>> >> >
>> >> > Paul's and my updated suggestion always post new versions, not
>> >> > using the revision system at all. The downside is that there may
>> >> > be multiple current versions around... And this is a bit tricky,
>> >> > I believe... Anyone?
>> >> >
>> >> > Paul's suggestion also keeps multiple copies of the previous
>> >> > version. I am not sure why, Paul?
>> >> >
>> >> > Regards,
>> >> > Ronny
>> >> >
>> >> > 2008/9/17 Paul Davis <[EMAIL PROTECTED]>
>> >> >
>> >> >> Good point, Chris.
>> >> >>
>> >> >> On Wed, Sep 17, 2008 at 11:39 AM, Chris Anderson
>> >> >> <[EMAIL PROTECTED]> wrote:
>> >> >> > On Wed, Sep 17, 2008 at 11:34 AM, Paul Davis
>> >> >> > <[EMAIL PROTECTED]> wrote:
>> >> >> >> Alternatively something like the following might work:
>> >> >> >>
>> >> >> >> Keep an eye on the specifics of _bulk_docs though. There
>> >> >> >> have been requests to make it non-atomic, but I think in the
>> >> >> >> face of something like this we might make non-atomic
>> >> >> >> _bulk_docs a non-default or some such.
>> >> >> >
>> >> >> > I think the need for non-transactional bulk-docs will be
>> >> >> > obviated when we have the failure response say which docs
>> >> >> > caused the failure; that way one can retry once to save all
>> >> >> > the non-conflicting docs, and then loop back through to
>> >> >> > handle the conflicts.
>> >> >> >
>> >> >> > Upshot: I bet you can count on bulk docs being transactional.
>> >> >> >
>> >> >> > --
>> >> >> > Chris Anderson
>> >> >> > http://jchris.mfdz.com
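Chris's retry pattern, per-doc failure reporting plus one retry pass, can be sketched like this (`bulk_save` and its per-doc result format are hypothetical stand-ins, not the actual 2008-era `_bulk_docs` response):

```python
def save_all(bulk_save, docs, resolve_conflict):
    """Try to save everything; non-conflicting docs succeed on the
    first pass, conflicting ones are resolved and retried once."""
    results = bulk_save(docs)  # hypothetical: "ok"/"conflict" per doc
    retries = [resolve_conflict(d)
               for d, r in zip(docs, results) if r == "conflict"]
    if retries:
        bulk_save(retries)     # second pass for the resolved docs
    return len(retries)

# Toy run: doc "B" conflicts on the first pass and is retried.
def fake_bulk_save(docs):
    return ["conflict" if d["_id"] == "B" and "resolved" not in d else "ok"
            for d in docs]

n = save_all(fake_bulk_save,
             [{"_id": "A"}, {"_id": "B"}],
             lambda d: dict(d, resolved=True))
print(n)  # -> 1
```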
