[ https://issues.apache.org/jira/browse/COUCHDB-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965258#action_12965258 ]
Adam Kocoloski commented on COUCHDB-968:
----------------------------------------
@davisp
> Pre compaction in _changes I would expect the same _revision (I think, just
> guessing) because it's just iterating the by_seqid_btree and then displaying
> the update_seq from the actual #full_doc_info (I think, just guessing).
Nope, that's not how _changes works. It walks the seq tree and displays the
high_seq from the #doc_info record stored there. The #full_doc_info from the
id tree is not involved. That's why the duplicate entries for a given document
in a _changes response have different "seq" values before compaction.
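To make that concrete, the walk amounts to something like this (a from-memory
sketch against the 1.0.x tree; changes_walk is just an illustrative name, and
the record and field names are my recollection of couch_db.hrl, so take the
details with a grain of salt):

    -include("couch_db.hrl").

    %% Fold the by-seq btree; every leaf is a #doc_info, and the "seq"
    %% shown by _changes is the high_seq stored in that record.
    changes_walk(#db{docinfo_by_seq_btree = SeqTree}, Since) ->
        couch_btree:fold(SeqTree,
            fun(#doc_info{id = Id, high_seq = Seq}, _Reds, Acc) ->
                %% Duplicate seq-tree entries for one doc each carry
                %% their own high_seq, hence the distinct "seq" values
                %% before compaction.
                {ok, [{Seq, Id} | Acc]}
            end,
            [], [{start_key, Since + 1}]).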
On the other hand, compaction does something closer to what you described: it
grabs the #full_doc_info from the id tree and constructs a #doc_info from it.
When compacting a database with duplicates in the seq tree, it grabs the same
#full_doc_info from the id tree multiple times, and each time constructs a new
(identical) #doc_info record to insert into the compacted seq tree. This
explains why the _changes response looks different before and after compaction.
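In code terms the copy step does roughly this (copy_entry is my name for it;
the real logic lives in couch_db_updater, and this is a hedged sketch, not a
paste):

    %% The seq-tree entry is only consulted for its id; the one
    %% authoritative record is the #full_doc_info in the id tree.
    copy_entry(#db{fulldocinfo_by_id_btree = IdTree}, #doc_info{id = Id}) ->
        [{ok, FullDocInfo}] = couch_btree:lookup(IdTree, [Id]),
        %% to_doc_info/1 sets high_seq from the #full_doc_info's
        %% update_seq, so N duplicates in the old seq tree yield N
        %% identical #doc_info records - all with seq 34 above.
        couch_doc:to_doc_info(FullDocInfo).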
Stop me if I'm not making sense. This part of the issue is very clear in my
head, but I may not be explaining it well. For reference, here are the results
I'm trying to explain:
$ curl localhost:5984/db1/_changes
{"results":[
{"seq":3,"id":"_design/update","changes":[{"rev":"1-18867805c5d826b6d58312e4430e40fe"}]},
{"seq":9,"id":"foo","changes":[{"rev":"7-c660ea7a73efa1b9f727146ef7ca71ed"}]},
{"seq":21,"id":"foo","changes":[{"rev":"13-dde4cd2d68f911fe27bd62c6c4aec0ed"}]},
{"seq":34,"id":"foo","changes":[{"rev":"19-71621b918e86377e61618feeaee48a74"}]}
],
"last_seq":34}
$ curl localhost:5984/db1/_compact -d '{}' -Hcontent-type:application/json
{"ok":true}
$ curl localhost:5984/db1/_changes
{"results":[
{"seq":3,"id":"_design/update","changes":[{"rev":"1-18867805c5d826b6d58312e4430e40fe"}]},
{"seq":34,"id":"foo","changes":[{"rev":"19-71621b918e86377e61618feeaee48a74"}]},
{"seq":34,"id":"foo","changes":[{"rev":"19-71621b918e86377e61618feeaee48a74"}]},
{"seq":34,"id":"foo","changes":[{"rev":"19-71621b918e86377e61618feeaee48a74"}]}
],
"last_seq":34}
At any rate, I definitely agree that the core issue is in merge_rev_trees and
stem. However, databases that are already stuck with duplicates will not have
them removed by solution #1. I think forcing a compaction in "retry" mode
would repair them, though.
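For context, retry mode should work here because copy_docs first deletes the
target's existing entries for the ids it is about to re-copy. From memory of
couch_db_updater:copy_docs in 1.0.x (again a sketch, so details may be off):

    RemoveSeqs = case Retry of
    false ->
        [];
    true ->
        %% These docs may already have been copied in an earlier,
        %% interrupted pass. Look them up in the *target* id tree and
        %% collect their seqs for deletion, so each id lands in the
        %% new seq tree exactly once.
        Ids = [Id || #doc_info{id = Id} <- InfoBySeq],
        Existing = couch_btree:lookup(NewDb#db.fulldocinfo_by_id_btree, Ids),
        [Seq || {ok, #full_doc_info{update_seq = Seq}} <- Existing]
    end,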
@bitdiddle I noticed the update_seq running up too. I guess the db2 -> db1
replicator is bumping the update_seq on db1 even when nothing changes.
That may be a separate low-priority bug, or it may be central to the problem.
Not sure yet.
> Duplicated IDs in _all_docs
> ---------------------------
>
> Key: COUCHDB-968
> URL: https://issues.apache.org/jira/browse/COUCHDB-968
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11.1, 0.11.2, 1.0, 1.0.1, 1.0.2
> Environment: Ubuntu 10.04.
> Reporter: Sebastian Cohnen
> Priority: Blocker
>
> We have a database which is causing serious trouble with compaction and
> replication (huge memory and CPU usage, often causing CouchDB to crash
> because all system memory is exhausted). Yesterday we discovered that
> db/_all_docs is reporting duplicated IDs (see [1]). Until a few minutes ago
> we thought there were only a few duplicates, but today I took a closer look
> and found 10 IDs that account for a total of 922 duplicates. Some of them
> have only 1 duplicate; others have hundreds.
> Some facts about the database in question:
> * ~13k documents, with 3-5k revs each
> * all duplicated documents are in conflict (with 1 to 14 conflicts each)
> * compaction is run on a daily basis
> * several thousand updates per hour
> * multi-master setup with pull replication from each other
> * delayed_commits=false on all nodes
> * CouchDB versions 1.0.0 and 1.0.x were used (*)
> Unfortunately the database's contents are confidential and I'm not allowed
> to publish them.
> [1]: Part of http://localhost:5984/DBNAME/_all_docs
> ...
> {"id":"9997","key":"9997","value":{"rev":"6096-603c68c1fa90ac3f56cf53771337ac9f"}},
> {"id":"9999","key":"9999","value":{"rev":"6097-3c873ccf6875ff3c4e2c6fa264c6a180"}},
> {"id":"9999","key":"9999","value":{"rev":"6097-3c873ccf6875ff3c4e2c6fa264c6a180"}},
> ...
> [*]
> There were two old servers (1.0.0) in production, already showing the
> replication and compaction issues. Then two new servers (1.0.x) were added
> and replication was set up to bring them in sync with the old production
> servers, since the new servers were meant to replace the old ones (to
> update the Node.js application code, among other things).