[jira] Commented: (COUCHDB-968) Duplicated IDs in _all_docs

Paul Joseph Davis (JIRA) Tue, 14 Dec 2010 08:21:28 -0800

    [ 
https://issues.apache.org/jira/browse/COUCHDB-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971294#action_12971294
 ]


Paul Joseph Davis commented on COUCHDB-968:
-------------------------------------------

Holy complicated-as-shit-algorithm, Batman!

The complexity of our implementation vs the complexity of what we're actually 
doing is starting to worry me here. Perhaps we should consider revisiting 
things to make this easier.

For instance, a simpler algorithm might look like this:

  1. Break each tree into a sorted flat list of child/parent pairs.
  2. Merge sort these lists making sure to pick the appropriate value when 
child/parent values match.
  3. Build the new tree.
  4. Assert the final tree is an actual tree.
  5. Stem
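
The five steps above can be sketched roughly as follows. This is only an
illustration in Python, not CouchDB's Erlang code. The names (flatten,
merge_pairs, build, stem, merge_trees), the (key, value, children) node shape,
and the "prefer the non-missing value" rule are all assumptions made for the
example. A root whose parent was stemmed away is recorded with parent None,
and for brevity the stem is applied to the flat list before the final build.

```python
def flatten(forest):
    """Step 1: list of (key, parent_key, value) entries, sorted by key."""
    out, stack = [], [(node, None) for node in forest]
    while stack:
        (key, value, children), parent = stack.pop()
        out.append((key, parent, value))
        stack.extend((child, key) for child in children)
    return sorted(out, key=lambda entry: entry[0])


def merge_pairs(a, b, pick=lambda x, y: x if x is not None else y):
    """Step 2: merge two sorted lists, reconciling entries with equal keys."""
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            merged.append(a[i]); i += 1
        elif a[i][0] > b[j][0]:
            merged.append(b[j]); j += 1
        else:
            key, pa, va = a[i]
            _, pb, vb = b[j]
            if pa is not None and pb is not None and pa != pb:
                raise ValueError("conflicting parents for %r" % key)
            # A stemmed copy may not know the parent; keep whichever is known.
            merged.append((key, pa if pa is not None else pb, pick(va, vb)))
            i += 1; j += 1
    return merged + a[i:] + b[j:]


def build(entries):
    """Steps 3 and 4: rebuild the forest and check it really is one."""
    nodes = {}
    for key, _, value in entries:
        if key in nodes:
            raise ValueError("duplicate key %r" % key)
        nodes[key] = (key, value, [])
    roots = []
    for key, parent, _ in entries:
        if parent in nodes:
            nodes[parent][2].append(nodes[key])
        else:
            roots.append(nodes[key])  # parent unknown or stemmed away
    # Every node must be reachable from a root; a cycle would leave some out.
    seen, stack = 0, list(roots)
    while stack:
        _, _, children = stack.pop()
        seen += 1
        stack.extend(children)
    if seen != len(nodes):
        raise ValueError("entries do not form a forest")
    return roots


def stem(entries, limit):
    """Step 5: keep only entries within `limit` revisions of some leaf."""
    parent_of = {key: parent for key, parent, _ in entries}
    has_child = set(parent_of.values())
    keep = set()
    for leaf in parent_of:
        if leaf in has_child:
            continue
        node, steps = leaf, 0
        while node in parent_of and steps < limit:
            keep.add(node)
            node, steps = parent_of[node], steps + 1
    return [entry for entry in entries if entry[0] in keep]


def merge_trees(a, b, limit):
    merged = merge_pairs(flatten(a), flatten(b))
    build(merged)  # validate the unstemmed result before stemming
    return build(stem(merged, limit))
```

Because every step works on globally unique keys, a stemmed input that has
become a forest needs no special casing: its roots are simply entries whose
parent is missing from the merged list.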

I'm starting to think that maybe once we added merging it broke too many 
assumptions in the tree merge code. While we could think of a stemmed tree as a 
tree with branches that don't exist, it has actually become a forest. Merging 
disjoint forests is a bit different from merging two trees.

Instead of trying to bend the tree merge to our will, I'd vote that we just 
rely on trees having globally unique keys and use a different algorithm 
altogether.

Thoughts?

> Duplicated IDs in _all_docs
> ---------------------------
>
>                 Key: COUCHDB-968
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-968
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.10.1, 0.10.2, 0.11.1, 0.11.2, 1.0, 1.0.1, 1.0.2
>         Environment: any
>            Reporter: Sebastian Cohnen
>            Assignee: Adam Kocoloski
>            Priority: Blocker
>             Fix For: 0.11.3, 1.0.2, 1.1
>
>
> We have a database, which is causing serious trouble with compaction and 
> replication (huge memory and cpu usage, often causing couchdb to crash b/c 
> all system memory is exhausted). Yesterday we discovered that db/_all_docs is 
> reporting duplicated IDs (see [1]). Until a few minutes ago we thought that 
> there were only a few duplicates, but today I took a closer look and found 10 
> IDs which sum up to a total of 922 duplicates. Some of them have only 1 
> duplicate, others have hundreds.
> Some facts about the database in question:
> * ~13k documents, with 3-5k revs each
> * all duplicated documents are in conflict (with 1 up to 14 conflicts)
> * compaction is run on a daily basis
> * several thousand updates per hour
> * multi-master setup with pull replication from each other
> * delayed_commits=false on all nodes
> * used couchdb versions 1.0.0 and 1.0.x (*)
> Unfortunately the database's contents are confidential and I'm not allowed to 
> publish it.
> [1]: Part of http://localhost:5984/DBNAME/_all_docs
> ...
> {"id":"9997","key":"9997","value":{"rev":"6096-603c68c1fa90ac3f56cf53771337ac9f"}},
> {"id":"9999","key":"9999","value":{"rev":"6097-3c873ccf6875ff3c4e2c6fa264c6a180"}},
> {"id":"9999","key":"9999","value":{"rev":"6097-3c873ccf6875ff3c4e2c6fa264c6a180"}},
> ...
> [*]
> There were two (old) servers (1.0.0) in production (already having the 
> replication and compaction issues). Then two servers (1.0.x) were added and 
> replication was set up to bring them in sync with the old production servers 
> since the two new servers were meant to replace the old ones (to update 
> node.js application code among other things).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
