[jira] [Commented] (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

Filipe Manana (JIRA) Mon, 23 May 2011 04:39:32 -0700

    [ 
https://issues.apache.org/jira/browse/COUCHDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037900#comment-13037900
 ]


Filipe Manana commented on COUCHDB-1076:
----------------------------------------

Looks good Randall, only 2 minor comments:

1) Do you have to add the skip function to couch_db_updater and export it if 
it's only used in couch_httpd_db.erl?

2) The skip function doesn't account for db reduction values created by 1.1.x 
and older releases, which are 2 tuple terms {NotDeleted, Deleted}; So either 
another clause should be added or simply do a  "Skip >= element(1, Reduction)"

Do you think there's a sane way do add some etap tests (to 020-btree-basics.t)? 
At least for the trivial cases it's easy to test (all branches skipped, or none 
of the branches is skipped)

> _all_docs performance degrades as doc_del_count increases
> ---------------------------------------------------------
>
>                 Key: COUCHDB-1076
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1076
>             Project: CouchDB
>          Issue Type: Bug
>            Reporter: Nathan Vander Wilt
>
> The time required to query _all_docs?limit=1 can be proportional to the 
> doc_del_count of the database depending on where the deleted docs are in the 
> sort order. As kocolosk explained on IRC:
> The tree that serves as the source for _all_docs contains a small record for 
> each document, deleted or not ... if you have a large number of deleted docs 
> and their IDs are interspersed with the non-deleted ones i can imagine that 
> it would cause additional seeks when streaming the _all_docs response
> In my use case (https://github.com/natevw/RQMS) and in other cases (e.g. a 
> rolling log or any long-lived database), the deleted docs may all be at the 
> beginning of the _all_docs view, making query performance end up like using 
> "?limit=N&skip=doc_del_count".
> To improve the performance in cases where large blocks of documents have been 
> deleted, kocolosk notes:
> [10:30am] kocolosk: the inner nodes in the btree currently report the 
> doc_count and doc_del_count
> [10:31am] kocolosk: we might be able to rewrite the function that walks the 
> btree so that it checks if the doc_count underneath an inner node is zero
> [10:31am] kocolosk: and then it can skip that part of the tree entirely
> [10:31am] kocolosk: instead of descending all the way to the leaf nodes and 
> skipping deleted documents one by one
> [10:31am] n[ate]vw: yeah, that'd have the benefit of being a code-only change
> [10:31am] kocolosk: right
> [10:32am] kocolosk: i think it should work rather nicely, though it probably 
> requires some interesting spelunking into the depths of couch_btree
> [10:33am] kocolosk: davisp: whaddya think?  would what i'm proposing be 
> possible? (checking for doc_count=0 in the inner node reduction and then 
> skipping ahead when serving _all_docs)
> ...
> [10:39am] davisp: n[ate]vw: Its definitely doable, its just a question of how 
> to do it so that I don't bleed from the eyes when trying to read through the 
> implementation
> ...
> [10:40am] rnewson: davisp: right. we definitely want to preserve the high 
> readability of the current btree code. /s
> [10:40am] davisp: kocolosk: The only thing I can think of is passing a 
> function to the view iteration that gets called to evaluate whether it should 
> decend to a child node based on the key/reduction pair
> [10:41am] kocolosk: davisp: what you're proposing would also allow for (more) 
> efficient implementation of skip

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

Reply via email to