[ https://issues.apache.org/jira/browse/COUCHDB-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13924781#comment-13924781 ]

Robert Newson commented on COUCHDB-2102:
----------------------------------------

The replicator is more efficient at replicating documents without attachments 
than documents with them. For docs without attachments, it uses _bulk_docs, 
sending hundreds of docs at once. Because it updates in bulk, it generates 
less garbage, almost as if it were building the post-compaction structure 
directly. Docs with attachments, by contrast, are written as separate 
multipart/related requests, which generates more garbage.
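As a rough illustration of the bulk path (a sketch, not the replicator's actual internals), the efficient case amounts to posting many documents in one _bulk_docs request; the doc list here is hypothetical:

```python
import json

def bulk_docs_payload(docs):
    # POST /{db}/_bulk_docs writes many documents in a single request.
    # The replicator sends new_edits=false so the documents keep their
    # existing revision IDs on the target instead of getting new ones.
    return json.dumps({"docs": docs, "new_edits": False})

# Hypothetical batch of docs carrying their source revisions.
docs = [{"_id": "doc%d" % i, "_rev": "1-abc", "value": i} for i in range(3)]
payload = bulk_docs_payload(docs)
```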

The true fix is to make the replicator smarter about attachments so that we can 
bulk-transfer groups of those as well.

All this said, compaction remains necessary (and you can run it at any 
time during the replication). Finally, there's general room for improvement 
here (one of my first tickets, COUCHDB-220, was on this theme).
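For reference, compaction is triggered with a POST to the database's _compact endpoint (admin credentials and a JSON content type are required). A minimal sketch of assembling that request, using a hypothetical local replica name:

```python
def compact_request(base_url, db):
    # CouchDB starts compaction via POST /{db}/_compact. The request can
    # be issued while replication into the database is still running;
    # compaction rewrites the file and reclaims the accumulated garbage.
    return {
        "method": "POST",
        "url": "%s/%s/_compact" % (base_url, db),
        "headers": {"Content-Type": "application/json"},
    }

# Hypothetical downstream replica on a local CouchDB instance.
req = compact_request("http://localhost:5984", "users_replica")
```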

> Downstream replicator database bloat
> ------------------------------------
>
>                 Key: COUCHDB-2102
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2102
>             Project: CouchDB
>          Issue Type: Bug
>      Security Level: public(Regular issues) 
>          Components: Replication
>            Reporter: Isaac Z. Schlueter
>
> When I do continuous replication from one db to another, I get a lot of bloat 
> over time.
> For example, replicating a _users db with a relatively low level of writes, 
> and around 30,000 documents, the size on disk of the downstream replica was 
> over 300MB after 2 weeks.  I compacted the DB, and the size dropped to about 
> 20MB (slightly smaller than the source database).
> Of course, I realize that I can configure compaction to happen regularly.  
> But this still seems like a rather excessive tax.  It is especially shocking 
> to users who are replicating a 100GB database full of attachments, and find 
> it grow to 400GB if they're not careful!  You can easily end up in a 
> situation where you don't have enough disk space to successfully compact.
> Is there a fundamental reason why this happens?  Or has it simply never been 
> a priority?  It'd be awesome if replication were more efficient with disk 
> space.



--
This message was sent by Atlassian JIRA
(v6.2#6252)