slothtopus opened a new issue, #5422:
URL: https://github.com/apache/couchdb/issues/5422

   Attachments are deduplicated correctly in a source db's revision tree, but 
are unnecessarily duplicated in a replicated target db's revision tree.
   
   ## Description
   
   We have a source db with a single document with a 1MB attachment on revision 
1. Our source db is 1MB in size. We replicate this to a new target db. The 
target db is now also 1MB in size.
   
   We make a small update to our doc in the source db. We use `"stub": true` in 
the `_attachments` section to avoid having to include the attachment once again. 
This creates revision 2 and our source db remains 1MB in size.
   
   We replicate revision 2 to the same target db. After replication the target 
db is now 2MB in size.
   
   If we repeat the process the source db remains roughly the same size, but 
the target db keeps growing in size.
   
   ## Steps to Reproduce
   ```
   # Create test DB
   curl --request PUT 'https://source-db/test'
   # {ok: true}
   
   # create doc with large attachment
   echo '{
       "count": 1,
       "_attachments": {
           "1mb_of_text.txt": {
               "data": "'"$(base64 -i 1mb_of_text.txt)"'",
               "content_type": "plain/text"
           }
       }
   }' > payload.json
   curl --request POST 'https://source-db/test' \
   --header 'Content-Type: application/json' \
   --data-binary @payload.json
   # {"ok":true,"id":"c2561fa83ffe5658a60b6ee757000e49","rev":"1-21f33fa9244958ce3dff0c8baf74626e"}
   
   # Replicate to target
   curl --request POST 'https://source-db/_replicate' \
   --header 'Content-Type: application/json' \
   --data-raw '{
       "source": "https://source-db/test",
       "target": "https://target-db/test",
       "create_target": true
   }'
   # {ok: true .... }
   
   # Source and target DBs are both roughly 1MB
   curl --request GET 'https://source-db/test'
   # "sizes":{"file":1073505,"external":1048586,"active":1049567}
   curl --request GET 'https://target-db/test'
   # "sizes":{"file":1073511,"external":1048586,"active":1051579}
   
   # Update 'count' in doc
   curl --request POST 'https://source-db/test' \
   --header 'Content-Type: application/json' \
   --data-raw '{
     "_id": "c2561fa83ffe5658a60b6ee757000e49",
     "_rev": "1-21f33fa9244958ce3dff0c8baf74626e",
     "count": 2,
     "_attachments": {
       "1mb_of_text.txt": {
         "content_type": "plain/text",
         "revpos": 1,
         "digest": "md5-XM5egJ+FKBPXlsK8GS03zA==",
         "length": 1048575,
         "stub": true
       }}}'
   # {"ok":true,"id":"c2561fa83ffe5658a60b6ee757000e49","rev":"2-630add8767fd1941c0de790cf6b41ec3"}
   
   # Replicate again to target
   curl --request POST 'https://source-db/_replicate' \
   --header 'Content-Type: application/json' \
   --data-raw '{
       "source": "https://source-db/test",
       "target": "https://target-db/test",
       "create_target": true
   }'
   # {ok: true .... }
   
   # Source DB remains roughly same size; target DB doubles in size
   curl --request GET 'https://source-db/test'
   # "sizes":{"file":1081697,"external":1048586,"active":1049814}
   curl --request GET 'https://target-db/test'
   # "sizes":{"file":2134375,"external":2097161,"active":2101837}
   ```
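
   As a rough check on the numbers above, the sketch below compares the 
target db's `external` size before and after the second replication with the 
attachment `length` from the stub. The `size_field` helper is hypothetical (a 
plain `sed` extraction, not a CouchDB tool), and the JSON fragments are copied 
from the output above rather than fetched from a live server.

   ```shell
   #!/bin/sh
   # Hypothetical helper: print one numeric field (file, external, active)
   # from a DB info response read on stdin. Plain sed, no jq required.
   size_field() {
       sed -n "s/.*\"$1\":\([0-9][0-9]*\).*/\1/p"
   }

   # "sizes" values reported for the target db in the steps above.
   target_before='"sizes":{"file":1073511,"external":1048586,"active":1051579}'
   target_after='"sizes":{"file":2134375,"external":2097161,"active":2101837}'

   before=$(printf '%s\n' "$target_before" | size_field external)
   after=$(printf '%s\n' "$target_after" | size_field external)
   growth=$((after - before))

   # "length" of the attachment stub sent in the update request.
   attachment_length=1048575

   echo "target external growth: $growth bytes"
   [ "$growth" -eq "$attachment_length" ] && echo "growth equals attachment length"
   ```

   With these figures the target's `external` size grows by 1048575 bytes, 
the same as the attachment length, while the source's `external` size is 
1048586 before and after the update. Against a live server, the same helper 
could presumably be fed from `curl -s 'https://target-db/test' | size_field external`.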
   
   ## Expected Behaviour
   
   I'm not sure if this is expected behaviour, but it seems quite undesirable. 
I would expect the same deduplication of attachments to be preserved in 
replication. Including attachments unnecessarily in replication for every small 
update seems very inefficient. Plus this opens the door for edge cases where we 
could easily run out of disk space.
   
   ## Your Environment
   
   ```json
   
   {"couchdb":"Welcome","version":"3.4.2","git_sha":"6e5ad2a5c","uuid":"4b35be14e7a0b9e04dae5320e33f0c76","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"],"vendor":{"name":"The Apache Software Foundation"}}
   ```
   
   * CouchDB version used: 3.4.2
   * Browser name and version: n/a
   * Operating system and version: macOS Ventura
   
   

