Github user davisp commented on the pull request:
https://github.com/apache/couchdb-couch/pull/24#issuecomment-67239293
That looks pretty good but there's still an issue with the counting of
active size. Unfortunately I think the fix is going to require us to start
using @strmpnk's new attachments module which appears to introduce some new
issues once we start using extended attributes.
The issue is that when we update documents we don't know if the doc on disk
is de-duplicated or not. So if we go to edit a doc that has a de-duplciated
attachment will suddenly start double counting the bytes on disk which breaks
our active size information. A test case would be something like this:
1. Create doc A with attachment
2. Create doc B with same attachment that will be de-duped
3. Compact the database
4. Check our active size
5. Modify doc B (in a way that keeps the de-duped attachment without
copying it)
6. Check that active size isn't increased by more than the attachment
size
I'd suggest making sure that the attachment is something like 64K so that
its obvious if we accidentally started double counting (modifying doc B should
increase active size slightly by definition).
The fix here I think is to use the couch_att module's ability to store
extended attributes that shows if the attachment was deduplicated.
Theoretically wrapping up some of the compaction code will clean up that logic
anyway so that's a good thing.
Unfortunately it looks like we haven't yet started accounting for
attachments with extended attributes in the compactor. Our current attachment
terms are those unwiedly 8-tuples. Anything with extended attributes will turn
that into a 2-tuple with the first element being the 8-tuple, and the second
element being a proplist of key/value pairs for the attachment. Theoretically
we should be able to use this to store if the attachment was de-duplicated at
compaction time but we'll have to make sure it works with compactions and
upgrades which it appears hasn't been completely vetted yet.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---