Last week I had a Slack call with Rodric about the AttachmentStore PR, and
as part of that we discussed the problem of handling concurrent
updates of attachments. The details below are based on that discussion.
As of now, CouchDB can detect concurrent updates of an attachment thanks to
its built-in MVCC support. However, most of the object stores (S3/IBM
COS, Azure Blob Store, etc.) that are to be used for the new
AttachmentStore implementation do not provide any conditional update
and are designed more for immutable storage.
Consider an action update sequence, which is currently done in 2 parts:
1. Update the document
2. Upload the attachment
Now consider an AttachmentStore implementation (as per the PR #3453
design) which stores attachment content against a key like
whiskentity/<doc id>/<attachment name>
where
1. whiskentity - Key prefix for attachments related to Whisk entities
2. <doc id> - Id of the document to which the attachment is attached
3. <attachment name> - Name of the attachment, like `jarfile`
Object stores are optimized for direct key lookup and also allow
searches based on key prefix. Hence the use of such a format: it
allows a direct attachment lookup for readAttachment, and a lookup of all
attachments related to a specific doc for deleteAttachments.
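
As a rough illustration of that key layout, here is a minimal sketch. The
object and helper names (`AttachmentKeys`, `attachmentKey`, `docPrefix`,
`keysWithPrefix`) are made up for this mail and are not part of the PR; a
plain Map stands in for the object store listing call.

```scala
object AttachmentKeys {
  // Key prefix taken from the format described above.
  val Prefix = "whiskentity"

  // Key for a single attachment: whiskentity/<doc id>/<attachment name>
  def attachmentKey(docId: String, name: String): String =
    s"$Prefix/$docId/$name"

  // Prefix shared by all attachments of one document. The trailing slash
  // keeps doc "ns/hello" from also matching doc "ns/hello2".
  def docPrefix(docId: String): String = s"$Prefix/$docId/"

  // Stand-in for a "list by prefix" call on an object store.
  def keysWithPrefix(store: Map[String, Array[Byte]], prefix: String): Set[String] =
    store.keySet.filter(_.startsWith(prefix))
}
```

readAttachment would use `attachmentKey` directly, while deleteAttachments
would list with `docPrefix` and delete every key returned.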
Now consider the following flow:
1. thread 1: updates the document and succeeds
2. thread 2: updates the document (based on thread 1's update) and succeeds
3. thread 2: attaches, i.e. writes an attachment to the AttachmentStore
4. thread 1: attaches
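This results in a race condition: both threads write to the same key,
so whichever attachment is written last wins. In the end the attachment
meant for the document state of step 1 gets linked to the document at the
state of step 2. To handle such cases we should switch to an immutable
attachment design.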
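
The interleaving can be replayed deterministically with a small sketch.
Everything here is hypothetical: plain mutable maps stand in for the
document store and the object store, and the "doc v1"/"jar v1" strings are
placeholders for real content.

```scala
import scala.collection.mutable

object MutableKeyRace {
  private def key(docId: String, name: String) = s"whiskentity/$docId/$name"

  // Replays steps 1-4 in order and returns (final document, final attachment).
  def replay(): (String, String) = {
    val docs = mutable.Map.empty[String, String]  // doc id -> document body
    val blobs = mutable.Map.empty[String, String] // key -> attachment content
    val id = "ns/hello"

    docs(id) = "doc v1"                  // 1. thread 1 updates the document
    docs(id) = "doc v2"                  // 2. thread 2 updates the document
    blobs(key(id, "jarfile")) = "jar v2" // 3. thread 2 attaches
    blobs(key(id, "jarfile")) = "jar v1" // 4. thread 1 attaches, overwriting step 3

    (docs(id), blobs(key(id, "jarfile")))
  }
}
```

`MutableKeyRace.replay()` ends with the document from thread 2 paired with
the attachment from thread 1, and nothing in the object store signals the
overwrite.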
A - Proposal - Use Immutable Attachments
----------------------------------------------------
In the current flow we perform an "update" of an existing attachment with a
given name. For example, the action update flow currently looks like this:
1. Put document with attachment info

    "exec": {
      "kind": "java",
      "code": {
        "attachmentName": "jarfile",
        "attachmentType": "application/java-archive"
      },
      "binary": true,
      "main": "Hello"
    }

2. Attach the attachment with name set to value of `attachmentName`
Instead, we should let `ArtifactStore` (which in turn relies on
AttachmentStore) generate the name and then save that name against
`attachmentName`. So the proposed flow is:
1. Upload the attachment and have ArtifactStore return a generated name

    protected[core] def attach(doc: DocInfo, contentType: ContentType,
                               docStream: Source[ByteString, _])(
      implicit transid: TransactionId): Future[(DocInfo, AttachmentName)]

2. Then update the document with attachmentName set to the name returned
in the previous step
3. Then delete the old attachment after #2 completes successfully
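
To make the three steps concrete, here is a minimal in-memory sketch. It
is not the ArtifactStore API: `Doc`, `Store`, `putDoc`, `update` and the
UUID-based naming are all assumptions for illustration, and the revision
check in `putDoc` merely stands in for the MVCC check CouchDB performs.

```scala
import java.util.UUID
import scala.collection.mutable

object ImmutableAttachments {
  // Hypothetical document shape: a revision plus the generated attachment name.
  final case class Doc(rev: Int, attachmentName: Option[String])

  final class Store {
    private val docs = mutable.Map.empty[String, Doc]
    private val blobs = mutable.Map.empty[String, String]
    private def key(id: String, name: String) = s"whiskentity/$id/$name"

    // Step 1: store content under a freshly generated name; never overwrites.
    def attach(id: String, content: String): String = synchronized {
      val name = UUID.randomUUID().toString
      blobs(key(id, name)) = content
      name
    }

    // Step 2: conditional update; succeeds only if the caller saw the latest revision.
    def putDoc(id: String, seen: Doc, name: String): Boolean = synchronized {
      val current = docs.getOrElse(id, Doc(0, None))
      if (current.rev != seen.rev) false
      else { docs(id) = Doc(seen.rev + 1, Some(name)); true }
    }

    // Step 3: remove an attachment that is no longer referenced.
    def deleteAttachment(id: String, name: String): Unit = synchronized {
      blobs.remove(key(id, name)); ()
    }

    def getDoc(id: String): Option[Doc] = synchronized(docs.get(id))
    def read(id: String, name: String): Option[String] =
      synchronized(blobs.get(key(id, name)))
  }

  // Attach first, then update the document, then delete the old attachment.
  def update(store: Store, id: String, seen: Doc, content: String): Boolean = {
    val newName = store.attach(id, content)
    val updated = store.putDoc(id, seen, newName)
    if (updated) seen.attachmentName.foreach(store.deleteAttachment(id, _))
    updated
  }
}
```

If two writers start from the same revision, only one `putDoc` succeeds
and the document keeps pointing at that writer's attachment. The losing
writer's upload is left behind unreferenced, which is the orphan case
covered in section B.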
With this approach the attachments would be immutable, and that would enable:
1. Proper handling of concurrent updates
2. Simplified caching of attachments, as immutable objects can be cached easily
B - Orphaned Blob Garbage Collection
----------------------------------------------
With the above approach there is a possibility that an action update flow
is interrupted partway through, leaving orphaned blob instances in the
object store. To clean them up we can implement garbage collection logic
as part of wskadmin.
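
One possible shape for that cleanup, sketched under assumptions: `Blob`,
`orphans` and the `minAge` grace period are hypothetical and not an
existing wskadmin feature. The grace period is there because, in the
proposed flow, an attachment is uploaded before the document references it,
so a freshly uploaded blob must not be treated as an orphan.

```scala
import java.time.{Duration, Instant}

object OrphanGc {
  // Hypothetical listing entry: object key plus its last-modified time.
  final case class Blob(key: String, lastModified: Instant)

  // Keys that no document references and that are at least minAge old.
  def orphans(stored: Seq[Blob],
              referenced: Set[String],
              now: Instant,
              minAge: Duration): Seq[String] =
    stored.collect {
      case b if !referenced.contains(b.key) &&
        Duration.between(b.lastModified, now).compareTo(minAge) >= 0 => b.key
    }
}
```

The `referenced` set would be built by scanning the documents for their
`attachmentName` values and mapping them to keys; the keys returned by
`orphans` are the ones that can be deleted.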
Please share your feedback on the new proposal. I will start work
on a PR for it so that it's easier to discuss specific
semantics. Once this work is done we can come back to the AttachmentStore
PR and implement it as per the newer flow.
Chetan Mehrotra