[james-project] branch master updated: [ADR] Deduplicated blob garbage collection with Bloom Filters (#594)

btellier Wed, 18 Aug 2021 20:32:31 -0700

This is an automated email from the ASF dual-hosted git repository.

btellier pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/james-project.git



The following commit(s) were added to refs/heads/master by this push:
     new 5f3405d  [ADR] Deduplicated blob garbage collection with Bloom Filters 
(#594)
5f3405d is described below

commit 5f3405dee4c8492743f5b26bf22f85a2bcd45d39
Author: Tellier Benoit <[email protected]>
AuthorDate: Thu Aug 19 10:32:17 2021 +0700

    [ADR] Deduplicated blob garbage collection with Bloom Filters (#594)
---
 src/adr/0039-distributed-blob-garbage-collector.md |   2 +
 ...049-deduplicated-blobs-gs-with-bloom-filters.md | 110 +++++++++++++++++++++
 2 files changed, 112 insertions(+)

diff --git a/src/adr/0039-distributed-blob-garbage-collector.md 
b/src/adr/0039-distributed-blob-garbage-collector.md
index 29d422b..ace4b0a 100644
--- a/src/adr/0039-distributed-blob-garbage-collector.md
+++ b/src/adr/0039-distributed-blob-garbage-collector.md
@@ -8,6 +8,8 @@ Proposed, not implemented yet.
 
 Work had been started on this topic.
 
+An alternative is proposed in [Deduplicated blobs GC with bloom 
filters](0049-deduplicated-blobs-gs-with-bloom-filters.md)
+
 ## Context
 
 The body, headers, attachments of the mails are stored as blobs in a blob 
store.
diff --git a/src/adr/0049-deduplicated-blobs-gs-with-bloom-filters.md 
b/src/adr/0049-deduplicated-blobs-gs-with-bloom-filters.md
new file mode 100644
index 0000000..0fd52ca
--- /dev/null
+++ b/src/adr/0049-deduplicated-blobs-gs-with-bloom-filters.md
@@ -0,0 +1,110 @@
+# 49. Deduplicated blobs garbage collection with bloom filters
+
+Date: 2021-07-21
+
+## Status
+
+Accepted (lazy consensus).
+
+Not yet implemented.
+
+Proposes a simpl waye to implement an alternative to  [39. Distributed blob 
garbage collector](0039-distributed-blob-garbage-collector.md)
+
+## Context
+
+The body, headers, attachments of the mails are stored as blobs in a blob 
store.
+
+In order to save space in those stores, those blobs are de-duplicated using a 
hash of their content.
+
+To attain that the current blob store will read the content of the blob before 
saving it, and generate its id based on
+a hash of this content. This way two blobs with the same content will share 
the same id and thus be saved only once.
+
+This makes the safe deletion of one of those blobs a non-trivial problem as we 
can't delete one blob without ensuring
+that all references to it are themselves deleted. For example if two messages 
share the same blob, when we delete
+one message there is at the time being no way to tell if the blob is still 
referenced by another message.
+
+## Decision
+
+To solve this, we will propose a simple two steps algorithms to provide a 
background deduplication job.
+
+The **first step** consists in building a bloom filter using the entities 
referencing the blobs.
+
+In the **second step** we iterate over blobs and check in the bloom filter to 
predict if they are referenced on not.
+
+**Bloom filters** are probabilistic data structures. Here the reference 
prediction can produce false positives: we might 
+skip some non referenced blobs that should have been garbage collected. 
However, the associated probability can be tuned 
+and by adding a salt we can ensure subsequent runs will have different sets of 
false positives and thus that all blobs is
+eventually garbage collected.
+
+To avoid concurrency issues, where we could garbage collect a blob at the same 
time a new reference to it appear,
+a `reference generation` notion will be added. The de-duplicating id of the 
blobs which before where constructed
+using only the hash of their content,  will now include this `reference 
generation` too. To avoid synchronization 
+issues, the `generation` will be time based.
+
+So only blobs belonging to the `reference generation` `n-2` will be eligible 
for garbage collection to avoid 
+concurrency issues, and allow for a clock skew.
+
+Finally, we wish to offer the opportunity to configure, and reconfigure, the 
`generation` duration. In order to do so,
+we introduce a `generation family` part in the blobId. Incremented by the 
administrator on each configuration changes on
+the generation duration it allows avoiding conflicts in generations getting 
the same number before and after the change:
+all blobIds with a different family are considered belonging to a distinct 
generation ready to be garbage collected. This
+allows arbitrary changes in the generation duration.
+
+## Consequences
+
+We need to be able to list blobIds of blobs within the BlobStore. This 
operation should be supported by the blobStore 
+to prevent deduplication issues.
+
+We need to introduce a new API: `BlobReferenceSource` listing the blobIds 
currently referenced by an entity. We will
+need to implement it for each entity: Attachment, messages, mail queue and 
mail repositories.
+
+The garbage collection is a heavy-weight task whose complexity is proportional 
to the dataset size. 
+
+Regarding the garbage collection generation duration tradeoff:
+ - The longer, the longer it will take to effectively delete blobs
+ - However, the longer, the more efficient deduplication will be 
(deduplication is of course scoped to a single 
+   generation).
+ - Generation duration does not impact the overall process speed.
+ 
+Please note that this design does not require additional responsibilities for 
blobStore user and is thus fully transparent 
+to them.
+ 
+## Alternatives
+
+### Iterative approach
+
+[39. Distributed blob garbage 
collector](0039-distributed-blob-garbage-collector.md) proposes an alternative 
that could 
+be implemented in the future and attempts to reduce the duration of each 
iteration.
+
+However, it presents the following flaws:
+ - Generation switch is synchronous
+ - Needs to keep track of deletions
+ - Needs to keep track of references
+ 
+Overall consistency in the face of cassandra failures to store any of the 
above had not been studied carefully.
+
+The need to track references implies that this algorithm is not transparent to 
blob store users and requires either 
+blobStore API modifications or users to actively track references. This is 
likely to not be transparent to blob store 
+users.
+
+Also, the proof of concept yield full data of a generation in memory, and thus 
this could have an impact on scalability.
+
+As such we believe that a simpler approach that could be implemented timely 
yield some benefits.
+
+This alternative can be implemented later, once the limits of the bloom filter 
approach are better known.
+
+The two algorithms are not mutually exclusive and could very well coexist if 
need be in the same codebase.
+
+### Implementation of the Bloom Filter approach with Apache Spark
+
+An other way to be solving potential scaling issues would be to run the Bloom 
Filter algorithm on a Spark cluster.
+Apache Spark would then take care of scheduling the job in a distributed 
fashion, allowing to scale the count of
+worker nodes.
+
+Requiring minimal applicative knowledge, this would be a very good candidate 
for such a tooling.
+
+## References
+
+ - [JIRA](https://issues.apache.org/jira/browse/JAMES-3150)
+ - [PR of this ADR](https://github.com/apache/james-project/pull/594)
+ - [Thread on server-dev mailing 
list](https://www.mail-archive.com/[email protected]/msg70734.html)
\ No newline at end of file

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[james-project] branch master updated: [ADR] Deduplicated blob garbage collection with Bloom Filters (#594)

Reply via email to