[GitHub] [james-project] Arsnael commented on a change in pull request #594: ADR Deduplicated blob garbage collection with Bloom Filters

GitBox Fri, 13 Aug 2021 01:00:58 -0700


Arsnael commented on a change in pull request #594:
URL: https://github.com/apache/james-project/pull/594#discussion_r688317631




##########
File path: src/adr/0049-deduplicated-blobs-gs-with-bloom-filters.md
##########
@@ -0,0 +1,110 @@
+# 49. Deduplicated blobs garbage collection with bloom filters
+
+Date: 2021-07-21
+
+## Status
+
+Accepted (lazy consensus).
+
+Not yet implemented.
+
+Proposes a simple to implement alternative to  [39. Distributed blob garbage 
collector](0039-distributed-blob-garbage-collector.md)

Review comment:
       ```suggestion
   Proposes a simple eay to implement an alternative to  [39. Distributed blob 
garbage collector](0039-distributed-blob-garbage-collector.md)
   ```

##########
File path: src/adr/0049-deduplicated-blobs-gs-with-bloom-filters.md
##########
@@ -0,0 +1,110 @@
+# 49. Deduplicated blobs garbage collection with bloom filters
+
+Date: 2021-07-21
+
+## Status
+
+Accepted (lazy consensus).
+
+Not yet implemented.
+
+Proposes a simple to implement alternative to  [39. Distributed blob garbage 
collector](0039-distributed-blob-garbage-collector.md)
+
+## Context
+
+The body, headers, attachments of the mails are stored as blobs in a blob 
store.
+
+In order to save space in those stores, those blobs are de-duplicated using a 
hash of their content.
+
+To attain that the current blob store will read the content of the blob before 
saving it, and generate its id based on
+a hash of this content. This way two blobs with the same content will share 
the same id and thus be saved only once.
+
+This makes the safe deletion of one of those blobs a non-trivial problem as we 
can't delete one blob without ensuring
+that all references to it are themselves deleted. For example if two messages 
share the same blob, when we delete
+one message there is at the time being no way to tell if the blob is still 
referenced by another message.
+
+## Decision
+
+To solve this, we will propose a simple two steps algorithms to provide a 
background deduplication job.
+
+The **first step** consists in building a bloom filter using the entities 
referencing the blobs.
+
+In the **second step** we iterate over blobs and check in the bloom filter to 
predict if they are referenced on not.
+
+**Bloom filters** are probabilistic data structures. Here the reference 
prediction can produce false positives: we might 
+skip some non referenced blobs that should have been garbage collected. 
However, the associated probability can be tuned 
+and by adding a salt we can ensure subsequent runs will have different sets of 
false positives and thus that all blobs is
+eventually garbage collected.
+
+To avoid concurrency issues, where we could garbage collect a blob at the same 
time a new reference to it appear,
+a `reference generation` notion will be added. The de-duplicating id of the 
blobs which before where constructed
+using only the hash of their content,  will now include this `reference 
generation` too. To avoid synchronization 
+issues, the `generation` will be time based.
+
+So only blobs belonging to the `reference generation` `n-2` will be eligible 
for garbage collection to avoid 
+concurrency issues, and allow for a clock skew.
+
+Finally, we wish to offer the opportunity to configure, and reconfigure, the 
`generation` duration. In order to do so,
+we introduce a `generation family` part in the blobId. Incremented by the 
administrator on each configuration changes on
+the generation duration it allows avoiding conflicts in generations getting 
the same number before and after the change:
+all blobIds with a different family are considered belonging to a distinct 
generation ready to be garbage collected. This
+allows arbitrary changes in the generation duration.
+
+## Consequences
+
+We need to be able to list blobIds of blobs within the BlobStore. This 
operation should be supported by the blobStore 
+to prevent deduplication issues.
+
+We need to introduce a new API: `BlobReferenceSource` listing the blobIds 
currently referenced by an entity. We will
+need to implement it for each entity: Attachment, messages, mail queue and 
mail repositories.
+
+The garbage collection is a heavy-weight task whose complexity is proportional 
to the dataset size. 
+
+Regarding the garbage collection generation duration tradeoff:
+ - The longer, the longer it will take to effectively delete blobs
+ - However, the longer, the more efficient deduplication will be 
(deduplication is of course scoped to a single 
+   generation).
+ - Generation duration do not impact the overall process speed.

Review comment:
       ```suggestion
    - Generation duration does not impact the overall process speed.
   ```

##########
File path: src/adr/0049-deduplicated-blobs-gs-with-bloom-filters.md
##########
@@ -0,0 +1,110 @@
+# 49. Deduplicated blobs garbage collection with bloom filters
+
+Date: 2021-07-21
+
+## Status
+
+Accepted (lazy consensus).
+
+Not yet implemented.
+
+Proposes a simple to implement alternative to  [39. Distributed blob garbage 
collector](0039-distributed-blob-garbage-collector.md)
+
+## Context
+
+The body, headers, attachments of the mails are stored as blobs in a blob 
store.
+
+In order to save space in those stores, those blobs are de-duplicated using a 
hash of their content.
+
+To attain that the current blob store will read the content of the blob before 
saving it, and generate its id based on
+a hash of this content. This way two blobs with the same content will share 
the same id and thus be saved only once.
+
+This makes the safe deletion of one of those blobs a non-trivial problem as we 
can't delete one blob without ensuring
+that all references to it are themselves deleted. For example if two messages 
share the same blob, when we delete
+one message there is at the time being no way to tell if the blob is still 
referenced by another message.
+
+## Decision
+
+To solve this, we will propose a simple two steps algorithms to provide a 
background deduplication job.
+
+The **first step** consists in building a bloom filter using the entities 
referencing the blobs.
+
+In the **second step** we iterate over blobs and check in the bloom filter to 
predict if they are referenced on not.
+
+**Bloom filters** are probabilistic data structures. Here the reference 
prediction can produce false positives: we might 
+skip some non referenced blobs that should have been garbage collected. 
However, the associated probability can be tuned 
+and by adding a salt we can ensure subsequent runs will have different sets of 
false positives and thus that all blobs is
+eventually garbage collected.
+
+To avoid concurrency issues, where we could garbage collect a blob at the same 
time a new reference to it appear,
+a `reference generation` notion will be added. The de-duplicating id of the 
blobs which before where constructed
+using only the hash of their content,  will now include this `reference 
generation` too. To avoid synchronization 
+issues, the `generation` will be time based.
+
+So only blobs belonging to the `reference generation` `n-2` will be eligible 
for garbage collection to avoid 
+concurrency issues, and allow for a clock skew.
+
+Finally, we wish to offer the opportunity to configure, and reconfigure, the 
`generation` duration. In order to do so,
+we introduce a `generation family` part in the blobId. Incremented by the 
administrator on each configuration changes on
+the generation duration it allows avoiding conflicts in generations getting 
the same number before and after the change:
+all blobIds with a different family are considered belonging to a distinct 
generation ready to be garbage collected. This
+allows arbitrary changes in the generation duration.
+
+## Consequences
+
+We need to be able to list blobIds of blobs within the BlobStore. This 
operation should be supported by the blobStore 
+to prevent deduplication issues.
+
+We need to introduce a new API: `BlobReferenceSource` listing the blobIds 
currently referenced by an entity. We will
+need to implement it for each entity: Attachment, messages, mail queue and 
mail repositories.
+
+The garbage collection is a heavy-weight task whose complexity is proportional 
to the dataset size. 
+
+Regarding the garbage collection generation duration tradeoff:
+ - The longer, the longer it will take to effectively delete blobs
+ - However, the longer, the more efficient deduplication will be 
(deduplication is of course scoped to a single 
+   generation).
+ - Generation duration do not impact the overall process speed.
+ 
+Please note that this design do not require additional responsibilities for 
blobStore user and is thus fully transparent 
+to them.
+ 
+## Alternatives
+
+### Iterative approach
+
+[39. Distributed blob garbage 
collector](0039-distributed-blob-garbage-collector.md) proposes an alternative 
that could 
+be implemented in the future and attempts to reduce the duration of each 
iteration.
+
+However, it presents the following flows:

Review comment:
       ```suggestion
   However, it presents the following flaws:
   ```

##########
File path: src/adr/0049-deduplicated-blobs-gs-with-bloom-filters.md
##########
@@ -0,0 +1,110 @@
+# 49. Deduplicated blobs garbage collection with bloom filters
+
+Date: 2021-07-21
+
+## Status
+
+Accepted (lazy consensus).
+
+Not yet implemented.
+
+Proposes a simple to implement alternative to  [39. Distributed blob garbage 
collector](0039-distributed-blob-garbage-collector.md)
+
+## Context
+
+The body, headers, attachments of the mails are stored as blobs in a blob 
store.
+
+In order to save space in those stores, those blobs are de-duplicated using a 
hash of their content.
+
+To attain that the current blob store will read the content of the blob before 
saving it, and generate its id based on
+a hash of this content. This way two blobs with the same content will share 
the same id and thus be saved only once.
+
+This makes the safe deletion of one of those blobs a non-trivial problem as we 
can't delete one blob without ensuring
+that all references to it are themselves deleted. For example if two messages 
share the same blob, when we delete
+one message there is at the time being no way to tell if the blob is still 
referenced by another message.
+
+## Decision
+
+To solve this, we will propose a simple two steps algorithms to provide a 
background deduplication job.
+
+The **first step** consists in building a bloom filter using the entities 
referencing the blobs.
+
+In the **second step** we iterate over blobs and check in the bloom filter to 
predict if they are referenced on not.
+
+**Bloom filters** are probabilistic data structures. Here the reference 
prediction can produce false positives: we might 
+skip some non referenced blobs that should have been garbage collected. 
However, the associated probability can be tuned 
+and by adding a salt we can ensure subsequent runs will have different sets of 
false positives and thus that all blobs is
+eventually garbage collected.
+
+To avoid concurrency issues, where we could garbage collect a blob at the same 
time a new reference to it appear,
+a `reference generation` notion will be added. The de-duplicating id of the 
blobs which before where constructed
+using only the hash of their content,  will now include this `reference 
generation` too. To avoid synchronization 
+issues, the `generation` will be time based.
+
+So only blobs belonging to the `reference generation` `n-2` will be eligible 
for garbage collection to avoid 
+concurrency issues, and allow for a clock skew.
+
+Finally, we wish to offer the opportunity to configure, and reconfigure, the 
`generation` duration. In order to do so,
+we introduce a `generation family` part in the blobId. Incremented by the 
administrator on each configuration changes on
+the generation duration it allows avoiding conflicts in generations getting 
the same number before and after the change:
+all blobIds with a different family are considered belonging to a distinct 
generation ready to be garbage collected. This
+allows arbitrary changes in the generation duration.
+
+## Consequences
+
+We need to be able to list blobIds of blobs within the BlobStore. This 
operation should be supported by the blobStore 
+to prevent deduplication issues.
+
+We need to introduce a new API: `BlobReferenceSource` listing the blobIds 
currently referenced by an entity. We will
+need to implement it for each entity: Attachment, messages, mail queue and 
mail repositories.
+
+The garbage collection is a heavy-weight task whose complexity is proportional 
to the dataset size. 
+
+Regarding the garbage collection generation duration tradeoff:
+ - The longer, the longer it will take to effectively delete blobs
+ - However, the longer, the more efficient deduplication will be 
(deduplication is of course scoped to a single 
+   generation).
+ - Generation duration do not impact the overall process speed.
+ 
+Please note that this design do not require additional responsibilities for 
blobStore user and is thus fully transparent 
+to them.
+ 
+## Alternatives
+
+### Iterative approach
+
+[39. Distributed blob garbage 
collector](0039-distributed-blob-garbage-collector.md) proposes an alternative 
that could 
+be implemented in the future and attempts to reduce the duration of each 
iteration.
+
+However, it presents the following flows:
+ - Generation switch is synchronous
+ - Needs to keep track of deletions
+ - Needs to keep track of references
+ 
+Overall consistency in the face of cassandra failures to store any of the 
above had not been studied carefully.
+
+The need to track references implies that this algorithm is not transparent to 
blob store users and requires either 
+blobStore API modifications or users to actively track references. This is 
likely to not be transparent to blob store 
+users.
+
+Also, the proof of concept yield full data of a generation in memory, and thus 
this could have an impact on scalability.
+
+As such we believe that a simpler approach that could be implemented timely 
yield some benefits.
+
+This alternative can be implemented later, once the limits of the bloom filter 
approach are better known.
+
+The two algorithms are not mutually exclusive and could very well coexist if 
need be in the same codebase.
+
+### Implementation of the Bloom Filter approach with Apache Spark
+
+On other way to be solving potential scaling issues would be to run the Bloom 
Filter algorithm on a SPark cluster.

Review comment:
       ```suggestion
   An other way to be solving potential scaling issues would be to run the 
Bloom Filter algorithm on a Spark cluster.
   ```

##########
File path: src/adr/0049-deduplicated-blobs-gs-with-bloom-filters.md
##########
@@ -0,0 +1,110 @@
+# 49. Deduplicated blobs garbage collection with bloom filters
+
+Date: 2021-07-21
+
+## Status
+
+Accepted (lazy consensus).
+
+Not yet implemented.
+
+Proposes a simple to implement alternative to  [39. Distributed blob garbage 
collector](0039-distributed-blob-garbage-collector.md)
+
+## Context
+
+The body, headers, attachments of the mails are stored as blobs in a blob 
store.
+
+In order to save space in those stores, those blobs are de-duplicated using a 
hash of their content.
+
+To attain that the current blob store will read the content of the blob before 
saving it, and generate its id based on
+a hash of this content. This way two blobs with the same content will share 
the same id and thus be saved only once.
+
+This makes the safe deletion of one of those blobs a non-trivial problem as we 
can't delete one blob without ensuring
+that all references to it are themselves deleted. For example if two messages 
share the same blob, when we delete
+one message there is at the time being no way to tell if the blob is still 
referenced by another message.
+
+## Decision
+
+To solve this, we will propose a simple two steps algorithms to provide a 
background deduplication job.
+
+The **first step** consists in building a bloom filter using the entities 
referencing the blobs.
+
+In the **second step** we iterate over blobs and check in the bloom filter to 
predict if they are referenced on not.
+
+**Bloom filters** are probabilistic data structures. Here the reference 
prediction can produce false positives: we might 
+skip some non referenced blobs that should have been garbage collected. 
However, the associated probability can be tuned 
+and by adding a salt we can ensure subsequent runs will have different sets of 
false positives and thus that all blobs is
+eventually garbage collected.
+
+To avoid concurrency issues, where we could garbage collect a blob at the same 
time a new reference to it appear,
+a `reference generation` notion will be added. The de-duplicating id of the 
blobs which before where constructed
+using only the hash of their content,  will now include this `reference 
generation` too. To avoid synchronization 
+issues, the `generation` will be time based.
+
+So only blobs belonging to the `reference generation` `n-2` will be eligible 
for garbage collection to avoid 
+concurrency issues, and allow for a clock skew.
+
+Finally, we wish to offer the opportunity to configure, and reconfigure, the 
`generation` duration. In order to do so,
+we introduce a `generation family` part in the blobId. Incremented by the 
administrator on each configuration changes on
+the generation duration it allows avoiding conflicts in generations getting 
the same number before and after the change:
+all blobIds with a different family are considered belonging to a distinct 
generation ready to be garbage collected. This
+allows arbitrary changes in the generation duration.
+
+## Consequences
+
+We need to be able to list blobIds of blobs within the BlobStore. This 
operation should be supported by the blobStore 
+to prevent deduplication issues.
+
+We need to introduce a new API: `BlobReferenceSource` listing the blobIds 
currently referenced by an entity. We will
+need to implement it for each entity: Attachment, messages, mail queue and 
mail repositories.
+
+The garbage collection is a heavy-weight task whose complexity is proportional 
to the dataset size. 
+
+Regarding the garbage collection generation duration tradeoff:
+ - The longer, the longer it will take to effectively delete blobs
+ - However, the longer, the more efficient deduplication will be 
(deduplication is of course scoped to a single 
+   generation).
+ - Generation duration do not impact the overall process speed.
+ 
+Please note that this design do not require additional responsibilities for 
blobStore user and is thus fully transparent 
+to them.
+ 
+## Alternatives
+
+### Iterative approach
+
+[39. Distributed blob garbage 
collector](0039-distributed-blob-garbage-collector.md) proposes an alternative 
that could 
+be implemented in the future and attempts to reduce the duration of each 
iteration.
+
+However, it presents the following flows:
+ - Generation switch is synchronous
+ - Needs to keep track of deletions
+ - Needs to keep track of references
+ 
+Overall consistency in the face of cassandra failures to store any of the 
above had not been studied carefully.
+
+The need to track references implies that this algorithm is not transparent to 
blob store users and requires either 
+blobStore API modifications or users to actively track references. This is 
likely to not be transparent to blob store 
+users.
+
+Also, the proof of concept yield full data of a generation in memory, and thus 
this could have an impact on scalability.
+
+As such we believe that a simpler approach that could be implemented timely 
yield some benefits.
+
+This alternative can be implemented later, once the limits of the bloom filter 
approach are better known.
+
+The two algorithms are not mutually exclusive and could very well coexist if 
need be in the same codebase.
+
+### Implementation of the Bloom Filter approach with Apache Spark
+
+On other way to be solving potential scaling issues would be to run the Bloom 
Filter algorithm on a SPark cluster.
+Apache Spark would then take care of scheduling the job in a distributed 
fashion, allowing to scale the count of
+worker nodes.
+
+Requiring minimal applicative knowledge, this would be a very candidate for 
such a tooling.

Review comment:
       ```suggestion
   Requiring minimal applicative knowledge, this would be a very good candidate 
for such a tooling.
   ```

##########
File path: src/adr/0049-deduplicated-blobs-gs-with-bloom-filters.md
##########
@@ -0,0 +1,110 @@
+# 49. Deduplicated blobs garbage collection with bloom filters
+
+Date: 2021-07-21
+
+## Status
+
+Accepted (lazy consensus).
+
+Not yet implemented.
+
+Proposes a simple to implement alternative to  [39. Distributed blob garbage 
collector](0039-distributed-blob-garbage-collector.md)
+
+## Context
+
+The body, headers, attachments of the mails are stored as blobs in a blob 
store.
+
+In order to save space in those stores, those blobs are de-duplicated using a 
hash of their content.
+
+To attain that the current blob store will read the content of the blob before 
saving it, and generate its id based on
+a hash of this content. This way two blobs with the same content will share 
the same id and thus be saved only once.
+
+This makes the safe deletion of one of those blobs a non-trivial problem as we 
can't delete one blob without ensuring
+that all references to it are themselves deleted. For example if two messages 
share the same blob, when we delete
+one message there is at the time being no way to tell if the blob is still 
referenced by another message.
+
+## Decision
+
+To solve this, we will propose a simple two steps algorithms to provide a 
background deduplication job.
+
+The **first step** consists in building a bloom filter using the entities 
referencing the blobs.
+
+In the **second step** we iterate over blobs and check in the bloom filter to 
predict if they are referenced on not.
+
+**Bloom filters** are probabilistic data structures. Here the reference 
prediction can produce false positives: we might 
+skip some non referenced blobs that should have been garbage collected. 
However, the associated probability can be tuned 
+and by adding a salt we can ensure subsequent runs will have different sets of 
false positives and thus that all blobs is
+eventually garbage collected.
+
+To avoid concurrency issues, where we could garbage collect a blob at the same 
time a new reference to it appear,
+a `reference generation` notion will be added. The de-duplicating id of the 
blobs which before where constructed
+using only the hash of their content,  will now include this `reference 
generation` too. To avoid synchronization 
+issues, the `generation` will be time based.
+
+So only blobs belonging to the `reference generation` `n-2` will be eligible 
for garbage collection to avoid 
+concurrency issues, and allow for a clock skew.
+
+Finally, we wish to offer the opportunity to configure, and reconfigure, the 
`generation` duration. In order to do so,
+we introduce a `generation family` part in the blobId. Incremented by the 
administrator on each configuration changes on
+the generation duration it allows avoiding conflicts in generations getting 
the same number before and after the change:
+all blobIds with a different family are considered belonging to a distinct 
generation ready to be garbage collected. This
+allows arbitrary changes in the generation duration.
+
+## Consequences
+
+We need to be able to list blobIds of blobs within the BlobStore. This 
operation should be supported by the blobStore 
+to prevent deduplication issues.
+
+We need to introduce a new API: `BlobReferenceSource` listing the blobIds 
currently referenced by an entity. We will
+need to implement it for each entity: Attachment, messages, mail queue and 
mail repositories.
+
+The garbage collection is a heavy-weight task whose complexity is proportional 
to the dataset size. 
+
+Regarding the garbage collection generation duration tradeoff:
+ - The longer, the longer it will take to effectively delete blobs
+ - However, the longer, the more efficient deduplication will be 
(deduplication is of course scoped to a single 
+   generation).
+ - Generation duration do not impact the overall process speed.
+ 
+Please note that this design do not require additional responsibilities for 
blobStore user and is thus fully transparent 

Review comment:
       ```suggestion
   Please note that this design does not require additional responsibilities 
for blobStore user and is thus fully transparent 
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [james-project] Arsnael commented on a change in pull request #594: ADR Deduplicated blob garbage collection with Bloom Filters

Reply via email to