[PR] JAMES-3763 Refactor blobstore api [james-project]

via GitHub Fri, 19 Jul 2024 05:02:45 -0700


jeantil opened a new pull request, #2357:
URL: https://github.com/apache/james-project/pull/2357


   When deploying the initial blob store backed mail repository we quickly 
encountered issues with `EmptyErrorMailRepositoryHealthCheck` failing despite 
not having had errors. 
   
   It turned out that our initial implementation stored all the mails from all 
the mail repositories in the same bucket with nothing to distinguish them. 
Therefore every `MailRepository` for every `MailRepositoryUrl` listed all the 
mails from all the urls and the size accounted for all mails stored in all the 
mail repositories hence making the health check fail as soon as any mail 
repository contained a mail. 
   
   We managed to work around part of this problem by creating a mime message 
store with a dedicated blob store every time we created a mail repository for a 
url (see 
[`BlobMailRepositoryFactory`](https://github.com/apache/james-project/commit/9dfb5d5408390bc049976a70cbb3f7457e8794ef#diff-24ee66d7dc3eefc5f2493c144d59475af0e3c50d6baa30e6401bee36a8b91d1dR73)).
 
   Additionally we wrapped the underlying `BlobIdFactory` with our own 
[`MailRepositoryBlobId`](https://github.com/apache/james-project/commit/9dfb5d5408390bc049976a70cbb3f7457e8794ef#diff-24ee66d7dc3eefc5f2493c144d59475af0e3c50d6baa30e6401bee36a8b91d1dR29)
 to ensure the id were properly nested under a path matching the mail 
repository url. 
   
   With this we were able to solve the list and size problems by filtering out 
items that didn't match the current mail repository url.  
   
   In a future change we intend to extend the `BlobstoreDao` api to allow 
passing a sub path to the underlying driver. This would allow to avoid 
iterating over all the items of the bucket only to filter most of them out.
   
   We preferred this over creating buckets for each url because not every 
bucket provider allows for arbitrary bucket creation and our strategy works 
even if the urls are eventually stored in different buckets.
   
   We then stumbled upon another problem though I don't remember exactly how. 
Some parts of james expect that storing the same mail multiple times returns 
the same `Mailkey`. All previous implementations of mail repository already 
honor this by computing the mail key as `Mailkey.forMail(mail)` and returning 
the generated mail key. We added [a 
test](https://github.com/apache/james-project/commit/bd5b9a67ddfebe86983ce28e38d6952c144ed593#diff-f85ec649d3b197927fd49a26ee0c659a239073e80c8c47b0143136587f19d60eR460)
 to explicitely document this property.
   
   With the way the `Blobstore` API was designed, it was not possible to 
compute the `BlobId` externally to the blob store. Therefore it was impossible 
for a mail repository using the `PassthroughBlobstore` to honor this constraint 
because the pass through blob store would generate a different random blob id 
for each save. It was also impossible the `DeduplicationBlobStore` since it 
hashes the whole mail content, the mail can be mutated (by mailets for 
instance) and saving the same mail multiple time would thus result in a 
different blob id if any part of the mail had been mutated in between the 2 
saves.
   
   We thus undertook to change the API of the `BlobStore` to allow providing a 
way to compute a blob id on every call while maintaining retrocompatibility 
with existing callers. This change is done in [ES-3763 Changes Blobstore api to 
enable deterministic BlobId 
generation](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519).
 You probably want to first have a look at the 
[`BlobStore`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-467d5afd98ab352519efdbbd49d57ecbddb6239083cc6ebf0ed2541b52392cb6)
 interface. Then see how we propagated these changes to each of the 4 
implementations: 
   - 
[`DeduplicationBlobStore`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-bd94a7cf5577caebb0f006d85c11b92b8d8d5db210c932ee9aa9c70552f1bfaa)
   - 
[`PassthroughBlobStore`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-c87a5eadd2fac4b2593635bc6b5659d13c2b38b490908436277ff987e82f6977)
   - 
[`MetricableBlobStore`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-3a0b7b8f7a1d830a2eb7e52a9123f95023e3de1a1984438d8ad040ae2c7deb4d)
   - 
[`CachedBlobStore`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-2871b3f2e76630dc9bbb91125230d48fa8cd16d28fe66d9128ff071824c063a3)
   
   Doing so we experimented with using default implementations in the 
interface, this required exposing a `defaultBlobIdProvider` method in the 
interface that was then accessible to the whole world. Since we can´t have 
protected methods in a Java interface we couldn´t hide this implementation 
detail from the outside world and decided instead to do the implementation in 
each of the implementors, It makes the `CachedBlobStore` implementation less 
elegant (it can probably still be improved upon) but we think the benefits of 
keeping the interface clean are more important. 
   Changing this allowed us to clean up the 
[`BlobId`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-9fc865592b15120099aeb6be558096df780843bf64e16dc6989c0fd8a893d335)
 interface : 
   - `forPayload` is only actually useful for the DeduplicationBlobStore, so we 
moved the implementation to that blob store
   - `randomId` is a small helper to wrap a random value in the `BlobId` 
implementation. It was only used in the 
[`PassthroughBlobStore`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-c87a5eadd2fac4b2593635bc6b5659d13c2b38b490908436277ff987e82f6977R92)
 where it could trivially be replaced in the default blob id provider and in 
tests. We feel that it is not the `BlobId` interface responsibility to expose 
this helper. For now we inlined it everywhere but it might make sense to 
resurrect it on `TestBlobId` ( which could now be a trivial extension of 
`PlainBlobId`)
   
   We spent quite a bit of time reworking our changes to make them more 
digestible. 
   
   The PR starts with 2 cleanup commits that eliminate or rationalize many 
calls to either `forPayload` or to `randomId`
   - `forPayload` wad generally used to generate arbitrary blob ids 
   - `randomId` calls were inlined, in a few cases randomId was used to provide 
an arbitrary id, we replaced these with static explicit values as is usually 
the case in the james codebase,
   
   Then comes the big change to the `BlobStore` API which we discuss 
extensively above.
   
   The next commit restores a check that used to exist around HashBlobId, it is 
only relevant to the S3 implementation which has specific contraints on the 
blob id values (they need to be compatible with url paths) We chose to extend 
this constraint to all deduplication implementations by adding it to the 
[`DeduplicationBlobStoreContract`](https://github.com/apache/james-project/commit/87cfdf11edef6dea5739cd5fc966cd37cdb6c5ea#diff-162ecf29eb8300288c33ad95247a50322e59e3b630b095673bc5c4389f35ec0b)
   
   We then use the newly added changes in `BlobMailRepository` in [JAMES-3763 
Use Blobstore#save with deterministic BlobId in 
BlobMailRepository](https://github.com/apache/james-project/commit/559ad062b22a5f391f2552aa2d4a82e52755016a)
   
   The next 3 commits are cleanups : we rename concepts that have changed or 
been refined and we finish by switching from a POJO to a record. 
   
   Thank you for reading this long message to its conclusion !
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] JAMES-3763 Refactor blobstore api [james-project]

Reply via email to