jeantil opened a new pull request, #2357: URL: https://github.com/apache/james-project/pull/2357
When deploying the initial blob store backed mail repository we quickly encountered issues with `EmptyErrorMailRepositoryHealthCheck` failing despite not having had errors. It turned out that our initial implementation stored all the mails from all the mail repositories in the same bucket with nothing to distinguish them. Therefore every `MailRepository` for every `MailRepositoryUrl` listed all the mails from all the urls and the size accounted for all mails stored in all the mail repositories hence making the health check fail as soon as any mail repository contained a mail. We managed to work around part of this problem by creating a mime message store with a dedicated blob store every time we created a mail repository for a url (see [`BlobMailRepositoryFactory`](https://github.com/apache/james-project/commit/9dfb5d5408390bc049976a70cbb3f7457e8794ef#diff-24ee66d7dc3eefc5f2493c144d59475af0e3c50d6baa30e6401bee36a8b91d1dR73)). Additionally we wrapped the underlying `BlobIdFactory` with our own [`MailRepositoryBlobId`](https://github.com/apache/james-project/commit/9dfb5d5408390bc049976a70cbb3f7457e8794ef#diff-24ee66d7dc3eefc5f2493c144d59475af0e3c50d6baa30e6401bee36a8b91d1dR29) to ensure the id were properly nested under a path matching the mail repository url. With this we were able to solve the list and size problems by filtering out items that didn't match the current mail repository url. In a future change we intend to extend the `BlobstoreDao` api to allow passing a sub path to the underlying driver. This would allow to avoid iterating over all the items of the bucket only to filter most of them out. We preferred this over creating buckets for each url because not every bucket provider allows for arbitrary bucket creation and our strategy works even if the urls are eventually stored in different buckets. We then stumbled upon another problem though I don't remember exactly how. Some parts of james expect that storing the same mail multiple times returns the same `Mailkey`. All previous implementations of mail repository already honor this by computing the mail key as `Mailkey.forMail(mail)` and returning the generated mail key. We added [a test](https://github.com/apache/james-project/commit/bd5b9a67ddfebe86983ce28e38d6952c144ed593#diff-f85ec649d3b197927fd49a26ee0c659a239073e80c8c47b0143136587f19d60eR460) to explicitely document this property. With the way the `Blobstore` API was designed, it was not possible to compute the `BlobId` externally to the blob store. Therefore it was impossible for a mail repository using the `PassthroughBlobstore` to honor this constraint because the pass through blob store would generate a different random blob id for each save. It was also impossible the `DeduplicationBlobStore` since it hashes the whole mail content, the mail can be mutated (by mailets for instance) and saving the same mail multiple time would thus result in a different blob id if any part of the mail had been mutated in between the 2 saves. We thus undertook to change the API of the `BlobStore` to allow providing a way to compute a blob id on every call while maintaining retrocompatibility with existing callers. This change is done in [ES-3763 Changes Blobstore api to enable deterministic BlobId generation](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519). You probably want to first have a look at the [`BlobStore`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-467d5afd98ab352519efdbbd49d57ecbddb6239083cc6ebf0ed2541b52392cb6) interface. Then see how we propagated these changes to each of the 4 implementations: - [`DeduplicationBlobStore`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-bd94a7cf5577caebb0f006d85c11b92b8d8d5db210c932ee9aa9c70552f1bfaa) - [`PassthroughBlobStore`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-c87a5eadd2fac4b2593635bc6b5659d13c2b38b490908436277ff987e82f6977) - [`MetricableBlobStore`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-3a0b7b8f7a1d830a2eb7e52a9123f95023e3de1a1984438d8ad040ae2c7deb4d) - [`CachedBlobStore`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-2871b3f2e76630dc9bbb91125230d48fa8cd16d28fe66d9128ff071824c063a3) Doing so we experimented with using default implementations in the interface, this required exposing a `defaultBlobIdProvider` method in the interface that was then accessible to the whole world. Since we can´t have protected methods in a Java interface we couldn´t hide this implementation detail from the outside world and decided instead to do the implementation in each of the implementors, It makes the `CachedBlobStore` implementation less elegant (it can probably still be improved upon) but we think the benefits of keeping the interface clean are more important. Changing this allowed us to clean up the [`BlobId`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-9fc865592b15120099aeb6be558096df780843bf64e16dc6989c0fd8a893d335) interface : - `forPayload` is only actually useful for the DeduplicationBlobStore, so we moved the implementation to that blob store - `randomId` is a small helper to wrap a random value in the `BlobId` implementation. It was only used in the [`PassthroughBlobStore`](https://github.com/apache/james-project/commit/dab9488beb7537586458c0d554a52432662e6519#diff-c87a5eadd2fac4b2593635bc6b5659d13c2b38b490908436277ff987e82f6977R92) where it could trivially be replaced in the default blob id provider and in tests. We feel that it is not the `BlobId` interface responsibility to expose this helper. For now we inlined it everywhere but it might make sense to resurrect it on `TestBlobId` ( which could now be a trivial extension of `PlainBlobId`) We spent quite a bit of time reworking our changes to make them more digestible. The PR starts with 2 cleanup commits that eliminate or rationalize many calls to either `forPayload` or to `randomId` - `forPayload` wad generally used to generate arbitrary blob ids - `randomId` calls were inlined, in a few cases randomId was used to provide an arbitrary id, we replaced these with static explicit values as is usually the case in the james codebase, Then comes the big change to the `BlobStore` API which we discuss extensively above. The next commit restores a check that used to exist around HashBlobId, it is only relevant to the S3 implementation which has specific contraints on the blob id values (they need to be compatible with url paths) We chose to extend this constraint to all deduplication implementations by adding it to the [`DeduplicationBlobStoreContract`](https://github.com/apache/james-project/commit/87cfdf11edef6dea5739cd5fc966cd37cdb6c5ea#diff-162ecf29eb8300288c33ad95247a50322e59e3b630b095673bc5c4389f35ec0b) We then use the newly added changes in `BlobMailRepository` in [JAMES-3763 Use Blobstore#save with deterministic BlobId in BlobMailRepository](https://github.com/apache/james-project/commit/559ad062b22a5f391f2552aa2d4a82e52755016a) The next 3 commits are cleanups : we rename concepts that have changed or been refined and we finish by switching from a POJO to a record. Thank you for reading this long message to its conclusion ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
