This is an automated email from the ASF dual-hosted git repository. btellier pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/james-project.git
commit 32c213998336e6db969d8b999964269a0c0e3390
Author: Benoit Tellier <[email protected]>
AuthorDate: Wed Nov 11 15:19:37 2020 +0700

    [ADR] JMAP: Avoid ElasticSearch on critical reads
---
 .../0043-avoid-elasticsearch-on-critical-reads.md  | 164 +++++++++++++++++++++
 1 file changed, 164 insertions(+)

diff --git a/src/adr/0043-avoid-elasticsearch-on-critical-reads.md b/src/adr/0043-avoid-elasticsearch-on-critical-reads.md
new file mode 100644
index 0000000..e3954ca6
--- /dev/null
+++ b/src/adr/0043-avoid-elasticsearch-on-critical-reads.md
@@ -0,0 +1,164 @@
+# 43. Avoid ElasticSearch on critical reads
+
+Date: 2020-11-11
+
+## Status
+
+Accepted (lazy consensus).
+
+Scope: Distributed James
+
+## Context
+
+A user willing to use a webmail powered by the JMAP protocol will end up doing the following operations:
+ - `Mailbox/get` to retrieve the mailboxes. This call is resolved against metadata stored in Cassandra.
+ - `Email/query` to retrieve the list of emails. This call is currently resolved on ElasticSearch for Email search after
+ a rights resolution pass against Cassandra.
+ - `Email/get` to retrieve various levels of detail. Depending on requested properties, this is either
+ retrieved from Cassandra alone or from ObjectStorage.
+
+So, ElasticSearch is queried on every JMAP interaction for listing emails. Administrators thus need to ensure availability and good performance
+for this component.
+
+Relying on more services for every read also harms our resiliency as ElasticSearch outages have major impacts.
+
+We should also mention that our ElasticSearch implementation in Distributed James suffers from the following flaws:
+ - Updates of flags lead to updates of the whole Email object, leading to sparse segments
+ - We currently rely on scrolling for JMAP (in order to ensure messageId uniqueness in the response while respecting limit & position)
+ - We noticed some very slow traces against ElasticSearch, even for simple queries.
+
+Regarding Distributed James data-stores responsibilities:
+ - Cassandra is the source of truth for metadata; its storage needs to be adapted to known access patterns.
+ - ElasticSearch allows resolution of arbitrary queries, and performs full text search.
+
+## Decision
+
+Provide an optional view for the most common `Email/query` requests on both the Draft and RFC-8621 implementations.
+This includes filters and sorts on 'sentAt'.
+
+This view will be stored in Cassandra, and updated asynchronously via a MailboxListener.
+
+## Consequences
+
+A migration task will be provided for new adopters.
+
+Administrators will be offered a configuration option to turn this view on and off as needed.
+
+If enabled, and given clients that stick to well defined `Email/query` requests, administrators would no longer need
+to ensure high availability and good performance for ElasticSearch in order to keep basic usages available
+(mailbox content listing).
+
+Given these prerequisites, we thus expect a decrease in overall ElasticSearch load, allowing savings compared
+to current deployments. Furthermore, we expect better performance by resolving such queries against Cassandra.
+
+The expected added load on Cassandra is low, as the search is a simple Cassandra read. As we only store the messageId,
+the Cassandra dataset size will only grow by a few percent if enabled.
+
+## Alternatives
+
+Those not willing to adopt this view will not be affected. By disabling the listener and the view usage, they will keep
+resolving all `Email/query` against ElasticSearch.
+
+## Example of optimized JMAP requests
+
+### A: Email list sorted by sentAt, with limit
+
+RFC-8621:
+
+```
+["Email/query",
+ {
+   "accountId": "29883977c13473ae7cb7678ef767cbfbaffc8a44a6e463d971d23a65c1dc4af6",
+   "filter": {
+     "inMailbox":"abcd"
+   },
+   "comparator": [{
+     "property":"sentAt",
+     "isAscending": false
+   }],
+   "position": 30,
+   "limit": 30
+ },
+ "c1"]
+```
+
+Draft:
+
+```
+[["getMessageList", {"filter":{"inMailboxes": ["abcd"]}, "sort": ["date desc"]}, "#0"]]
+```
+
+### B: Email list sorted by sentAt, with limit, after a given receivedAt date
+
+RFC-8621:
+
+```
+["Email/query",
+ {
+   "accountId": "29883977c13473ae7cb7678ef767cbfbaffc8a44a6e463d971d23a65c1dc4af6",
+   "filter": {
+     "inMailbox":"abcd",
+     "after": "aDate"
+   },
+   "comparator": [{
+     "property":"sentAt",
+     "isAscending": false
+   }],
+   "position": 30,
+   "limit": 30
+ },
+ "c1"]
+```
+
+Draft: Draft only exposes a single date property and thus does not differentiate sentAt from receivedAt. Draft uses sentAt
+to back the date property, thus the above request cannot be written using draft syntax.
+
+### C: Email list sorted by sentAt, with limit, after a given sentAt date
+
+Draft:
+
+```
+[["getMessageList", {"filter":{"after":"aDate", "inMailboxes": ["abcd"]}, "sort": ["date desc"]}, "#0"]]
+```
+
+RFC-8621: There is no filter property targeting "sentAt", thus the above request cannot be written.
+
+## Cassandra table structure
+
+Several tables are required in order to implement this view on top of Cassandra.
+
+Eventual denormalization consistency can be enforced by using BATCH statements.
+
+A table allows sorting the messages of a mailbox by sentAt, which allows answering A and C:
+
+```
+TABLE email_query_view_sent_at
+PRIMARY KEY mailboxId
+CLUSTERING COLUMN sentAt
+CLUSTERING COLUMN messageId
+ORDERED BY sentAt
+```
+
+A table allows filtering emails after a receivedAt date. Given a limited number of results, soft sorting and limits can
+be applied using the sentAt column.
+This allows answering B:
+
+```
+TABLE email_query_view_received_at
+PRIMARY KEY mailboxId
+CLUSTERING COLUMN receivedAt
+CLUSTERING COLUMN messageId
+COLUMN sentAt
+ORDERED BY receivedAt
+```
+
+Finally, upon deletes, receivedAt and sentAt need to be known. Thus we need to provide a lookup table:
+
+```
+TABLE email_query_view_date_lookup
+PRIMARY KEY mailboxId
+CLUSTERING COLUMN messageId
+COLUMN sentAt
+COLUMN receivedAt
+```
+
+Note that to handle position & limit, we need to fetch `position + limit` ordered items and then remove the first `position` items.
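
The position & limit handling and the A and B query shapes described in the ADR can be illustrated with a small in-memory sketch. This is not James code: `ViewRow`, `page`, `list_by_sent_at` and `list_after_received_at` are hypothetical names, plain Python lists stand in for the Cassandra tables, and `after` is assumed to be inclusive ("same or after").

```python
from datetime import datetime
from typing import List, NamedTuple


class ViewRow(NamedTuple):
    # Hypothetical in-memory stand-in for one row of the view tables.
    message_id: str
    sent_at: datetime
    received_at: datetime


def page(ordered_ids: List[str], position: int, limit: int) -> List[str]:
    """Fetch `position + limit` ordered items, then drop the first `position`."""
    fetched = ordered_ids[: position + limit]  # what a LIMIT clause would return
    return fetched[position:]


def list_by_sent_at(rows: List[ViewRow], position: int, limit: int) -> List[str]:
    """Request A: newest sentAt first, then paged."""
    ordered = sorted(rows, key=lambda r: r.sent_at, reverse=True)
    return page([r.message_id for r in ordered], position, limit)


def list_after_received_at(rows: List[ViewRow], after: datetime,
                           position: int, limit: int) -> List[str]:
    """Request B: filter on receivedAt, then soft-sort the matches by sentAt."""
    # Assumption: `after` is inclusive.
    matches = [r for r in rows if r.received_at >= after]
    return list_by_sent_at(matches, position, limit)
```

The cost of this approach grows with `position`, since the skipped items are still read; that is acceptable for the first pages of a mailbox listing, which is the access pattern the view targets.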
