benjreinhart opened a new issue #14112:
URL: https://github.com/apache/superset/issues/14112


   [SIP-39 - Global Async Query 
Support](https://github.com/apache/superset/issues/9190) mentions a future 
effort around deduplicating and cancelling asynchronous queries. I'm opening 
this issue to document our implementation plans for visibility + feedback.
   
   While we explored a few options for implementation, this is our current 
preferred approach.
   
   cc active participants in SIP-39: @robdiciuccio @etr2460 @DiggidyDave 
@willbarrett @williaster 
   
   ## Overview
   
   Deduplication will require the current flow, a 1:1 mapping of query to 
client, to move to a 1:N mapping of query to clients.
   
   ### State
   
   **Existing state**
   
   The application currently performs four writes to Redis to support sync 
and async queries:
   
   - [sync/async] Write the query results (and some other metadata) to cache
   - [async] Write the query context (for validating user access when later 
requesting cached values)
   - [async] Write to global event stream (for WebSocket server)
   - [async] Write to user-specific event stream (for WebSocket server)
   
   **New state requirements**
   
   All of the above values are currently written *after* the background worker 
has executed the query. Deduplication and cancellation features will need some 
state *before* the query has executed. To track requests, we'll need additional 
state:
   
   - All job ids that are waiting for the results of a given query
   - Some job metadata such as the query key, channel id, and user id
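
A minimal sketch of what this pre-execution state could look like. The key 
prefixes (`async-query:waiting:`, `async-query:job:`) and the helper names are 
hypothetical and only illustrate the two pieces of state listed above; they are 
not existing Superset identifiers.

```python
import json
import uuid


def waiting_jobs_key(query_key: str) -> str:
    # Hypothetical Redis key for the set of job ids waiting on one query.
    return f"async-query:waiting:{query_key}"


def job_metadata_key(job_id: str) -> str:
    # Hypothetical Redis key for the JSON blob describing a single job.
    return f"async-query:job:{job_id}"


def new_job_metadata(query_key: str, channel_id: str, user_id: int) -> dict:
    # The fields mirror the metadata named above: query key, channel id, user id.
    return {
        "job_id": str(uuid.uuid4()),
        "query_key": query_key,
        "channel_id": channel_id,
        "user_id": user_id,
    }


def serialize_job(metadata: dict) -> str:
    # The blob is stored as JSON so any worker can read it back later.
    return json.dumps(metadata, sort_keys=True)
```

Both keys would be written when the request arrives, before any worker has 
picked up the query.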
   
   ### Architecture
   
   ![Async Query 
Deduplication](https://user-images.githubusercontent.com/606233/114635006-a9942900-9c78-11eb-9389-3dde03a508c4.png)
   
   - Each request generates a key that identifies its query, derived from the 
generated SQL and any other relevant parameters (e.g., impersonating user), so 
identical queries produce the same key. We refer to this as the *query key*.
   - A set of job ids is associated with the query key. These jobs are waiting 
on query completion.
   - Each job id has its own JSON blob in Redis, containing job metadata.
   - Deduplication occurs when a query request generates a query key that maps 
to an existing key in Redis with a non-empty set of job ids. In this case, a 
new job is created and added to the set of jobs waiting on that query to 
complete. No new Celery tasks are enqueued.
   - When the Celery task running the query completes, it looks up all job ids 
awaiting completion of the query. For each one, it grabs its job metadata and 
adds events to the Redis streams which will be consumed by the WebSocket server 
and forwarded to connected clients.
   - If a request comes in for a query whose results are already cached, the 
server should return the cached results.
   - Cancellation for a particular client can occur by removing its job id 
from the set. If the set of job ids awaiting query completion is empty, the 
query can be safely cancelled.
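
The flow above can be sketched roughly as follows. This is an illustration 
under stated assumptions, not the planned implementation: `Store` is an 
in-memory stand-in for Redis, the `enqueued` list stands in for enqueuing a 
Celery task, and every function and key name is hypothetical. It is 
single-threaded, so it does not show how the check-and-add step would be made 
atomic against a real Redis server, and it omits the cached-results shortcut.

```python
import hashlib
import json
import uuid
from collections import defaultdict
from typing import List, Optional


class Store:
    """In-memory stand-in for Redis: named sets plus plain key/value pairs."""

    def __init__(self) -> None:
        self.sets = defaultdict(set)
        self.values = {}


def make_query_key(sql: str, impersonated_user: Optional[str] = None) -> str:
    # Identical SQL and parameters hash to the same key, enabling deduplication.
    payload = json.dumps({"sql": sql, "user": impersonated_user}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def submit(store: Store, enqueued: List[str], sql: str, channel_id: str,
           user_id: int, impersonated_user: Optional[str] = None) -> str:
    query_key = make_query_key(sql, impersonated_user)
    job_id = str(uuid.uuid4())
    store.values[f"job:{job_id}"] = json.dumps(
        {"query_key": query_key, "channel_id": channel_id, "user_id": user_id}
    )
    waiting = store.sets[f"waiting:{query_key}"]
    is_first = not waiting  # an empty set means no task is running for this query
    waiting.add(job_id)
    if is_first:
        enqueued.append(query_key)  # stands in for enqueuing a Celery task
    return job_id


def complete(store: Store, query_key: str) -> List[dict]:
    # On completion, emit one event per waiting job and clear the state.
    events = []
    for job_id in sorted(store.sets.pop(f"waiting:{query_key}", set())):
        meta = json.loads(store.values.pop(f"job:{job_id}"))
        events.append({"job_id": job_id, "channel_id": meta["channel_id"],
                       "status": "done"})
    return events


def cancel(store: Store, query_key: str, job_id: str) -> bool:
    # Returns True when nobody is waiting any more, so the query may be cancelled.
    waiting = store.sets[f"waiting:{query_key}"]
    waiting.discard(job_id)
    store.values.pop(f"job:{job_id}", None)
    return not waiting
```

Two clients submitting the same SQL would add two job ids to one set but 
append only one entry to `enqueued`; cancelling one of them leaves the query 
running for the other.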
   
   **Pros**
   
   - Existing state in Redis does not need to be modified
   - All Redis operations are atomic, so there are no read-modify-write race 
conditions to consider
   
   **Cons**
   
   - Some cognitive overhead given the number of key/value pairs needed in Redis

