robdiciuccio opened a new issue #9190: [SIP-39] Global Async Query Support
URL: https://github.com/apache/incubator-superset/issues/9190
 
 
   ## [SIP-39] Global Async Query Support
   
   ### Motivation
   
   - Loading time on dashboards with many charts is excessive due to the number 
of concurrent connections to the server.
   - Synchronous connections left open waiting for result data to be returned 
block other requests to the server.
   - Browsers limit the number of concurrent connections to the same host to 
six, causing bottlenecks.
   - Some of this bottleneck is currently mitigated via the dashboard tabs 
feature, which defers loading of charts in background tabs. This lazy-loading 
could also be applied to charts outside of the viewport.
   - [There is a proposal for SQL Lab to move away from tabs in the 
UI](https://github.com/apache/incubator-superset/issues/8655), altering the 
current async query polling mechanisms.
   - A standardized query interface for dashboards, charts and SQL Lab 
queries is desirable.
   
   ### Proposed Change
   
   Provide a configuration setting to enable async data loading for charts 
(dashboards, Explore) and SQL Lab. Instead of multiple synchronous requests to 
the server to load dashboard charts, we issue async requests, each of which 
enqueues a background job and returns a job ID. A single persistent websocket 
connection to the server is opened to listen for results in real time.
   
   ### Websockets via a sidecar application
   
   **Pros**
   
   - Excellent websocket browser support
   - Realtime synchronous communication channel
   - Browser connection limits not an issue ([255 in 
Chrome](https://bugs.chromium.org/p/chromium/issues/detail?id=429419))
   - Supports bi-directional communication (possible future use cases)
   - Authentication/authorization via an initial HTTP request that is then 
upgraded to a WSS connection
   
   **Cons**
   
   - Minor modifications to proxies and load balancers to support websocket 
connections
       - Sticky sessions not required at the load balancer if the reconnection 
story is sound
   - Requires a persistent connection to the server for *each tab*
       - 
[SharedWorker](https://developer.mozilla.org/en-US/docs/Web/API/SharedWorker) 
also evaluated, but support is lacking
   - AWS "Classic" ELB [does not support 
Websockets](https://aws.amazon.com/elasticloadbalancing/features/). Requires 
use of NLB (TCP) or ALB in AWS environments.
   
   ![Superset async query arch - Websockets v2 
(2)](https://user-images.githubusercontent.com/296227/75080027-9f84f900-54bf-11ea-8494-ebd531e33450.png)
   
   ### Approach
   
   Each open tab of Superset would create a unique "channel" ID to subscribe 
to. A websocket connection is established with the standalone websocket server 
as an HTTP request that is then upgraded to `wss://` if authentication with the 
main Flask app is successful. Requests for charts or queries in SQL Lab are 
sent via HTTP to the Flask web app, including the tab's channel ID. The server 
enqueues a celery job and returns a job ID to the caller. When the job 
completes, a notification is sent over the WSS connection, which triggers 
another HTTP request to fetch the cached data and display the results.
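
   The request/response half of this flow can be sketched as follows. This is 
a minimal illustration, not Superset's actual API: `enqueue_task`, the payload 
shape, and the response keys are all hypothetical.

   ```python
   import uuid

   def handle_async_query_request(enqueue_task, channel_id, query_payload):
       """Core logic of a hypothetical async query endpoint: enqueue a
       background job and immediately return a job ID to the caller."""
       job_id = str(uuid.uuid4())
       # The worker needs the channel ID so it can notify the right
       # websocket subscriber when the job completes.
       enqueue_task(job_id=job_id, channel_id=channel_id, payload=query_payload)
       return {"job_id": job_id, "channel_id": channel_id, "status": "pending"}
   ```

   In a real implementation `enqueue_task` would be a Celery task invocation 
(`.delay()`/`.apply_async()`), and the returned dict would be serialized as 
the JSON response from the Flask endpoint.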
   
   **Why a separate application?**
   
   The current Flask application is not well suited for persistent websocket 
connections. We have evaluated several Python and Flask-based solutions, 
including [flask-SocketIO](https://flask-socketio.readthedocs.io/en/latest/) 
and [others](https://pgjones.gitlab.io/quart/), and have found the 
architectural changes to the Flask web app to be overly invasive. For that 
reason, we propose that the websocket service be a standalone application, 
decoupled from the main Flask application, with minimal integration points. 
Superset's extensive use of Javascript and the mature Node.js [websocket 
libraries](https://www.npmjs.com/package/ws) make Node.js and TypeScript 
([SIP-36](https://github.com/apache/incubator-superset/issues/9101)) an obvious 
choice for implementing the sidecar application.
   
   **Reconnection**
   
   The nature of persistent connections is that they will, at some point, 
disconnect. The system should be able to reconnect and "catch up" on any missed 
events. We evaluated several PubSub solutions (mainly in Redis) that could 
enable this durable reconnection story, and have determined that [Redis 
Streams](https://redis.io/topics/streams-intro) (Redis ≥ 5.0) fits this use 
case well. By storing a last received event ID, the client can pass that ID 
when reconnecting to fetch all messages in the channel from that point forward. 
For security reasons, we should periodically force the client to reconnect to 
revalidate authentication status.
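
   The catch-up step on reconnect could look roughly like this. The stream key 
scheme and function names are illustrative assumptions; `redis_client` stands 
in for anything exposing redis-py's `xread`.

   ```python
   def fetch_missed_events(redis_client, channel_id, last_event_id="0-0"):
       """Replay all events published to a channel's stream after
       last_event_id, returning the events and the new cursor position."""
       stream_key = f"async-events:{channel_id}"  # hypothetical key scheme
       # XREAD returns entries with IDs strictly greater than the ID we
       # pass, so a reconnecting client receives exactly what it missed.
       results = redis_client.xread({stream_key: last_event_id})
       events = []
       for _stream, entries in results:
           for event_id, fields in entries:
               events.append((event_id, fields))
               last_event_id = event_id
       return events, last_event_id
   ```

   The client persists the returned `last_event_id` and passes it on the next 
reconnect; passing `0-0` replays the stream from the beginning.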
   
   **Why not send result data over the socket connection?**
   
   While it is possible to send result data over the websocket connection, 
keeping the scope of the standalone service to event notifications will reduce 
the security footprint of the sidecar application. Fetching (potentially 
sensitive) data will still require necessary authentication and authorization 
checks at load time by routing through the Flask web app. Sending large 
datasets over the websocket protocol also introduces performance and 
consistency risks of its own. Websockets are not streams, and "[the client 
will only be notified about a new message once all of the frames have been 
received and the original message payload has been 
reconstructed.](https://codeburst.io/polling-vs-sse-vs-websocket-how-to-choose-the-right-one-1859e4e13bd9)"
   
   **Query Cancellation**
   
   Queries may be "cancelled" by calling the `/superset/stop_query` endpoint in 
SQL Lab, which simply sets `query.status = QueryStatus.STOPPED` for the running 
query. This cancellation logic is currently implemented only for queries 
running against 
[hive](https://github.com/apache/incubator-superset/blob/607cfd1f29736590fbba397c4f8a04526be66aff/superset/db_engine_specs/hive.py#L264)
 and 
[presto](https://github.com/apache/incubator-superset/blob/607cfd1f29736590fbba397c4f8a04526be66aff/superset/db_engine_specs/presto.py#L730)
 databases. Queries that have been enqueued could be cancelled prior to 
executing the query by adding a check in the celery worker logic. It is also 
possible to [revoke a Celery 
task](https://docs.celeryproject.org/en/stable/userguide/workers.html#revoke-revoking-tasks),
 which will skip execution of the task, but it won’t terminate an already 
executing task. Due to the limited query cancellation support in DB-API 
drivers, some of which is [discussed 
here](https://github.com/apache/incubator-superset/issues/747), comprehensive 
query cancellation functionality should be explored in a separate SIP. That 
said, query cancellation requests may still be issued to the existing (or 
similar) endpoint when users intentionally navigate away from a dashboard or 
Explore view with charts in a loading state.
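
   The queued-cancellation check described above could be added to the worker 
roughly as follows (helper names are hypothetical; the `STOPPED` constant 
mirrors the existing `QueryStatus.STOPPED`):

   ```python
   STOPPED = "stopped"  # stands in for QueryStatus.STOPPED

   def run_query_job(get_query_status, execute_query, query_id):
       """Sketch of a Celery task body: skip execution entirely if the
       query was cancelled while it sat in the queue."""
       if get_query_status(query_id) == STOPPED:
           # The user cancelled before a worker picked up the job;
           # never touch the database.
           return {"query_id": query_id, "status": STOPPED}
       return execute_query(query_id)
   ```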
   
   **Query deduplication**
   
   The `query` table in the Superset metadata database currently includes only 
queries run via SQL Lab. Adapting this for use with dashboards and charts may 
have a larger impact than we're willing to accept at this time. Instead, a 
separate key-value store (e.g. Redis) may be used for tracking and preventing 
duplicate queries. Using a fast KV store also allows us to check for duplicate 
queries more efficiently in the web request cycle.
   
   Each query issued to the backend can be fingerprinted using a hashing 
algorithm (SHA-256 or similar) to generate a unique key based on the following:
   
   - Normalized SQL query text
   - Database ID
   - Database schema
   
   Prior to executing a query, a hash (key) is generated and checked against a 
key-value store. If the key does not exist, it is stored with a configured TTL, 
containing an object value with the following properties:
   
   - Query state (running, success, error)
   - Query type (chart, sql_lab)
   - Result key (cache key)
   
   Another key is created to track the Channels and Job IDs that should be 
notified when this query completes (e.g. `<hash>_jobs` → 
`List[ChannelId:JobId]`). If a duplicate query is issued while currently 
running, the Job ID is pushed onto the list, and all relevant channels are 
notified via websocket when the query completes. If a query is issued and a 
cache key exists with `state == "success"`, a notification is triggered 
immediately via websocket to the client. If queries are "force" refreshed, 
query deduplication is performed only for currently running queries.
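
   The fingerprinting step can be sketched as follows. The whitespace/case 
normalization here is a deliberately crude stand-in; a real implementation 
would likely canonicalize the query via a SQL parser.

   ```python
   import hashlib
   import re

   def query_fingerprint(sql: str, database_id: int, schema: str) -> str:
       """Derive a deduplication key from the normalized query text plus
       database ID and schema, per the scheme above."""
       # Collapse whitespace and lowercase as a naive normalization.
       normalized = re.sub(r"\s+", " ", sql.strip()).lower()
       payload = f"{database_id}:{schema}:{normalized}"
       return hashlib.sha256(payload.encode("utf-8")).hexdigest()
   ```

   Before execution, this key is checked against the KV store: a hit with 
`state == "success"` triggers an immediate notification, while a hit with a 
running state appends the Job ID to the `<hash>_jobs` list instead of 
re-running the query.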
   
   ### New or Changed Public Interfaces
   
   - New configuration options
   - Additional API endpoints (see Migration Plan, below)
   
   ### New dependencies (optional)
   
   - Redis ≥ 5.0 (pubsub, Redis Streams) or alternative
   - Node.js runtime for websocket application
   
   ### Migration Plan and Compatibility
   
   Asynchronous query operations are currently an optional enhancement to 
Superset, and should remain that way. Configuring and running Celery workers 
should not be required for basic Superset operation. As such, this proposed 
websocket approach to async query operations should be an optional enhancement 
to Superset, available via a configuration flag.
   
   For users who opt to run Superset in full async mode, the following 
requirements will apply under the current proposal:
   
   - Chart cache backend
   - SQL Lab results backend
   - Pubsub backend (current recommendation: Redis Streams)
   - Running the sidecar Node.js websocket application
   
   [Browsers that do not support 
websockets](https://caniuse.com/#feat=websockets) (very few) should fall back 
to synchronous operation or short polling.
   
   **Migration plan:**
   
   - Introduce new configuration options to enable asynchronous operation in 
Superset. Pass these values to the frontend as feature flags (as appropriate)
   - Introduce new API endpoints in Superset's Flask web application for async 
query/data loading for charts and queries
   - Introduce new API endpoints in Superset's Flask web application required 
for interaction with the sidecar application
   - Build and deploy the Node.js websocket server application
   - Build and deploy async frontend code to consume websocket events, under 
feature flag
   
   ### Rejected Alternatives
   
   ### SSE (EventSource) over HTTP/2
   
   [Server-Sent 
Events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events)
 (SSE) streams data over a multiplexed HTTP/2 connection. SSE is a native HTML5 
feature that allows the server to keep the HTTP connection open and push data 
to the client. Thanks to HTTP/2 multiplexing, the number of concurrent 
requests per domain is not limited to 6-8 but is effectively unlimited.
   
   **Pros**
   
   - Single, multiplexed connection to the server (via HTTP/2)
       - Connections can be limited via `SETTINGS_MAX_CONCURRENT_STREAMS`, but 
this [should be no smaller than 100](https://http2.github.io/http2-spec/)
   - Events pushed to the client from the server in realtime
   - Supports streaming data and traditional HTTP authentication methods
   - Minor adjustments to proxies and load balancers
   - Can use async 
[generators](https://medium.com/code-zen/python-generator-and-html-server-sent-events-3cdf14140e56)
 with Flask & gevent/eventlet (untested)
   
   **Cons**
   
   - HTTP/2 not supported in [some browsers](https://caniuse.com/#feat=http2) 
(IE)
   - SSE requires [polyfill](https://github.com/Yaffle/EventSource) for [some 
browsers](https://caniuse.com/#feat=eventsource)
   - Requires a persistent connection to the server for *each tab*
   - HTTP/2 requires TLS (not really a con, per se)
   - If SSE is used without HTTP/2, persistent SSE connections may saturate 
browser connection limit with multiple tabs
   - AWS "Classic" ELB [does not support 
HTTP/2](https://aws.amazon.com/elasticloadbalancing/features/). ALB supports 
HTTP/2, but apparently converts it to HTTP/1.1 upstream (AWS-specific)
   - Does not support upstream messaging (from the client to the server)
   
   _NOTE: HTTP/2 multiplexing could still potentially be valuable alongside the 
websocket features, and should be investigated further._
   
   ### Long polling (aka Comet)
   
   **Pros**
   
   - Excellent browser support (a standard HTTP request)
   
   **Cons**
   
   - Requires long-lived connections
   - Browser limit on concurrent connections to the same domain (6-8) results 
in connection blocking (HTTP/1.1)
   
   ### Short polling
   
   **Pros**
   
   - Excellent browser support (a standard HTTP request)
   - It's the current solution for async queries in SQL Lab
   
   **Cons**
   
   - Browser limit on concurrent connections to the same domain (6-8) results 
in connection blocking (HTTP/1.1)
   - Header overhead: every poll request and response contains a full set of 
HTTP headers
   - Auth overhead: each polling request must be authenticated and authorized 
on the server
   
   ---
   Thanks to @etr2460 @suddjian @nytai @willbarrett @craig-rueda for feedback 
and review.
