villebro commented on issue #9190: URL: https://github.com/apache/superset/issues/9190#issuecomment-2205419056
With this SIP still being behind an experimental feature flag, and not actively maintained, I've been thinking about ways we could simplify this architecture and finally make it generally available in a forthcoming Superset release. The reason I think stabilizing this feature is important is that Superset's current synchronous query execution model causes lots of issues:

- If many people open the same chart/dashboard at the same time, each will execute a query against the underlying database, due to no locking of queries.
- If a user refreshes a dashboard multiple times, they can quickly congest the downstream database with heavy queries, eating up both webserver threads and database resources.
- The web worker threads/processes get blocked waiting for long-running queries to complete executing, making it impossible to effectively scale web worker replica sets based on CPU consumption.

Fixing this should make it possible to get by with much slimmer web worker replica sets. Furthermore, async workers could be scaled up/down based on queue depth.

To simplify the architecture and reuse existing functionality, I propose the following:

- The websocket architecture is removed; in the future only polling would be supported. The concept of a "query context cache key" is also removed in favor of a single cache key, i.e. the one we already use for chart data.
- When requesting chart data, if the data exists in the cache, the data is returned normally.
- When chart data isn't available in the cache, only the `cache_key` is returned, along with additional details: when the most recent request was submitted, status (pending, executing), etc.

The async execution flow is changed to be similar to SQL Lab async execution, with the following changes:

- When the async worker starts executing the query, the cache key is locked using the `KeyValueDistributedLock` class. This means that only a single worker executes any one cache key's query at a time.
- To support automatic cancelling of queries, we add a new optional field `poll_ttl` to the query context, which makes it possible to automatically cancel queries that are not being actively polled. Every time the cache key is polled, the latest poll time is updated on the metadata object. The worker periodically checks the metadata object, and if `poll_ttl` is defined and the last poll time is older than the TTL, the query is automatically cancelled. This ensures that if a person closes a dashboard with lots of long-running queries, those queries are automatically cancelled if nobody is actively waiting for the results. By default, frontend requests have `poll_ttl` set to whichever value is set in the config (`DEFAULT_CHART_DATA_POLL_TTL`). Cache warmup requests would likely not have a `poll_ttl` set, so as to avoid unnecessary polling.
- To limit hammering of the polling endpoint, we introduce a customizable backoff function in `superset_config.py`, which makes it possible to define how polling backoff should be implemented. The default behavior would be some sort of exponential backoff, where freshly started queries are polled more actively, and queries that have been pending/running for a long time are polled less frequently. When the frontend requests chart data, the backend provides the recommended wait time in the response, based on the backoff function.

I assume we need a new SIP for this, but I wanted to drop this comment here to get initial feedback.
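To make the `poll_ttl` auto-cancellation idea above concrete, here is a minimal sketch of the per-cache-key metadata check the worker could run periodically. All names here (`QueryMetadata`, `record_poll`, `should_cancel`) are hypothetical illustrations, not existing Superset APIs:

```python
import time


class QueryMetadata:
    """Illustrative per-cache-key metadata record (hypothetical names)."""

    def __init__(self, poll_ttl=None):
        self.poll_ttl = poll_ttl  # seconds; None disables auto-cancel
        self.last_poll_time = time.time()

    def record_poll(self):
        # Called by the polling endpoint every time the cache key is polled.
        self.last_poll_time = time.time()

    def should_cancel(self, now=None):
        # Called periodically by the async worker: cancel if nobody has
        # polled within poll_ttl seconds. A None poll_ttl (e.g. a cache
        # warmup request) never triggers auto-cancellation.
        if self.poll_ttl is None:
            return False
        now = time.time() if now is None else now
        return now - self.last_poll_time > self.poll_ttl


meta = QueryMetadata(poll_ttl=30)
meta.should_cancel(now=meta.last_poll_time + 10)  # False: polled recently
meta.should_cancel(now=meta.last_poll_time + 60)  # True: poll_ttl exceeded
```

A dashboard being closed simply means `record_poll` stops being called, so every pending query's `should_cancel` eventually flips to true.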
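As a rough illustration of the customizable backoff function described above, the default exponential backoff could look something like the sketch below. The function name and parameters are assumptions for illustration, not an actual `superset_config.py` setting:

```python
def poll_backoff(poll_count: int, base: float = 1.0, factor: float = 2.0, cap: float = 30.0) -> float:
    """Recommended wait in seconds before the next poll.

    Freshly started queries (low poll_count) are polled frequently;
    queries that have been pending/running for a long time are polled
    less often, capped at `cap` seconds so clients still poll occasionally.
    """
    return min(base * factor ** poll_count, cap)


poll_backoff(0)   # 1.0  -> poll again after 1 second
poll_backoff(3)   # 8.0  -> backing off exponentially
poll_backoff(10)  # 30.0 -> capped at 30 seconds
```

The backend would evaluate this on each poll and return the recommended wait time alongside the `cache_key` and status, so the frontend never has to hardcode its own polling schedule.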
