john-bodley opened a new pull request, #22413:
URL: https://github.com/apache/superset/pull/22413

   
   ### SUMMARY
   
   In conjunction with https://github.com/dpgaspar/Flask-AppBuilder/pull/1961, 
this PR defines the explicit bidirectional relationship mapping between the 
`SqlaTable` model (parent) and the corresponding `SqlMetric` and `TableColumn` 
models (children) to improve the performance of the `GET` 
`/api/v1/dataset/{pk}` endpoint.
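
As an illustration of what an explicit bidirectional mapping looks like in SQLAlchemy, here is a minimal sketch. The `Dataset` and `DatasetColumn` classes and their table names are hypothetical stand-ins, not the actual Superset models, and the sketch assumes SQLAlchemy 1.4 or later is installed. Each side names the attribute on the other side via `back_populates`, so SQLAlchemy keeps the two in sync in memory.

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Dataset(Base):
    """Hypothetical parent model (stand-in for a table-like model)."""

    __tablename__ = "dataset"
    id = Column(Integer, primary_key=True)
    # Parent side: the collection of children.
    columns = relationship("DatasetColumn", back_populates="table")


class DatasetColumn(Base):
    """Hypothetical child model (stand-in for a column-like model)."""

    __tablename__ = "dataset_column"
    id = Column(Integer, primary_key=True)
    table_id = Column(Integer, ForeignKey("dataset.id"))
    column_name = Column(String(255))
    # Child side: the back reference to the parent.
    table = relationship("Dataset", back_populates="columns")


dataset = Dataset()
column = DatasetColumn(column_name="ds")
# Appending on the parent side also sets the child's back reference.
dataset.columns.append(column)
```

Because both directions are declared, `column.table` is populated as soon as the child is attached to the parent, without a database round trip.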
   
   The motivation was a specific (and somewhat atypical) virtual 
dataset at Airbnb comprising over 7,000 columns and 30,000 metrics, 
for which the `/api/v1/dataset/{pk}` endpoint would previously time out because:
   
   1. FAB was using [Joined Eager 
Loading](https://docs.sqlalchemy.org/en/14/orm/loading_relationships.html#joined-eager-loading)
 which "eagerly" loads the children alongside the parent. However, this results 
in a slew of `LEFT OUTER JOIN`s which rapidly produce an expansive interim 
result set (its row count grows multiplicatively) given that the 
`SqlaTable` model has potentially multiple high-cardinality relationships 
(owners, columns, metrics). Furthermore, the result set, which is repetitive in 
nature, requires further overhead for SQLAlchemy to 
[de-duplicate](https://docs.sqlalchemy.org/en/14/orm/loading_relationships.html#joined-eager-loading-and-result-set-batching)
 the rows against the actual model record.
   2. Superset was lazy loading the `TableColumn.table` back reference (see 
[this](https://apache-superset.slack.com/archives/G013HAE6Y0K/p1670880289179439)
 Slack thread) when evaluating the `columns.type_generic` field, which 
resulted in `N + 1` `SELECT` statements, i.e., for each column it would 
execute a SQL statement to fetch the 
corresponding table ID.
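
The multiplicative growth described in the first item can be reproduced with plain SQL. The following is a minimal sketch using Python's built-in `sqlite3` module. The table names (`dataset`, `dataset_owner`, `dataset_column`, `dataset_metric`) and the small row counts are hypothetical and chosen only for illustration; this is not the query FAB actually emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset (id INTEGER PRIMARY KEY);
    CREATE TABLE dataset_owner (id INTEGER PRIMARY KEY, table_id INTEGER);
    CREATE TABLE dataset_column (id INTEGER PRIMARY KEY, table_id INTEGER);
    CREATE TABLE dataset_metric (id INTEGER PRIMARY KEY, table_id INTEGER);
    INSERT INTO dataset (id) VALUES (1);
    """
)

# Hypothetical child counts for the single parent row.
children = {"dataset_owner": 2, "dataset_column": 7, "dataset_metric": 30}
for name, count in children.items():
    conn.executemany(
        f"INSERT INTO {name} (table_id) VALUES (?)", [(1,)] * count
    )
conn.commit()

# One statement joining every child collection: rows multiply.
joined_rows = conn.execute(
    """
    SELECT COUNT(*)
    FROM dataset d
    LEFT OUTER JOIN dataset_owner o ON o.table_id = d.id
    LEFT OUTER JOIN dataset_column c ON c.table_id = d.id
    LEFT OUTER JOIN dataset_metric m ON m.table_id = d.id
    WHERE d.id = 1
    """
).fetchone()[0]

# One statement per collection: rows add up instead.
separate_rows = 1 + sum(
    conn.execute(
        f"SELECT COUNT(*) FROM {name} WHERE table_id = 1"
    ).fetchone()[0]
    for name in children
)

print(joined_rows, separate_rows)
```

With 2 owners, 7 columns, and 30 metrics the joined statement yields 2 × 7 × 30 = 420 rows, whereas loading each collection separately touches 1 + 2 + 7 + 30 = 40 rows. By the same arithmetic, joining 7,000 columns and 30,000 metrics in one statement would produce 210,000,000 rows per owner.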
   
   The first issue is addressed in 
https://github.com/dpgaspar/Flask-AppBuilder/pull/1961, which brought the 
response time (excluding downloading of the payload) from timing out down to 
~20 seconds. The second issue is addressed in this PR (which the FAB change 
leverages), which also defines eager loading of the back (parent) references, 
which are already determined whilst fetching the `SqlaTable` model. This further 
reduced the response time to ~5 seconds.
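
To make the `N + 1` pattern concrete, here is a minimal sketch that counts the `SELECT` statements issued by each access pattern, using Python's built-in `sqlite3` module and its statement trace hook. The table names, the five-column fixture, and the helper functions are hypothetical; the sketch mimics the behavior at the SQL level rather than reproducing Superset's or SQLAlchemy's internals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset (id INTEGER PRIMARY KEY);
    CREATE TABLE dataset_column (id INTEGER PRIMARY KEY, table_id INTEGER);
    INSERT INTO dataset (id) VALUES (1);
    """
)
conn.executemany(
    "INSERT INTO dataset_column (table_id) VALUES (?)", [(1,)] * 5
)
conn.commit()

# Record every statement SQLite executes from here on.
statements = []
conn.set_trace_callback(statements.append)


def count_selects(fn):
    """Run fn and return how many SELECT statements it issued."""
    statements.clear()
    fn()
    return sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))


def lazy_back_reference():
    # 1 query for the columns, then 1 query per column for its parent.
    cols = conn.execute(
        "SELECT id, table_id FROM dataset_column WHERE table_id = 1"
    ).fetchall()
    for _col_id, table_id in cols:
        conn.execute(
            "SELECT id FROM dataset WHERE id = ?", (table_id,)
        ).fetchone()


def parent_already_known():
    # The parent is fetched once and reused for every column.
    parent = conn.execute("SELECT id FROM dataset WHERE id = 1").fetchone()
    cols = conn.execute(
        "SELECT id, table_id FROM dataset_column WHERE table_id = 1"
    ).fetchall()
    return [(col_id, parent[0]) for col_id, _ in cols]


lazy_count = count_selects(lazy_back_reference)
eager_count = count_selects(parent_already_known)
print(lazy_count, eager_count)
```

For five columns the lazy pattern issues six `SELECT` statements while the second pattern issues two, and the gap widens linearly with the number of columns.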
   
   ### BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
   
   ### TESTING INSTRUCTIONS
   
   CI. Tested locally. 
   
   ### ADDITIONAL INFORMATION
   - [ ] Has associated issue:
   - [ ] Required feature flags:
   - [ ] Changes UI
   - [ ] Includes DB Migration (follow approval process in 
[SIP-59](https://github.com/apache/superset/issues/13351))
     - [ ] Migration is atomic, supports rollback & is backwards-compatible
     - [ ] Confirm DB migration upgrade and downgrade tested
     - [ ] Runtime estimates and downtime expectations provided
   - [ ] Introduces new feature or API
   - [ ] Removes existing feature or API
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@superset.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

