ktmud commented on a change in pull request #15648:
URL: https://github.com/apache/superset/pull/15648#discussion_r668470726
##########
File path: superset/models/dashboard.py
##########
@@ -246,13 +256,24 @@ def data(self) -> Dict[str, Any]:
unless=lambda: not is_feature_enabled("DASHBOARD_CACHE"),
)
def datasets_trimmed_for_slices(self) -> List[Dict[str, Any]]:
- datasource_slices = utils.indexed(self.slices, "datasource")
- return [
- # Filter out unneeded fields from the datasource payload
- datasource.data_for_slices(slices)
- for datasource, slices in datasource_slices.items()
- if datasource
- ]
+ # Verbose but efficient database enumeration of dashboard datasources.
+ datasource_slices: Dict[
+ Tuple[Type["BaseDatasource"], int], Set[Slice]
+ ] = defaultdict(set)
+
+ for slc in self.slices:
+ datasource_slices[(slc.cls_model, slc.datasource_id)].add(slc)
+
+ result: List[Dict[str, Any]] = []
+
+ for (cls_model, datasource_id), slices in datasource_slices.items():
+ datasource =
db.session.query(cls_model).filter_by(id=datasource_id).first()
+
+ if datasource:
+ # Filter out unneeded fields from the datasource payload
+ result.append(datasource.data_for_slices(slices))
+
+ return result
Review comment:
For this method, though, I believe the real bottle neck is in
`dashboard.data_for_slices` where [these two
lines](https://github.com/airbnb/superset-fork/blob/b295c6ad43798d33b15af7b34ab4a4ff1fd2a903/superset/connectors/base/models.py#L263-L264)
takes a lot of time.
Generating the `datasource.data` dictionary for the one big datasource we
have takes about ~1.2s (80% of the query time). There may be two solutions here:
1. Refactor `datasource.data(...)` and `datasource.data_for_slices(...)` to
load metrics & columns lazily.
2. Creating cache for `Datasource.data`, clearing the cache using SQLA hook
in a similar fashion to
[clear_dashboard_cache](https://github.com/airbnb/superset-fork/blob/f24264ccdc6bf6ab95f1969295dd9ba3612e9054/superset/models/dashboard.py#L405).
##########
File path: superset/models/dashboard.py
##########
@@ -171,8 +172,17 @@ def url(self) -> str:
@property
def datasources(self) -> Set[BaseDatasource]:
- # pylint: disable=using-constant-test
- return {slc.datasource for slc in self.slices if slc.datasource}
+ # Verbose but efficient database enumeration of dashboard datasources.
+ datasources = {(slc.cls_model, slc.datasource_id) for slc in
self.slices}
+ result: Set[BaseDatasource] = set()
+
+ for cls_model, datasource_id in datasources:
+ datasource =
db.session.query(cls_model).filter_by(id=datasource_id).first()
+
+ if datasource:
+ result.add(datasource)
+
+ return result
Review comment:
here's another way (probably more efficient):
```python
@property
def datasources(self) -> Set[BaseDatasource]:
datasources = defaultdict(set)
for slc in self.slices:
datasources[slc.cls_model].add(slc.datasource_id)
result = []
for cls_model, ids in datasources.items():
result +=
db.session.query(cls_model).filter(cls_model.id._in(ids)).all()
return set(result)
```
On one of my test dashboard, this returns an average query time of `0.028s`.
<img
src="https://user-images.githubusercontent.com/335541/125401010-750f3680-e367-11eb-9c1a-d4ff680ba986.png"
width="300" />
While the original method:
```python
@property
def datasources(self) -> Set[BaseDatasource]:
# pylint: disable=using-constant-test
return {slc.datasource for slc in self.slices if slc.datasource}
```
costs about `0.5s` (17x slower).
<img
src="https://user-images.githubusercontent.com/335541/125400609-f4503a80-e366-11eb-9779-f4feee779f84.png"
width="200">
The code in this PR is about 2x slower than my optimized version:
<img
src="https://user-images.githubusercontent.com/335541/125401154-a7209880-e367-11eb-922f-006f310dfac3.png"
width="200">
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]