[GitHub] [airflow] dstandish commented on a diff in pull request #26358: Dataset List View

GitBox Wed, 28 Sep 2022 09:00:39 -0700


dstandish commented on code in PR #26358:
URL: https://github.com/apache/airflow/pull/26358#discussion_r981866565



##########
airflow/www/views.py:
##########
@@ -3550,6 +3556,82 @@ def dataset_dependencies(self):
             {'Content-Type': 'application/json; charset=utf-8'},
         )
 
+    @expose('/object/list_datasets')
+    @auth.has_access([(permissions.ACTION_CAN_READ, permissions.RESOURCE_DATASET)])
+    def get_datasets(self):
+        """Get datasets"""
+        allowed_attrs = ['uri', 'last_dataset_update']
+
+        limit = int(request.args.get("limit", 25))
+        offset = int(request.args.get("offset", 0))
+        order_by = request.args.get("order_by", "uri")
+        lstripped_orderby = order_by.lstrip('-')
+
+        if lstripped_orderby not in allowed_attrs:
+            return {
+                "detail": (
+                    f"Ordering with '{lstripped_orderby}' is disallowed or the attribute does not "
+                    "exist on the model"
+                )
+            }, 400
+
+        limit = 50 if limit > 50 else limit
+
+        with create_session() as session:
+            if lstripped_orderby == "uri":
+                if order_by[0] == "-":
+                    order_by = (DatasetModel.uri.desc(),)
+                else:
+                    order_by = (DatasetModel.uri.asc(),)
+            elif lstripped_orderby == "last_dataset_update":

Review Comment:
   I don't know whether you're supporting pagination here, or whether it 
matters, but if you're ordering by dataset event timestamp, you might get 
unexpected results because the order can change between page requests as new 
events are added. The timeframes are probably short enough that it wouldn't 
really matter.
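
To make the concern concrete, here is a minimal sketch of how `limit`/`offset` paging can drift when the sort key is the latest event timestamp. It does not use the PR's query; the `page` helper, the URIs, and the timestamps are all hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta


def page(last_update, limit, offset):
    """Offset pagination over datasets, newest update first, uri as tiebreak.

    Hypothetical stand-in for ORDER BY max(timestamp) DESC, uri ASC.
    """
    ordered = sorted(last_update.items(), key=lambda kv: kv[0])  # uri ascending
    ordered.sort(key=lambda kv: kv[1], reverse=True)  # stable sort: timestamp descending
    return [uri for uri, _ in ordered[offset:offset + limit]]


t0 = datetime(2022, 9, 28, 9, 0)
# Four made-up datasets; "a" was updated most recently and "d" least recently.
last_update = {
    f"s3://bucket/{name}": t0 - timedelta(minutes=i)
    for i, name in enumerate("abcd")
}

page_1 = page(last_update, limit=2, offset=0)

# A new event for "d" arrives between the two page requests.
last_update["s3://bucket/d"] = t0 + timedelta(minutes=1)

page_2 = page(last_update, limit=2, offset=2)

print(page_1)  # a, b
print(page_2)  # b, c: "b" repeats and "d" is never shown
```

Ordering by `uri` does not have this problem unless datasets are added or removed between requests, since the sort key of an existing row does not change.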



##########
airflow/www/views.py:
##########
@@ -3550,6 +3556,82 @@ def dataset_dependencies(self):
             {'Content-Type': 'application/json; charset=utf-8'},
         )
 
+    @expose('/object/list_datasets')
+    @auth.has_access([(permissions.ACTION_CAN_READ, permissions.RESOURCE_DATASET)])
+    def get_datasets(self):
+        """Get datasets"""
+        allowed_attrs = ['uri', 'last_dataset_update']
+
+        limit = int(request.args.get("limit", 25))
+        offset = int(request.args.get("offset", 0))
+        order_by = request.args.get("order_by", "uri")
+        lstripped_orderby = order_by.lstrip('-')
+
+        if lstripped_orderby not in allowed_attrs:
+            return {
+                "detail": (
+                    f"Ordering with '{lstripped_orderby}' is disallowed or the attribute does not "
+                    "exist on the model"
+                )
+            }, 400
+
+        limit = 50 if limit > 50 else limit
+
+        with create_session() as session:
+            if lstripped_orderby == "uri":
+                if order_by[0] == "-":
+                    order_by = (DatasetModel.uri.desc(),)
+                else:
+                    order_by = (DatasetModel.uri.asc(),)
+            elif lstripped_orderby == "last_dataset_update":
+                if order_by[0] == "-":
+                    order_by = (
+                        func.max(DatasetEvent.timestamp).desc(),
+                        DatasetModel.uri.asc(),
+                    )
+                    if session.bind.dialect.name == "postgresql":
+                        order_by = (order_by[0].nulls_last(), *order_by[1:])
+                else:
+                    order_by = (
+                        func.max(DatasetEvent.timestamp).asc(),
+                        DatasetModel.uri.desc(),
+                    )
+                    if session.bind.dialect.name == "postgresql":
+                        order_by = (order_by[0].nulls_first(), *order_by[1:])
+
+            total_entries = session.query(func.count(DatasetModel.id)).scalar()
+
+            datasets = [
+                dict(dataset)
+                for dataset in session.query(
+                    DatasetModel.id,
+                    DatasetModel.uri,
+                    func.max(DatasetEvent.timestamp).label("last_dataset_update"),
+                    func.count(distinct(DatasetEvent.id)).label("total_updates"),
+                )
+                .outerjoin(DatasetEvent, DatasetEvent.dataset_id == DatasetModel.id)
+                .outerjoin(
+                    DagScheduleDatasetReference, DagScheduleDatasetReference.dataset_id == DatasetModel.id
+                )
+                .outerjoin(
+                    TaskOutletDatasetReference, TaskOutletDatasetReference.dataset_id == DatasetModel.id
+                )
+                )

Review Comment:
   Do these two left joins do anything?
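
One way to check is to run the aggregate with and without an extra outer join. The sketch below uses an in-memory SQLite database with made-up table names and rows, not Airflow's models, and it assumes the query groups by the dataset's `id` and `uri` (the grouping is not visible in the quoted hunk). Under those assumptions the extra join leaves the selected values unchanged and only multiplies the rows being aggregated, which `COUNT(DISTINCT ...)` then has to undo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one dataset, two events, three scheduling references.
conn.executescript(
    """
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, uri TEXT);
    CREATE TABLE dataset_event (id INTEGER PRIMARY KEY, dataset_id INTEGER, timestamp TEXT);
    CREATE TABLE dag_schedule_ref (dataset_id INTEGER, dag_id TEXT);
    INSERT INTO dataset VALUES (1, 's3://bucket/a');
    INSERT INTO dataset_event VALUES (1, 1, '2022-09-28 09:00'), (2, 1, '2022-09-28 09:05');
    INSERT INTO dag_schedule_ref VALUES (1, 'dag_x'), (1, 'dag_y'), (1, 'dag_z');
    """
)

# COUNT(*) is included only to expose how many joined rows feed the aggregate.
BASE = """
SELECT d.uri, MAX(e.timestamp), COUNT(DISTINCT e.id), COUNT(*)
FROM dataset d
LEFT JOIN dataset_event e ON e.dataset_id = d.id
{extra_join}
GROUP BY d.id, d.uri
"""

without_join = conn.execute(BASE.format(extra_join="")).fetchall()
with_join = conn.execute(
    BASE.format(extra_join="LEFT JOIN dag_schedule_ref r ON r.dataset_id = d.id")
).fetchall()

print(without_join)  # 2 joined rows behind the aggregate
print(with_join)     # same uri, max timestamp and distinct count, but 6 joined rows
```

If a later part of the query filters on or selects from the referenced tables, the joins would of course be needed; this only covers the columns visible in the hunk.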



##########
airflow/www/views.py:
##########
@@ -3550,6 +3556,82 @@ def dataset_dependencies(self):
             {'Content-Type': 'application/json; charset=utf-8'},
         )
 
+    @expose('/object/list_datasets')
+    @auth.has_access([(permissions.ACTION_CAN_READ, permissions.RESOURCE_DATASET)])
+    def get_datasets(self):

Review Comment:
   Given that this query will be more expensive than just querying the 
datasets table, maybe it makes sense to name this something like 
`get_datasets_summary` or `get_dataset_summary_stats`. If all you want to do 
is list the datasets, you wouldn't need or expect the other stuff.


