houqp commented on a change in pull request #7492: [AIRFLOW-6871] optimize tree 
view for large DAGs
URL: https://github.com/apache/airflow/pull/7492#discussion_r384869019
 
 

 ##########
 File path: airflow/www/views.py
 ##########
 @@ -1374,90 +1376,115 @@ def tree(self):
                 .all()
             )
         dag_runs = {
-            dr.execution_date: alchemy_to_dict(dr) for dr in dag_runs}
+            dr.execution_date: alchemy_to_dict(dr) for dr in dag_runs
+        }
 
         dates = sorted(list(dag_runs.keys()))
         max_date = max(dates) if dates else None
         min_date = min(dates) if dates else None
 
         tis = dag.get_task_instances(start_date=min_date, end_date=base_date)
-        task_instances = {}
+        task_instances: Dict[Tuple[str, datetime], models.TaskInstance] = {}
         for ti in tis:
-            tid = alchemy_to_dict(ti)
-            dr = dag_runs.get(ti.execution_date)
-            tid['external_trigger'] = dr['external_trigger'] if dr else False
-            task_instances[(ti.task_id, ti.execution_date)] = tid
+            task_instances[(ti.task_id, ti.execution_date)] = ti
 
-        expanded = []
+        expanded = set()
         # The default recursion traces every path so that tree view has full
         # expand/collapse functionality. After 5,000 nodes we stop and fall
         # back on a quick DFS search for performance. See PR #320.
-        node_count = [0]
+        node_count = 0
         node_limit = 5000 / max(1, len(dag.leaves))
 
+        def encode_ti(ti: Optional[models.TaskInstance]) -> Optional[List]:
+            if not ti:
+                return None
+
+            # NOTE: order of entry is important here because client JS relies 
on it for
+            # tree node reconstruction. Remember to change JS code in tree.html
+            # whenever order is altered.
+            data = [
+                ti.state,
+                ti.try_number,
+                None,  # start_ts
+                None,  # duration
+            ]
+
+            if ti.start_date:
+                # round to seconds to reduce payload size
+                data[2] = int(ti.start_date.timestamp())
+                if ti.duration is not None:
+                    data[3] = int(ti.duration)
+
+            return data
+
         def recurse_nodes(task, visited):
+            nonlocal node_count
+            node_count += 1
             visited.add(task)
-            node_count[0] += 1
-
-            children = [
-                recurse_nodes(t, visited) for t in task.downstream_list
-                if node_count[0] < node_limit or t not in visited]
-
-            # D3 tree uses children vs _children to define what is
-            # expanded or not. The following block makes it such that
-            # repeated nodes are collapsed by default.
-            children_key = 'children'
-            if task.task_id not in expanded:
-                expanded.append(task.task_id)
-            elif children:
-                children_key = "_children"
-
-            def set_duration(tid):
-                if (isinstance(tid, dict) and tid.get("state") == 
State.RUNNING and
-                        tid["start_date"] is not None):
-                    d = timezone.utcnow() - timezone.parse(tid["start_date"])
-                    tid["duration"] = d.total_seconds()
-                return tid
-
-            return {
+            task_id = task.task_id
+
+            node = {
                 'name': task.task_id,
                 'instances': [
-                    set_duration(task_instances.get((task.task_id, d))) or {
-                        'execution_date': d.isoformat(),
-                        'task_id': task.task_id
-                    }
-                    for d in dates],
-                children_key: children,
+                    encode_ti(task_instances.get((task_id, d)))
+                    for d in dates
+                ],
                 'num_dep': len(task.downstream_list),
                 'operator': task.task_type,
                 'retries': task.retries,
                 'owner': task.owner,
-                'start_date': task.start_date,
-                'end_date': task.end_date,
-                'depends_on_past': task.depends_on_past,
                 'ui_color': task.ui_color,
-                'extra_links': task.extra_links,
             }
 
+            if task.downstream_list:
+                children = [
+                    recurse_nodes(t, visited) for t in task.downstream_list
+                    if node_count < node_limit or t not in visited]
+
+                # D3 tree uses children vs _children to define what is
+                # expanded or not. The following block makes it such that
+                # repeated nodes are collapsed by default.
+                if task.task_id not in expanded:
+                    children_key = 'children'
+                    expanded.add(task.task_id)
+                else:
+                    children_key = "_children"
+                node[children_key] = children
+
+            if task.depends_on_past:
+                node['depends_on_past'] = task.depends_on_past
+            if task.start_date:
+                # round to seconds to reduce payload size
 
 Review comment:
   This code path has a very hot function call loop that's very sensitive to if 
statements. For the large DAG that we have, adding one extra if statement 
increases the response time by more than 400ms. That's why `simplify 
reduce_nodes() logic, remove unnecessary if statements` is in the optimization 
list :)
   
   That and based on the understanding that we are rewriting Airflow web into a 
proper SPA, I think it's best not to introduce a config for this change. I 
would prefer us giving round to second a try and come back to add more sig. 
fig. or add a config later on if any real use-case comes up. It's better to not 
engineer solutions when we don't have a good use-case in mind.
   
   If you are really concerned about the precision, we can perhaps change it to 
round to 1 sig. fig. for a 7% performance hit. I can't really think of a case 
where knowing 0.01 second difference of runtime is important.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to