xBis7 commented on PR #54103:
URL: https://github.com/apache/airflow/pull/54103#issuecomment-3678912476
I've been running various tests and, while doing so, I noticed that the metrics
I was getting weren't consistent between consecutive runs. So I started logging
the same values that I was sending to OTel for creating the metrics, and I
found that in certain cases the final numbers differed from what we would see
in the Grafana dashboards. For example, Grafana would show a maximum of 5
concurrent active dag runs while the actual maximum was 7. The reason is that
we export the current value at an interval, not every past and present value,
so the collector gets samples rather than the complete list of values.
In order to capture more values for the diagrams, I decreased the sampling
step and the export interval significantly. That made even instantaneous values
visible in the diagrams.
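For illustration, this is roughly what shortening the export interval looks like; a minimal sketch assuming the standard `opentelemetry-sdk` Python API (the console exporter, meter name, instrument name, and 1s interval here are placeholders, not the exact setup I used):

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Export the current value of every instrument once per second instead of the
# default 60s, so short-lived spikes (e.g. 7 concurrent dag runs) still show
# up as data points instead of falling between two exports.
reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(),            # swap in an OTLP exporter in practice
    export_interval_millis=1_000,
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("scheduler-test")
active_dag_runs = meter.create_up_down_counter("active_dag_runs")
active_dag_runs.add(1)                  # example: one dag run started
```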
The higher the load, the longer the tests take to run, so we end up with more
samples for the metrics, which in turn leads to a more reliable and
representative diagram.
A very simple test case that I ran was triggering 2 dag runs of the same dag,
with and without the patch. In that scenario, the results are essentially
random: sometimes the patch is faster, sometimes it isn't, but in all cases the
difference is negligible. For example, the scheduler might do 4 iterations with
the patch and 5 without, 5 in both cases, or 6 with and 4 without. The timings
end up within ±1s.
## Testing different topologies
I used the following dags (a sketch of how the topologies can be defined follows the list)
* 5 linear dags with 10 tasks each, all in a sequence
```
0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8 -> 9
```
* 2 dags with a single root task and 100 downstream tasks all running in
parallel
```
0 -> (1_1, 1_2, 1_3, 1_4, ..., 1_100 -- parallel)
```
* 5 branching dags with 1 root task that has 5 child tasks, each of which has
3 child tasks
```
0 -> (1_1, 1_2, 1_3, 1_4, 1_5 -- parallel)
1_1 -> (2_1, 2_2, 2_3 -- parallel)
1_2 -> (2_4, 2_5, 2_6 -- parallel)
1_3 -> (2_7, 2_8, 2_9 -- parallel)
1_4 -> (2_10, 2_11, 2_12 -- parallel)
1_5 -> (2_13, 2_14, 2_15 -- parallel)
```
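For reference, a minimal sketch of how the linear and branching topologies above can be defined, assuming Airflow 2.x-style imports and `EmptyOperator` (dag ids, dates, and task id naming are illustrative, not my exact test files):

```python
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.empty import EmptyOperator

with DAG("linear_dag", start_date=datetime(2024, 1, 1), schedule=None):
    # 10 tasks in a single sequence: t_0 -> t_1 -> ... -> t_9
    chain(*[EmptyOperator(task_id=f"t_{i}") for i in range(10)])

with DAG("branching_dag", start_date=datetime(2024, 1, 1), schedule=None):
    root = EmptyOperator(task_id="t_0")
    for i in range(1, 6):
        child = EmptyOperator(task_id=f"t_1_{i}")
        root >> child
        # each of the 5 children fans out into 3 leaf tasks (t_2_1 ... t_2_15)
        child >> [EmptyOperator(task_id=f"t_2_{j}") for j in range(3 * i - 2, 3 * i + 1)]
```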
I triggered all of them and also created multiple dag_runs for most of them.
There wasn't much difference in the number of scheduler iterations, but with
the patch the scheduler consistently queued every task that it examined,
whereas without the patch it examines a very high number of tasks only to
queue just a few.
The difference in the time needed to run the dags is noticeable.
* with the patch
* scheduler iterations: 150
* total time: 248.28s
* without the patch
* scheduler iterations: 162
* total time: 345.64s
<img width="2238" height="974" alt="test_topologies"
src="https://github.com/user-attachments/assets/89793622-2308-4429-bfeb-a6fd24317663"
/>
## Testing heavy load
This is the original test where I created multiple dags, each with all of its
tasks running in parallel and no dependencies between them (a sketch of how
such dags can be generated follows the list).
* dag_45_tasks
* 45 parallel tasks
* dag_250_tasks
* 250 parallel tasks
* dag_470_tasks
* 470 parallel tasks
* dag_1000_tasks
* 1000 parallel tasks
* dag_1100_tasks
* 1100 parallel tasks
* dag_1200_tasks
* 1200 parallel tasks
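Dags like these can be generated parametrically; a minimal sketch, again assuming Airflow 2.x-style imports (the `dag_{n}_tasks` naming simply mirrors the list above):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# One dag per size, each with N independent tasks and no dependencies between
# them, so the scheduler is free to queue all of them at once.
for n in (45, 250, 470, 1000, 1100, 1200):
    with DAG(f"dag_{n}_tasks", start_date=datetime(2024, 1, 1), schedule=None) as dag:
        for i in range(n):
            EmptyOperator(task_id=f"task_{i}")
    # keep a module-level reference so the DagBag picks each dag up
    globals()[dag.dag_id] = dag
```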
I reran this test because I made a lot of changes in my testing
infrastructure and I wanted to verify that the new metrics are in line with the
old ones that I previously shared.
* with the patch
* scheduler iterations: 402
* max number of concurrent DRs: 6
* total time: 675.25s
* without the patch
* scheduler iterations: 1343
* max number of concurrent DRs: 3
* total time: 1024.49s
<img width="2238" height="974" alt="test_heavy_load"
src="https://github.com/user-attachments/assets/8f18cfca-3203-4505-aa42-c9769d54d1a3"
/>