xBis7 commented on PR #54103:
URL: https://github.com/apache/airflow/pull/54103#issuecomment-3678912476

   I've been running various tests, and while doing so I noticed that the metrics 
I was getting weren't consistent between consecutive runs. So I started logging 
the same values that I was sending to OTel for creating the metrics, and I found 
that in certain cases the final numbers differed from what we would see in the 
Grafana dashboards. For example, Grafana would show a maximum of 5 active 
concurrent dag runs when the actual maximum was higher, at 7. The reason is that 
we export the current value at an interval, not all past and present values: the 
collector gets samples, not a complete list of the values.
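   The effect can be sketched with a toy example (the values below are hypothetical, not taken from the actual test runs): a gauge that briefly peaks between export ticks never shows its true maximum in the exported series.

   ```python
   # Toy gauge history (hypothetical values): time -> active concurrent dag runs.
   series = {0: 3, 1: 5, 2: 7, 3: 7, 4: 5, 5: 4, 6: 2}

   def sampled_max(series, step):
       """Max over values observed only at export instants (multiples of step)."""
       return max(v for t, v in series.items() if t % step == 0)

   true_max = max(series.values())      # 7: the real peak
   coarse_max = sampled_max(series, 4)  # 5: exports at t=0 and t=4 miss the peak
   fine_max = sampled_max(series, 1)    # 7: a small enough interval sees every value
   ```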
   
   In order to capture more values for the diagrams, I decreased the sampling 
step and the export interval significantly. That made even instantaneous values 
visible in the diagrams.
   
   The higher the load, the longer the tests take to run, so we end up with more 
samples for the metrics, which in turn leads to a more reliable and 
representative diagram.
   
   A very simple test case that I ran was triggering 2 dag runs of the same dag, 
with and without the patch. In that scenario the results are random: sometimes 
the patch is faster and sometimes it isn't, but in all cases the difference is 
negligible. For example, the scheduler might do 4 iterations with the patch and 
5 without, or 5 in both cases, or 6 with and 4 without. The timings end up 
within ±1s.
   
   ## Testing different topologies
   
   I used the following dags
   
   * 5 linear dags with 10 tasks each, all in a sequence
   
       ```
       0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8 -> 9 -> 10
       ```
   
   * 2 dags with a single root task and 100 downstream tasks all running in 
parallel
   
       ```
       0 -> (1_1, 1_2, 1_3, 1_4, ..., 1_100 -- parallel)
       ```
   
   * 5 branching dags with 1 root task that has 5 child tasks, where each 
child task has 3 child tasks of its own

       ```
       0 -> (1_1, 1_2, 1_3, 1_4, 1_5 -- parallel)
       1_1 -> (2_1, 2_2, 2_3)
       1_2 -> (2_4, 2_5, 2_6)
       1_3 -> (2_7, 2_8, 2_9)
       1_4 -> (2_10, 2_11, 2_12)
       1_5 -> (2_13, 2_14, 2_15)
       ```
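   For reference, the three topologies can also be described programmatically. This is just an illustrative sketch using plain adjacency maps (task -> downstream tasks), not the actual dag files I used:

   ```python
   def linear(n):
       """A chain t0 -> t1 -> ... -> tn."""
       return {f"t{i}": [f"t{i+1}"] for i in range(n)} | {f"t{n}": []}

   def fan_out(width):
       """One root task with `width` parallel downstream tasks."""
       children = [f"1_{i}" for i in range(1, width + 1)]
       return {"0": children} | {c: [] for c in children}

   def branching(level1, level2):
       """A root with `level1` children, each with `level2` children of its own."""
       dag = {"0": [f"1_{i}" for i in range(1, level1 + 1)]}
       k = 1
       for i in range(1, level1 + 1):
           dag[f"1_{i}"] = [f"2_{j}" for j in range(k, k + level2)]
           k += level2
       for j in range(1, level1 * level2 + 1):
           dag[f"2_{j}"] = []
       return dag
   ```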
   
   I triggered all of them and also created multiple dag_runs for most of them. 
There wasn't much difference in the number of scheduler iterations, but the 
scheduler consistently queued the number of tasks that it examined. That is in 
contrast to what happens without the patch, where it examines a very high number 
of tasks only to queue just a few.
   
   The difference in the time needed to run the dags is noticeable.
   
   * with the patch
       * scheduler iterations: 150
       * total time: 248.28s
   * without the patch
       * scheduler iterations: 162
       * total time: 345.64s
   
   <img width="2238" height="974" alt="test_topologies" src="https://github.com/user-attachments/assets/89793622-2308-4429-bfeb-a6fd24317663" />
   
   
   ## Testing heavy load
   
   This is the original test where I created multiple dags with all tasks 
running in parallel without any dependencies between them.
   
   * dag_45_tasks
       * 45 parallel tasks
   * dag_250_tasks
       * 250 parallel tasks
   * dag_470_tasks
       * 470 parallel tasks
   * dag_1000_tasks
       * 1000 parallel tasks
   * dag_1100_tasks
       * 1100 parallel tasks
   * dag_1200_tasks
       * 1200 parallel tasks
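   These heavy-load dags all have the same trivial shape, so (purely as an illustration, with hypothetical task names) they can be generated from a single helper:

   ```python
   def parallel_dag(n):
       """n tasks with no dependencies between them (empty downstream lists)."""
       return {f"task_{i}": [] for i in range(1, n + 1)}

   sizes = [45, 250, 470, 1000, 1100, 1200]
   dags = {f"dag_{n}_tasks": parallel_dag(n) for n in sizes}
   ```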
   
   I reran this test because I made a lot of changes in my testing 
infrastructure and I wanted to verify that the new metrics are in line with the 
old ones that I previously shared.
   
   * with the patch
       * scheduler iterations: 402
       * max number of concurrent DRs: 6
       * total time: 675.25s
   * without the patch
       * scheduler iterations: 1343
       * max number of concurrent DRs: 3
       * total time: 1024.49s
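   From the totals reported above, the relative wall-time improvement works out to roughly 28% for the topology test and 34% for the heavy-load test:

   ```python
   def improvement(before, after):
       """Relative wall-time reduction, as a percentage."""
       return (before - after) / before * 100

   topologies = improvement(345.64, 248.28)   # ~28.2% faster with the patch
   heavy_load = improvement(1024.49, 675.25)  # ~34.1% faster with the patch
   ```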
   
   <img width="2238" height="974" alt="test_heavy_load" src="https://github.com/user-attachments/assets/8f18cfca-3203-4505-aa42-c9769d54d1a3" />
   
   
   

