rafidka edited a comment on issue #13026:
URL: https://github.com/apache/airflow/issues/13026#issuecomment-747639539


   I repeated the experiment on Airflow 2.0.0RC2 and produced more statistical 
data which you can find in the Excel sheet below, but here is a summary:
   
   scheduler_heartbeat_sec value (seconds)|Average interval between scheduler_heartbeat metric emissions (seconds)
   -|-
   1|2.52057172
   2|3.85960122
   3|4.76818162
   5|6.49515588
   10|11.36376658
   30|30.89824894
   
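   To make the gap easier to read, here is a minimal sketch that computes the 
absolute and relative overhead for each row of the table above. The `observed` 
dict and the `overhead` helper are names made up for this illustration; only 
the numbers come from the measurements.

```python
# Configured scheduler_heartbeat_sec -> observed average interval (seconds),
# copied from the table above.
observed = {
    1: 2.52057172,
    2: 3.85960122,
    3: 4.76818162,
    5: 6.49515588,
    10: 11.36376658,
    30: 30.89824894,
}


def overhead(configured, measured):
    """Return (extra seconds, extra seconds as a fraction of the configured value)."""
    extra = measured - configured
    return extra, extra / configured


for configured, measured in observed.items():
    extra, relative = overhead(configured, measured)
    print(f"{configured:>2}s configured: +{extra:.2f}s ({relative:.0%})")
```

   The absolute overhead stays between roughly 0.9 and 1.9 seconds in every 
row, so the relative error shrinks as the configured value grows.
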
   I ran this on the same machine I mentioned above (an Amazon r5.4xlarge), 
which is pretty powerful, and I confirmed that CPU load was low (below 10%, 
mostly from Airflow itself). I can retry this on a personal laptop if you are 
not confident in results generated from a single machine (admittedly, neither 
am I).
   
   I cannot tell for certain whether this is just a metrics issue, but I did 
look at the code and I believe it is an actual scheduling issue, not just a 
metrics one (though I must admit my understanding of the Airflow code base is 
still limited). In my opinion, this justifies some investigation to see what 
is going on. In particular, I would like to suggest:
   
   1. Investigate whether this is just a metrics issue or indeed an issue with 
the scheduler.
   2. If it is a metric issue, I think it is important to fix. The 
scheduler_heartbeat is an important metric that can be used to judge the health 
of the system.
   3. If it is not just a metric issue, then that's probably even more 
important 😊 
   4. Admittedly, the higher the `scheduler_heartbeat_sec` value, the more 
accurate the metric is (which suggests this is an actual scheduling issue, not 
just a metrics one). So if no fix is intended, then at least the default value 
in airflow.cfg should be updated to, say, 30 seconds. In fact, if no fix is 
intended, I would even suggest enforcing a minimum value for the 
`scheduler_heartbeat_sec` config, or at least logging a warning if the user 
specifies a low value; there isn't much point in allowing the user to specify 
5 seconds when they will get an average of 6.5 instead.
   
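   As a rough sketch of the warning suggested in point 4, something like the 
following could run when the config is loaded. This is not existing Airflow 
code: the function name `check_heartbeat_sec` and the threshold 
`MIN_RELIABLE_HEARTBEAT_SEC` are hypothetical, and the value 5 is only a 
placeholder.

```python
import logging

log = logging.getLogger(__name__)

# Hypothetical threshold, not an Airflow setting; chosen only for illustration.
MIN_RELIABLE_HEARTBEAT_SEC = 5


def check_heartbeat_sec(value, minimum=MIN_RELIABLE_HEARTBEAT_SEC):
    """Return True if value >= minimum; otherwise log a warning and return False."""
    if value < minimum:
        log.warning(
            "scheduler_heartbeat_sec=%s is below %s; the observed heartbeat "
            "interval may be noticeably longer than configured.",
            value,
            minimum,
        )
        return False
    return True
```

   A caller could either just log the warning, as here, or reject the value 
outright if a hard minimum is preferred.
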
   I can help with this investigation if you agree with me that it is important 
to do (though probably won't be able to do so before the new year). Otherwise, 
feel free to resolve (though I still think at least point 4 above is important 
if we think that scheduler interval accuracy is not important.)
   
   ## Statistical Data for Different Runs
   
   Below is a snapshot of the Excel sheet I mentioned above. I can upload the 
Excel sheet file itself if you like, in which case please advise where I should 
upload it to.
   
   
![image](https://user-images.githubusercontent.com/442447/102529079-44c31c00-4054-11eb-81c5-08a4a17efac1.png)
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

