laserpedro edited a comment on issue #18041:
URL: https://github.com/apache/airflow/issues/18041#issuecomment-934113162


   In my case, the failures follow two patterns:
   
   Case 1: a task in the DAG takes a long time to finish (because it is doing 
some computations or inserting a large amount of data into a DB) and its 
execution time is >= the heartbeat interval. After incorporating [this 
patch](https://github.com/apache/airflow/pull/16289/files), which was supposed to 
fix this issue, I was still getting this error. CPU usage was low on both 
the scheduler and Postgres, so the problem was not resource related. After 
checking, I found 
[this](https://stackoverflow.com/questions/65380492/why-are-my-airflow-tasks-being-externally-set-to-failed/65380493#65380493)
 on Stack Overflow and adjusted my config so that it now reads:
   
   ``` 
   scheduler_heartbeat_sec = 200
   
   scheduler_health_check_threshold = 600
   ```
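
   For reference, both keys belong to the `[scheduler]` section of 
`airflow.cfg`. The following is only a sketch of a sanity check, not part of 
Airflow: it parses a config fragment with Python's standard `configparser` and 
verifies that the health check threshold is larger than the heartbeat 
interval. The helper name `check_scheduler_timing` and the rule it enforces 
are assumptions made for illustration.

```python
import configparser
from typing import Tuple

# Fragment mirroring the values quoted above; in airflow.cfg both keys
# sit under the [scheduler] section.
AIRFLOW_CFG_FRAGMENT = """
[scheduler]
scheduler_heartbeat_sec = 200
scheduler_health_check_threshold = 600
"""


def check_scheduler_timing(cfg_text: str) -> Tuple[int, int]:
    """Hypothetical helper: return (heartbeat, threshold) in seconds.

    Raises ValueError when the threshold does not exceed the heartbeat
    interval (an assumed rule of thumb, not an Airflow-enforced check).
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    heartbeat = parser.getint("scheduler", "scheduler_heartbeat_sec")
    threshold = parser.getint("scheduler", "scheduler_health_check_threshold")
    if threshold <= heartbeat:
        raise ValueError(
            f"threshold ({threshold}s) should exceed heartbeat ({heartbeat}s)"
        )
    return heartbeat, threshold


heartbeat, threshold = check_scheduler_timing(AIRFLOW_CFG_FRAGMENT)
print(f"heartbeat={heartbeat}s, health check threshold={threshold}s")
```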
   
   I have relaunched the DAGs that take long to process (by long I mean an 
execution time > heartbeat interval) and so far I have not received any SIGTERM 
signal.
   
   Case 2: a class inheriting from BaseOperator was hammering the scheduler by 
using a `poke_interval` < 1 min, which the official documentation advises 
against when used in `poke` mode.
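
   One way to avoid reintroducing a too-short interval is to clamp the value 
before handing it to the sensor. This is a minimal sketch in plain Python, not 
an Airflow API: `safe_poke_interval` and `MIN_POKE_INTERVAL` are hypothetical 
names, and the 1 minute floor is simply the limit mentioned above.

```python
from datetime import timedelta

# Floor taken from the 1 minute limit mentioned in the comment above.
MIN_POKE_INTERVAL = timedelta(minutes=1)


def safe_poke_interval(requested_seconds: float) -> float:
    """Hypothetical guard: never return fewer seconds than the floor."""
    return max(float(requested_seconds), MIN_POKE_INTERVAL.total_seconds())


# A 5 second request is raised to the floor; a 5 minute request is kept.
print(safe_poke_interval(5))
print(safe_poke_interval(300))
```

   The returned number of seconds would then be passed as the sensor's 
`poke_interval` argument.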
   
   By fixing the interval on the sensors, modifying the config, and 
incorporating the fix, I finally seem to have something that looks stable using 
airflow > 2.0.0.
   
   
   I wish I could give a more technical solution on this ... 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

