lexmiln commented on issue #13637:
URL: https://github.com/apache/airflow/issues/13637#issuecomment-1421227055

   We also saw this 100% CPU issue in our Kubernetes cluster.
   
   We later observed that liveness checks on the scheduler pod were 
consistently timing out.
   
   Manually running `time airflow jobs check` in the scheduler container (as 
the liveness probe does) showed that this command takes about a minute to run 
to completion with our configuration (a 500m CPU limit and 2 GB of RAM).
   
   Given this, we increased the scheduler liveness probe timeout and interval. 
   
   ```
     scheduler:
       # The liveness probe takes a while to run on our cluster due to limited
       # resources, so we run it only very occasionally, and we give it lots of
       # time to complete.
       livenessProbe:
         initialDelaySeconds: 120
         timeoutSeconds: 180
         failureThreshold: 5
         periodSeconds: 600
         command: ~
   ```
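
One trade-off of these values is that a genuinely stuck scheduler takes longer to be restarted. As a rough sketch, assuming every probe fails by hitting its timeout and ignoring kubelet scheduling jitter, the delay between the first failing probe and the restart can be estimated as follows. The function name is hypothetical; the parameters mirror the probe fields in the values above.

```python
def seconds_until_restart(period_seconds, timeout_seconds, failure_threshold):
    """Approximate seconds from the start of the first failing probe to restart.

    Assumes each failing probe runs until its timeout and that the container
    is restarted once failure_threshold consecutive probes have failed.
    """
    return (failure_threshold - 1) * period_seconds + timeout_seconds


# Values taken from the livenessProbe block above.
delay = seconds_until_restart(period_seconds=600, timeout_seconds=180,
                              failure_threshold=5)
print(delay, "seconds, about", delay // 60, "minutes")
```

With the settings above this comes to 2580 seconds, or about 43 minutes, so the lower CPU usage is bought with slower detection of a dead scheduler.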
   
   Since we made this change, the pod averages around 250m CPU utilisation (i.e. 
50% of its limit). A possible explanation for the CPU saturation that many are 
seeing is that the container is perpetually tied up trying to complete 
execution of the liveness probe in too short a window.
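
To make that hypothesis concrete, here is a small sketch that estimates what fraction of wall-clock time a probe keeps the container busy. It assumes each probe run takes roughly the one minute measured above no matter how often it is started, which is a simplification. The 60-second period in the second call is a hypothetical comparison value, not a claim about any chart default.

```python
def probe_duty_cycle(run_seconds, period_seconds):
    """Fraction of wall-clock time spent running the probe, capped at 1.0."""
    return min(run_seconds / period_seconds, 1.0)


# Roughly one minute per run, started every 600 s (the settings above).
print(probe_duty_cycle(60, 600))

# Hypothetical shorter period: the probe would be running back to back.
print(probe_duty_cycle(60, 60))
```

Under these assumptions the probe occupies about a tenth of the time with a 600-second period, while any period at or below the run time leaves a probe running essentially all the time.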

