hanxdatadog opened a new issue, #54850:
URL: https://github.com/apache/airflow/issues/54850

   ### Description
   
   👋 Dear Airflow community,
   
   Recently we ran some stress tests on Airflow’s asset-based scheduling and 
noticed that the webserver was frequently restarting due to liveness probe 
failures. The liveness probe we were using was:
   ```
   /api/v2/monitor/health
   ```
   
   This was based on the guidance from the old health endpoint response:
   
https://github.com/apache/airflow/blob/31f0eac1e15fee842d451d56c603d9005c30ddcb/airflow-core/src/airflow/api_fastapi/core_api/app.py#L85
   
   From reading the source code, my understanding is that 
`/api/v2/monitor/health` checks the overall health of the metadatabase, 
scheduler, and triggerer. If there’s any slowdown in retrieving health 
information from these components, the webserver gets restarted, which makes 
the UI unavailable. Ideally, we’d like the UI to remain available even if the 
metadb or scheduler is under heavy load.
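
To make the coupling concrete, here is a minimal sketch of the difference between a "deep" health check and a "shallow" liveness check. It is not Airflow's implementation: `deep_health`, `shallow_liveness`, and the check callables are hypothetical names, and the probe timeout is simulated by comparing elapsed time rather than by a real kubelet.

```python
import time
from typing import Callable, Dict


def deep_health(checks: Dict[str, Callable[[], bool]], timeout_s: float) -> bool:
    """Hypothetical deep check: passes only if every backend check
    succeeds and the total time stays within the probe timeout."""
    start = time.monotonic()
    for check in checks.values():
        healthy = check()
        elapsed = time.monotonic() - start
        # A slow backend fails the probe even if it eventually answers.
        if not healthy or elapsed > timeout_s:
            return False
    return True


def shallow_liveness() -> bool:
    """Hypothetical shallow check: only proves that this process can
    still answer a request; it never touches a backend component."""
    return True
```

With a deep check behind the liveness probe, a slow metadatabase query is enough to fail the probe and restart the webserver, while a shallow check keeps reporting the process as alive under the same load.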
   
   What would be the recommended alternative liveness check that doesn’t make 
the webserver’s health dependent on backend components? I see some options, 
such as the execution API health endpoint:
   
https://github.com/apache/airflow/blob/31f0eac1e15fee842d451d56c603d9005c30ddcb/airflow-core/src/airflow/api_fastapi/execution_api/routes/health.py#L30
   
   I also noticed that the official chart for the API server uses the version 
endpoint:
   
https://github.com/apache/airflow/blob/31f0eac1e15fee842d451d56c603d9005c30ddcb/chart/templates/api-server/api-server-deployment.yaml#L194
   
   Any suggestions or guidance would be much appreciated 🙏
   
   ### Use case/motivation
   
   A liveness probe API endpoint for the webserver that does not depend on 
other components
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

