IKholopov opened a new pull request, #27155: URL: https://github.com/apache/airflow/pull/27155
**Problem:** One of the challenges of running Celery workers in containerized environment is detecting system termination of a raw task instance. For example, if running task hits Airflow celery worker container memory limit and terminated by OOM killer, the only way for a DAG author to discover that task failed because of the memory pressure - is to guess it from the log entry "Task exited with return code negsignal.SIGKILL".  This message is a bit cryptic and from the point of the engineer responsible for Airflow infrastructure management (who is often a different person from the DAG authors) it would be much nicer to setup the dashboard that could display such events and setup alerts for them. Of course, it is possible to parse the log entries of all tasks, but this is a fragile invariant which would require additional tooling. **Proposed solution:** Introduce a metric for task instances raw task execution return codes. The proposed structure is a counter with the name: `ti.raw_task_return_code.<dag-id>.<task-id>.<return code>`. This will allow to both detect the changes in the frequency of particular return codes (like SIGKILL or SIGTERM) across the whole Airflow deployment and to scope down failures to particular tasks. *TODO:* Update documentation - I expect to have some discussion around the idea of this metric in this PR, so I want to have a consensus on the name/implementation before putting it in the docs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
