[GitHub] [airflow] IKholopov opened a new pull request, #27155: Metric for raw task return codes

GitBox Wed, 19 Oct 2022 16:23:15 -0700


IKholopov opened a new pull request, #27155:
URL: https://github.com/apache/airflow/pull/27155


   **Problem:** One of the challenges of running Celery workers in a 
containerized environment is detecting when the system terminates a raw task 
instance. 
   For example, if a running task hits the memory limit of the Airflow Celery 
worker container and is terminated by the OOM killer, the only way for a DAG 
author to discover that the task failed because of memory pressure is to guess 
it from the log entry "Task exited with return code negsignal.SIGKILL". 
   
![image](https://user-images.githubusercontent.com/2447492/196821258-e320070a-bca9-40c4-bfe3-b1f82abe7983.png)
   
   This message is a bit cryptic. From the point of view of the engineer 
responsible for Airflow infrastructure management (who is often a different 
person from the DAG authors), it would be much nicer to set up a dashboard that 
displays such events and to set up alerts for them. Of course, it is possible 
to parse the log entries of all tasks, but this is a fragile approach that 
would require additional tooling.
   
   **Proposed solution:** Introduce a metric for the return codes of task 
instances' raw task execution. The proposed structure is a counter named 
`ti.raw_task_return_code.<dag-id>.<task-id>.<return code>`. This makes it 
possible both to detect changes in the frequency of particular return codes 
(like SIGKILL or SIGTERM) across the whole Airflow deployment and to narrow 
failures down to particular tasks.
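   
   As a rough illustration of the naming scheme above, here is a minimal sketch 
rather than the code in this PR. The helper names (`signal_name`, 
`return_code_metric`, `record_return_code`) and the in-memory `counters` object 
are hypothetical stand-ins for a real StatsD-style client, and rendering the 
return code as a plain integer is an assumption. The sketch also assumes a 
POSIX system, where a negative return code means the process was killed by the 
signal with that number.

```python
import signal
from collections import Counter

# Hypothetical in-memory stand-in for a StatsD-style counter backend.
counters = Counter()


def signal_name(return_code):
    """Return the signal name for a negative return code, else None."""
    if return_code < 0:
        try:
            # On POSIX, a return code of -N means "terminated by signal N".
            return signal.Signals(-return_code).name
        except ValueError:
            # The number does not correspond to a known signal.
            return None
    return None


def return_code_metric(dag_id, task_id, return_code):
    """Build the counter name following the structure proposed above."""
    # Assumption: the return code is rendered as a plain integer.
    return f"ti.raw_task_return_code.{dag_id}.{task_id}.{return_code}"


def record_return_code(dag_id, task_id, return_code):
    """Increment the counter for this DAG, task and return code."""
    name = return_code_metric(dag_id, task_id, return_code)
    counters[name] += 1
    return name
```

   With counters named this way, a dashboard could match on the trailing 
return code to track signal-induced exits across the deployment, or on the 
`<dag-id>.<task-id>` part to narrow a spike down to a single task.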
   
   *TODO:* Update documentation. I expect some discussion of this metric in 
this PR, so I want to reach consensus on the name/implementation before 
putting it in the docs. 
   
   


