James Davidheiser created AIRFLOW-2827:
------------------------------------------
Summary: Tasks that fail with spurious Celery issues are not retried
Key: AIRFLOW-2827
URL: https://issues.apache.org/jira/browse/AIRFLOW-2827
Project: Apache Airflow
Issue Type: Wish
Reporter: James Davidheiser

We have a DAG with ~500 tasks, running on Airflow set up in Kubernetes with RabbitMQ, using a setup derived pretty heavily from [https://github.com/mumoshu/kube-airflow]. Occasionally, we hit spurious Celery execution failures (possibly related to #2011), resulting in the worker throwing errors that look like this:

```
[2018-07-30 11:04:26,812: ERROR/ForkPoolWorker-9] Task airflow.executors.celery_executor.execute_command[462de800-ad3f-4151-90bf-9155cc6c66f6] raised unexpected: AirflowException('Celery command failed',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/airflow/executors/celery_executor.py", line 55, in execute_command
    raise AirflowException('Celery command failed')
AirflowException: Celery command failed
```

When these tasks fail, they send a "task failed" email that contains very little information about the failure, and the logs for the task run are empty, because the task never actually ran anything: the error message was generated by the worker. The task also does not retry, so if something goes wrong in Celery, the task simply fails outright instead of being attempted again.

This may be the same issue reported in #1844, but I am not sure because there is not much detail there.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
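For context, the `execute_command` task named in the traceback (line 55 of `airflow/executors/celery_executor.py`) essentially shells out to the Airflow CLI and converts any non-zero exit into the opaque "Celery command failed" exception seen above, which is why the task log stays empty and the real cause is lost. A minimal stand-alone sketch of that behavior (simplified: the `AirflowException` class here is a stand-in for `airflow.exceptions.AirflowException`, and the real function is registered as a Celery task on the executor's app):

```python
import subprocess


class AirflowException(Exception):
    """Stand-in for airflow.exceptions.AirflowException."""


def execute_command(command):
    # The real Celery task runs the `airflow run ...` CLI command in a
    # subprocess. Any non-zero exit status is collapsed into a single
    # generic exception, so the worker log shows only
    # "Celery command failed" with no detail about the underlying error,
    # and no task-level retry is triggered.
    try:
        subprocess.check_call(command, shell=True)
    except subprocess.CalledProcessError:
        raise AirflowException('Celery command failed')
```

Because the failure is raised inside the Celery worker rather than inside the task itself, it never reaches the task's own retry logic, which is the behavior this issue asks to change.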