[ https://issues.apache.org/jira/browse/AIRFLOW-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
James Davidheiser updated AIRFLOW-2827:
---------------------------------------
Description:

We have a DAG with ~500 tasks, running on Airflow set up in Kubernetes with RabbitMQ, using a setup derived fairly heavily from [https://github.com/mumoshu/kube-airflow]. Occasionally, we hit spurious Celery execution failures (possibly related to https://issues.apache.org/jira/browse/AIRFLOW-2011), with the worker throwing errors that look like this:

{code:java}
[2018-07-30 11:04:26,812: ERROR/ForkPoolWorker-9] Task airflow.executors.celery_executor.execute_command[462de800-ad3f-4151-90bf-9155cc6c66f6] raised unexpected: AirflowException('Celery command failed',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/airflow/executors/celery_executor.py", line 55, in execute_command
    raise AirflowException('Celery command failed')
AirflowException: Celery command failed
{code}

When these tasks fail, they send a "task failed" email that carries very little information about the state of the failure. The logs for the task run are empty, because the task never actually did anything: the error message was generated by the worker itself. The task also does not retry, so when something goes wrong at the Celery layer, it fails outright instead of trying again.

This may be the same issue reported in https://issues.apache.org/jira/browse/AIRFLOW-1844, but I am not sure because there is not much detail there.
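For context on why the task log is empty: the traceback points at Airflow's `execute_command` Celery task, which shells out to the task command in a subprocess and collapses any non-zero exit into a bare `AirflowException` before the task instance ever writes a log. A minimal self-contained sketch of that pattern (an illustration, not the actual Airflow source; `AirflowException` is stubbed here) looks like:

```python
import subprocess


class AirflowException(Exception):
    """Stand-in for airflow.exceptions.AirflowException (assumed here)."""


def execute_command(command):
    # In Airflow this function is registered as a Celery task; the worker
    # runs `airflow run ...` in a shell, and any failure is collapsed into
    # a generic exception. Because the failure happens before the task
    # instance runs, the task's own log stays empty and Airflow's per-task
    # retry logic never gets a chance to fire.
    try:
        subprocess.check_call(command, shell=True, close_fds=True)
    except subprocess.CalledProcessError:
        raise AirflowException('Celery command failed')
```

This is why a Celery-layer failure looks like a task failure with no log output: the only record of what went wrong lives in the worker's log, not the task's.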
> Tasks that fail with spurious Celery issues are not retried
> -----------------------------------------------------------
>
>           Key: AIRFLOW-2827
>           URL: https://issues.apache.org/jira/browse/AIRFLOW-2827
>       Project: Apache Airflow
>    Issue Type: Bug
>      Reporter: James Davidheiser
>      Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)