[ 
https://issues.apache.org/jira/browse/AIRFLOW-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Davidheiser updated AIRFLOW-2827:
---------------------------------------
    Description: 
We have a DAG with ~500 tasks, running on Airflow set up in Kubernetes with
RabbitMQ, using a setup derived largely from
[https://github.com/mumoshu/kube-airflow]. Occasionally we hit spurious
Celery execution failures (possibly related to
https://issues.apache.org/jira/browse/AIRFLOW-2011), resulting in the worker
throwing errors like this:

{code:java}
[2018-07-30 11:04:26,812: ERROR/ForkPoolWorker-9] Task airflow.executors.celery_executor.execute_command[462de800-ad3f-4151-90bf-9155cc6c66f6] raised unexpected: AirflowException('Celery command failed',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/airflow/executors/celery_executor.py", line 55, in execute_command
    raise AirflowException('Celery command failed')
AirflowException: Celery command failed
{code}
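
For context, the exception comes from the Celery executor's execute_command
task, which shells out to the airflow CLI and raises a generic
AirflowException whenever the subprocess exits non-zero. A rough sketch of
that code path, reconstructed from the traceback above (not verbatim from
any particular Airflow release):

{code:python}
# Rough sketch of execute_command in airflow/executors/celery_executor.py,
# reconstructed from the traceback above; not verbatim from any release.
import subprocess

from celery import Celery

from airflow.exceptions import AirflowException

# In Airflow this app is configured from airflow.cfg; simplified here.
app = Celery('airflow.executors.celery_executor')


@app.task
def execute_command(command):
    # The scheduler enqueues a full "airflow run ..." CLI string and the
    # worker simply shells out to it.
    try:
        subprocess.check_call(command, shell=True)
    except subprocess.CalledProcessError:
        # Any non-zero exit -- including failures that happen before the
        # task process ever starts -- collapses into this one generic
        # exception, so the task instance logs stay empty and no retry
        # is scheduled by the executor.
        raise AirflowException('Celery command failed')
{code}
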
When these tasks fail, they send a "task failed" email that contains very
little information about what actually failed. The logs for the task run are
empty, because the task never actually ran; the error message was generated
by the worker itself. Also, the task does not retry, so if something goes
wrong at the Celery layer, the task simply fails outright instead of trying
again.
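
To make the retry expectation concrete, here is a minimal, hypothetical DAG
(names and values are illustrative, not taken from our deployment). Even
with retries configured like this, a Celery-level failure marks the task
failed outright:

{code:python}
# Hypothetical DAG; names and values are illustrative only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'retries': 3,                         # we would expect up to 3 retries
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    dag_id='example_retry_expectation',
    default_args=default_args,
    start_date=datetime(2018, 7, 1),
    schedule_interval='@daily',
)

# If execute_command fails before this task ever runs, the task instance
# is marked failed with empty logs and the retries above are never used.
do_work = BashOperator(task_id='do_work', bash_command='echo hello', dag=dag)
{code}
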
This may be the same issue reported in
https://issues.apache.org/jira/browse/AIRFLOW-1844, but I am not sure, since
that report has little detail.

> Tasks that fail with spurious Celery issues are not retried
> -----------------------------------------------------------
>
>                 Key: AIRFLOW-2827
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2827
>             Project: Apache Airflow
>          Issue Type: Bug
>            Reporter: James Davidheiser
>            Priority: Major


