[ https://issues.apache.org/jira/browse/AIRFLOW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anonymous reassigned AIRFLOW-1555:
----------------------------------

    Assignee: Shreyas Joshi

> Backfill job gets killed 1 hour after starting
> ----------------------------------------------
>
>                 Key: AIRFLOW-1555
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1555
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: backfill
>    Affects Versions: 1.8.1
>         Environment: Airflow 1.8.1
> Celery 3.1.23 with one coordinator, redis and 3 workers
> Python 3.5.2
> Debian GNU/Linux 8.9 (jessie)
> snakebite uninstalled because it does not work with Python 3.5.2
> MySQL 5.6
>            Reporter: Shreyas Joshi
>            Assignee: Shreyas Joshi
>            Priority: Major
>             Fix For: 1.10.0
>
>
> *What happens?*
> After running for an hour, tasks in a backfill die. The task log shows:
> {code}
> ...
> [2017-08-31 06:48:06,425] {jobs.py:2172} WARNING - Recorded pid 5451 is not a descendant of the current pid 21571
> [2017-08-31 06:48:11,884] {jobs.py:2179} WARNING - State of this instance has been externally set to failed. Taking the poison pill. So long.
> [2017-08-31 06:48:11,892] {helpers.py:220} WARNING - Terminating descendant processes of [<REDACTED>] PID: 5451
> [2017-08-31 06:48:11,892] {helpers.py:224} WARNING - Terminating descendant process [<REDACTED>] PID: 5459
> [2017-08-31 06:48:11,896] {helpers.py:231} WARNING - Waiting up to 5s for processes to exit...
> ...
> {code}
> The backfill log shows:
> {code}
> ...
> [2017-08-31 11:23:44,025] {jobs.py:1729} ERROR - Executor reports task instance <TaskInstance: dag_name.task_name 2017-08-30 02:00:00 [running]> finished (failed) although the task says its running. Was the task killed externally?
> [2017-08-31 11:23:44,025] {models.py:1427} ERROR - Executor reports task instance <TaskInstance: analytics_events.page_views 2017-08-30 02:00:00 [running]> finished (failed) although the task says its running. Was the task killed externally?
> ...
> {code}
> The Celery UI shows the following exception, even though the task status shows "success":
> {code}
> Traceback (most recent call last):
>   File "/data/airflow-sources/.venv/lib/python3.5/site-packages/airflow/executors/celery_executor.py", line 56, in execute_command
>     subprocess.check_call(command, shell=True)
>   File "/usr/share/pyenv/versions/3.5.2/lib/python3.5/subprocess.py", line 581, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command 'airflow run dag_name task_name 2017-08-30T02:00:00 --pickle 14 --local' returned non-zero exit status 1
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "/data/airflow-sources/.venv/lib/python3.5/site-packages/celery/app/trace.py", line 240, in trace_task
>     R = retval = fun(*args, **kwargs)
>   File "/data/airflow-sources/.venv/lib/python3.5/site-packages/celery/app/trace.py", line 438, in __protected_call__
>     return self.run(*args, **kwargs)
>   File "/data/airflow-sources/.venv/lib/python3.5/site-packages/airflow/executors/celery_executor.py", line 59, in execute_command
>     raise AirflowException('Celery command failed')
> airflow.exceptions.AirflowException: Celery command failed
> {code}
> The tasks have timeouts explicitly set to 6 hours and an SLA of 5 hours. In the course of debugging this I also set dagrun_timeout to 6 hours; it did not make a difference. A sketch of this configuration follows below.
> Here is a thread on [Stack Overflow|https://stackoverflow.com/questions/44274381/airflow-long-running-task-in-subdag-marked-as-failed-after-an-hour] that describes a very similar issue.
> These tasks run fine on our older Airflow 1.7. This is currently blocking our upgrade.
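> For reference, a minimal sketch of a DAG wired up with these limits, assuming the 6-hour task timeout is the usual execution_timeout argument. The dag and task ids are the placeholders from the logs above; the schedule, dates and command are illustrative, not our production code:
> {code}
> # Minimal sketch, assuming an Airflow 1.8-style DAG definition.
> # 'dag_name'/'task_name' are placeholders; only the three timeout
> # settings reflect the values described in this report.
> from datetime import datetime, timedelta
>
> from airflow import DAG
> from airflow.operators.bash_operator import BashOperator
>
> default_args = {
>     'owner': 'airflow',
>     'start_date': datetime(2017, 8, 1),
>     # Hard per-task limit: the task is failed after 6 hours.
>     'execution_timeout': timedelta(hours=6),
>     # SLA only triggers a miss notification; it never kills the task.
>     'sla': timedelta(hours=5),
> }
>
> dag = DAG(
>     dag_id='dag_name',
>     default_args=default_args,
>     schedule_interval='@daily',
>     # Added while debugging; made no difference to the 1-hour kill.
>     dagrun_timeout=timedelta(hours=6),
> )
>
> long_task = BashOperator(
>     task_id='task_name',
>     bash_command='sleep 7200',  # stands in for a task that runs > 1 hour
>     dag=dag,
> )
> {code}
> Note that none of these limits is anywhere near one hour, so whatever is externally setting the task state to failed does not appear to be any of them.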


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)