[
https://issues.apache.org/jira/browse/AIRFLOW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154237#comment-16154237
]
Ash Berlin-Taylor commented on AIRFLOW-1555:
--------------------------------------------
Yes I think this might be the problem. From `redis-cli monitor` around when the
second task started up this appeared:
{noformat}
1504642233.665376 [1 10.128.2.196:53806] "watch" "unacked_mutex"
1504642233.665696 [1 10.128.2.196:53806] "get" "unacked_mutex"
1504642233.666012 [1 10.128.2.196:53806] "unwatch"
1504642233.666702 [1 10.128.2.196:53806] "multi"
1504642233.666726 [1 10.128.2.196:53806] "zadd" "unacked_index"
"1504642233.6628027" "93e866a7-20be-4070-bfcc-39012279756b"
1504642233.666742 [1 10.128.2.196:53806] "hset" "unacked"
"93e866a7-20be-4070-bfcc-39012279756b" "[{\"content-encoding\": \"utf-8\",
\"headers\": {\"expires\": null, \"lang\": \"py\", \"retries\": 0,
\"argsrepr\": \"['airflow run brand_in
dex_fetcher fetch_cn_unfiltered 2017-07-31T00:00:00 --local --pool
brand_index_api -sd /home/airflow/airflow/dags/brand_index.py']\", \"group\":
null, \"eta\": null, \"root_id\": \"d400e1cd-c086-4661-9821-2001bdcaf6fc\",
\"timelimit\": [n
ull, null], \"redelivered\": true, \"id\":
\"d400e1cd-c086-4661-9821-2001bdcaf6fc\", \"parent_id\": null, \"kwargsrepr\":
\"{}\", \"task\": \"airflow.executors.celery_executor.execute_command\",
\"origin\": \"gen17480@ip-10-128-2-196\"},
\"body\":
\"W1siYWlyZmxvdyBydW4gYnJhbmRfaW5kZXhfZmV0Y2hlciBmZXRjaF9jbl91bmZpbHRlcmVkIDIwMTctMDctMzFUMDA6MDA6MDAgLS1sb2NhbCAtLXBvb2wgYnJhbmRfaW5kZXhfYXBpIC1zZCAvaG9tZS9haXJmbG93L2FpcmZsb3cvZGFncy9icmFuZF9pbmRleC5weSJdLCB7fSwgeyJjaG9yZCI6IG
51bGwsICJlcnJiYWNrcyI6IG51bGwsICJjaGFpbiI6IG51bGwsICJjYWxsYmFja3MiOiBudWxsfV0=\",
\"properties\": {\"reply_to\": \"6401c86e-a426-3117-89de-645a85858e1c\",
\"delivery_info\": {\"routing_key\": \"default\", \"exchange\": \"\"},
\"body_encod
ing\": \"base64\", \"correlation_id\":
\"d400e1cd-c086-4661-9821-2001bdcaf6fc\", \"priority\": 0, \"delivery_tag\":
\"93e866a7-20be-4070-bfcc-39012279756b\", \"delivery_mode\": 2},
\"content-type\": \"application/json\"}, \"\", \"default\
"]"
{noformat}
which points to me at the linked timeout issue. Next I will try hacking locally
the Celery Executor and setting the visibility timeout to 24hours.
> Backfill job gets killed 1 hour after starting
> ----------------------------------------------
>
> Key: AIRFLOW-1555
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1555
> Project: Apache Airflow
> Issue Type: Bug
> Components: backfill
> Affects Versions: 1.8.1
> Environment: Airflow 1.8.1
> Celery 3.1.23 with one coordinator, redis and 3 workers
> Python 3.5.2
> Debian GNU/Linux 8.9 (jessie)
> snakebite uninstalled because it does not work with Python 3.5.2
> MySQL 5.6
> Reporter: Shreyas Joshi
>
> *What happens?*
> After running for an hour tasks in a backfill die. The task log shows:
> {code}
> ...
> [2017-08-31 06:48:06,425] {jobs.py:2172} WARNING - Recorded pid 5451 is not a
> descendant of the current pid 21571
> [2017-08-31 06:48:11,884] {jobs.py:2179} WARNING - State of this instance has
> been externally set to failed. Taking the poison pill. So long.
> [2017-08-31 06:48:11,892] {helpers.py:220} WARNING - Terminating descendant
> processes of [<REDACTED>] PID: 5451
> [2017-08-31 06:48:11,892] {helpers.py:224} WARNING - Terminating descendant
> process [<REDACTED>] PID: 5459
> [2017-08-31 06:48:11,896] {helpers.py:231} WARNING - Waiting up to 5s for
> processes to exit...
> ...
> {code}
> The backfill log shows:
> {code}
> ...
> [2017-08-31 11:23:44,025] {jobs.py:1729} ERROR - Executor reports task
> instance <TaskInstance: dag_name.task_name 2017-08-30 02:00:00 [running]>
> finished (failed) although the task says its running. Was the task killed
> externally?
> [2017-08-31 11:23:44,025] {models.py:1427} ERROR - Executor reports task
> instance <TaskInstance: analytics_events.page_views 2017-08-30 02:00:00
> [running]> finished (failed) although the task says its running. Was the task
> killed externally?
> ...
> {code}
> The Celery UI has the following exception, but status shows "success"
> {code}
> Traceback (most recent call last):
> File
> "/data/airflow-sources/.venv/lib/python3.5/site-packages/airflow/executors/celery_executor.py",
> line 56, in execute_command
> subprocess.check_call(command, shell=True)
> File "/usr/share/pyenv/versions/3.5.2/lib/python3.5/subprocess.py", line
> 581, in check_call
> raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command 'airflow run dag_name task_name
> 2017-08-30T02:00:00 --pickle 14 --local' returned non-zero exit status 1
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File
> "/data/airflow-sources/.venv/lib/python3.5/site-packages/celery/app/trace.py",
> line 240, in trace_task
> R = retval = fun(*args, **kwargs)
> File
> "/data/airflow-sources/.venv/lib/python3.5/site-packages/celery/app/trace.py",
> line 438, in __protected_call__
> return self.run(*args, **kwargs)
> File
> "/data/airflow-sources/.venv/lib/python3.5/site-packages/airflow/executors/celery_executor.py",
> line 59, in execute_command
> raise AirflowException('Celery command failed')
> airflow.exceptions.AirflowException: Celery command failed
> {code}
> The tasks have timeouts explicitly set to 6 hours and SLA set to 5 hours. In
> the course of debugging this I also set dagrun_timeout to 6 hours. It did not
> make a difference.
> Here is a thread on [stackoverflow |
> https://stackoverflow.com/questions/44274381/airflow-long-running-task-in-subdag-marked-as-failed-after-an-hour]
> that talks about a very similar issue.
> These tasks run fine on our older Airflow 1.7. This is currently blocking our
> upgrade.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)