Hello, I have an issue with my subdags occasionally getting stuck, which seems to be related to their subtasks being terminated. The subdag task itself is marked as failed, but the child tasks are left in the 'up_for_retry' state. I am using the SequentialExecutor for the SubDag and the CeleryExecutor for task execution (RabbitMQ broker). We are using Airflow 1.10.1.
In Sentry and my logs I see "AirflowException: LocalTaskJob received SIGTERM signal", and I see "AirflowException: Task received SIGTERM signal" from what looks like the same command executing the child task.

In the logs for the subdag task I see:

2019-01-04 00:21:21 INFO| Dependencies not met for <TaskInstance: dag_name.subdag_name 2019-01-03T00:00:00+00:00 [running]>, dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2019-01-04 00:18:41.550113+00:00.
2019-01-04 00:21:21 INFO| Dependencies not met for <TaskInstance: dag_name.subdag_name 2019-01-03T00:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.

The first-attempt log for the subdag's child task is:

2019-01-04 00:19:03 INFO| Executing <Task(PythonOperator): subdag_task> on 2019-01-03T00:00:00+00:00
2019-01-04 00:19:03 INFO| Running: ['bash', '-c', 'airflow run dag_name.subdag_name subdag_task 2019-01-03T00:00:00+00:00 --job_id 362042 --raw -sd /code/src/dags/mydag.py --cfg_path /tmp/tmp735cg6c3']
2019-01-04 00:19:37 INFO| Job 362042: Subtask subdag_task
2019-01-04 00:19:35 INFO| setting.configure_orm(): Using pool settings. pool_size=15, pool_recycle=3600
2019-01-04 00:25:00 INFO| Sending Signals.SIGTERM to GPID 21683
2019-01-04 00:25:00 INFO| Process psutil.Process(pid=21683 (terminated)) (21683) terminated with exit code 15

The final-attempt log is unavailable, and the task is left in the 'up_for_retry' state. This isn't caused by a timeout on the task or subdag task being reached. We experience this issue wherever subdags are used, but not consistently. We do see fairly high DB CPU, 70-80% (Postgres), when this occurs, and I am wondering whether it could be caused by a heartbeat not arriving on time?
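To make my heartbeat theory concrete: my (possibly wrong) understanding is that a running task is reaped with SIGTERM once its job's latest heartbeat is older than scheduler_zombie_task_threshold. A minimal sketch of that comparison, assuming it works the way I describe (the function and variable names below are mine for illustration, not Airflow internals):

```python
from datetime import datetime, timedelta

def is_zombie(latest_heartbeat: datetime,
              now: datetime,
              zombie_threshold_secs: int = 300) -> bool:
    """Illustrative check: a job whose last heartbeat is older than the
    threshold would be treated as a zombie and its task SIGTERMed."""
    limit = now - timedelta(seconds=zombie_threshold_secs)
    return latest_heartbeat < limit

now = datetime(2019, 1, 4, 0, 25, 0)

# Last heartbeat 390s earlier, e.g. if the heartbeat UPDATE stalled
# against a Postgres instance running at 70-80% CPU:
stalled = datetime(2019, 1, 4, 0, 18, 30)
print(is_zombie(stalled, now))                      # gap > 300s
print(is_zombie(now - timedelta(seconds=5), now))   # healthy heartbeat
```

If that model is right, a few minutes of heartbeat writes being delayed by DB load would be enough to explain the 00:25:00 SIGTERM.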
Our scheduler configuration is (the default):

job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
run_duration = -1
min_file_process_interval = 0
dag_dir_list_interval = 300
print_stats_interval = 30
scheduler_zombie_task_threshold = 300
catchup_by_default = False
max_tis_per_query = 512
max_threads = 2
authenticate = False

Any help or pointers would be greatly appreciated,
Rob
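If the heartbeat theory holds, these are presumably the knobs involved; something like the following is what I would try loosening (the values here are illustrative guesses, not settings I have validated):

```ini
; airflow.cfg -- candidate changes, untested
[scheduler]
; allow more slack before a missed heartbeat matters
job_heartbeat_sec = 10
; give jobs longer before they are treated as zombies and SIGTERMed
scheduler_zombie_task_threshold = 600
```

I would welcome any correction if these settings are unrelated to the SIGTERMs we are seeing.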
