Hello, I have an issue with my subdags occasionally getting stuck, which seems to be related to their subtasks being terminated. The subdag task itself is marked as failed, but the child tasks are left in the 'up_for_retry' state. I am using the SequentialExecutor for the SubDag and the CeleryExecutor for task execution (RabbitMQ broker). We are using Airflow 1.10.1.
In Sentry and my logs I see "AirflowException: LocalTaskJob received SIGTERM signal", and I see "AirflowException: Task received SIGTERM signal" from what looks like the same command executing the child task.

In the logs for the subdag task I see:

2019-01-04 00:21:21 INFO| Dependencies not met for <TaskInstance: dag_name.subdag_name 2019-01-03T00:00:00+00:00 [running]>, dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2019-01-04 00:18:41.550113+00:00.
2019-01-04 00:21:21 INFO| Dependencies not met for <TaskInstance: dag_name.subdag_name 2019-01-03T00:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.

The first-attempt log for the subdag's child task is:

2019-01-04 00:19:03 INFO| Executing <Task(PythonOperator): subdag_task> on 2019-01-03T00:00:00+00:00
2019-01-04 00:19:03 INFO| Running: ['bash', '-c', 'airflow run dag_name.subdag_name subdag_task 2019-01-03T00:00:00+00:00 --job_id 362042 --raw -sd /code/src/dags/mydag.py --cfg_path /tmp/tmp735cg6c3']
2019-01-04 00:19:37 INFO| Job 362042: Subtask subdag_task
2019-01-04 00:19:35 INFO| setting.configure_orm(): Using pool settings. pool_size=15, pool_recycle=3600
2019-01-04 00:25:00 INFO| Sending Signals.SIGTERM to GPID 21683
2019-01-04 00:25:00 INFO| Process psutil.Process(pid=21683 (terminated)) (21683) terminated with exit code 15

The final-attempt log is unavailable, and the task is left in the 'up_for_retry' state. This isn't caused by a timeout on the task or subdag task being reached. We experience this issue wherever subdags are used, but not consistently. We do see fairly high DB CPU, 70-80% (Postgres), when this occurs, and I am wondering whether it could be caused by a heartbeat not arriving on time?
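To make my heartbeat theory concrete: my (possibly wrong) understanding is that a running task is reaped with SIGTERM once its job's latest heartbeat is older than scheduler_zombie_task_threshold. A minimal sketch of that comparison, assuming it works the way I describe (the function and variable names below are mine for illustration, not Airflow internals):

```python
from datetime import datetime, timedelta

def is_zombie(latest_heartbeat: datetime,
              now: datetime,
              zombie_threshold_secs: int = 300) -> bool:
    """Illustrative check: a job whose last heartbeat is older than the
    threshold would be treated as a zombie and its task SIGTERMed."""
    limit = now - timedelta(seconds=zombie_threshold_secs)
    return latest_heartbeat < limit

now = datetime(2019, 1, 4, 0, 25, 0)

# Last heartbeat 390s earlier, e.g. if the heartbeat UPDATE stalled
# against a Postgres instance running at 70-80% CPU:
stalled = datetime(2019, 1, 4, 0, 18, 30)
print(is_zombie(stalled, now))                      # gap > 300s
print(is_zombie(now - timedelta(seconds=5), now))   # healthy heartbeat
```

If that model is right, a few minutes of heartbeat writes being delayed by DB load would be enough to explain the 00:25:00 SIGTERM.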
Our scheduler configuration is (the default):

job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
run_duration = -1
min_file_process_interval = 0
dag_dir_list_interval = 300
print_stats_interval = 30
scheduler_zombie_task_threshold = 300
catchup_by_default = False
max_tis_per_query = 512
max_threads = 2
authenticate = False

Any help or pointers would be greatly appreciated,
Rob
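If the heartbeat theory holds, these are presumably the knobs involved; something like the following is what I would try loosening (the values here are illustrative guesses, not settings I have validated):

```ini
; airflow.cfg -- candidate changes, untested
[scheduler]
; allow more slack before a missed heartbeat matters
job_heartbeat_sec = 10
; give jobs longer before they are treated as zombies and SIGTERMed
scheduler_zombie_task_threshold = 600
```

I would welcome any correction if these settings are unrelated to the SIGTERMs we are seeing.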
