I haven't seen this in Jira, but in case it's a regression of an old issue, or
a corner case of us running some 1.9.0 patches on top of 1.8.0, I thought I'd
ask here first.
I have a dags directory with some 3000+ dags in it. About 2000 of them are
scheduled to start at the same time. I'm not sure I can change that much.
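If changing those schedules ever becomes an option, one way to spread the load
would be to derive a per-DAG minute offset from the dag_id, so the ~2000 DAGs
don't all fire in the same scheduler loop. A rough sketch of what I mean,
assuming the DAGs are generated programmatically (the helper name and the 0-59
spread are just for illustration):

import zlib
from datetime import datetime
from airflow import DAG

def staggered_daily_cron(dag_id):
    # Stable per-DAG minute offset, so the schedule doesn't move between
    # scheduler restarts the way hash() could with hash randomization.
    minute = zlib.crc32(dag_id.encode("utf-8")) % 60
    return "{} 0 * * *".format(minute)

dag = DAG(
    dag_id="dagen_s3_ods_addon_basic_to_sandbox_v1",
    start_date=datetime(2017, 11, 1),
    schedule_interval=staggered_daily_cron("dagen_s3_ods_addon_basic_to_sandbox_v1"),
)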
For a couple of hours after that start time I get an error in the scheduler
that ends up with it sending a SIGINT to all the child processes. About 20
seconds later DAG processing starts fresh again, so I think the scheduler gets
restarted. The trouble is that sometimes it doesn't get through all the
outstanding DAGs before the error kills processing again, leading to huge
inter-task gaps on random DAGs.
One example:
39d08a759c67a50ad55945acea0d7e3eb44b5da6fccd984a62db3cf8d59df749-json.log.2:{"log":"transaction.rollback(_capture_exception=True)\n","stream":"stderr","time":"2017-11-12T18:11:40.266105133Z"}
39d08a759c67a50ad55945acea0d7e3eb44b5da6fccd984a62db3cf8d59df749-json.log.2:{"log":"File \"/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py\", line 1402, in _handle_dbapi_exception\n","stream":"stderr","time":"2017-11-12T18:11:40.267115716Z"}
39d08a759c67a50ad55945acea0d7e3eb44b5da6fccd984a62db3cf8d59df749-json.log.2:{"log":"reraise(type(exception), exception, tb=exc_tb, cause=cause)\n","stream":"stderr","time":"2017-11-12T18:11:40.267451225Z"}
39d08a759c67a50ad55945acea0d7e3eb44b5da6fccd984a62db3cf8d59df749-json.log.2:{"log":"sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: u'UPDATE task_instance SET state=%s, queued_dttm=%s WHERE task_instance.task_id = %s AND task_instance.dag_id = %s AND task_instance.execution_date = %s'] [parameters: (u'queued', datetime.datetime(2017, 11, 12, 18, 11, 40, 152572), 'daily_load', 'dagen_s3_ods_addon_basic_to_sandbox_v1', datetime.datetime(2017, 11, 11, 16, 30))]\n","stream":"stderr","time":"2017-11-12T18:11:40.267871114Z"}
Reading that, it looks like SQLAlchemy and MySQL are saying: hey, there's a
lock on the InnoDB index records for that update, why don't you retry this?
And Airflow's response is to raise and exit.
I'm not sure why Airflow (models.py?) doesn't just retry that transaction.
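A bounded retry around that UPDATE looks like it would make the deadlock
harmless. A minimal sketch of the kind of thing I mean (this is not Airflow's
actual code; the function name, retry count and backoff are made up):

import time
from sqlalchemy.exc import OperationalError

MYSQL_DEADLOCK_ERRNO = 1213
MAX_RETRIES = 3

def run_with_deadlock_retry(session, update_fn):
    # Run update_fn(session) and commit, retrying only when MySQL reports
    # error 1213 ("Deadlock found ... try restarting transaction").
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            update_fn(session)
            session.commit()
            return
        except OperationalError as exc:
            session.rollback()
            # exc.orig is the raw _mysql_exceptions.OperationalError;
            # its first argument is the MySQL error number.
            if exc.orig.args[0] != MYSQL_DEADLOCK_ERRNO or attempt == MAX_RETRIES:
                raise
            time.sleep(0.5 * attempt)  # brief backoff before retrying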
I'm also not sure why there's lock contention at all, since the UPDATE
specifies exact matches on all three primary key columns. The execution_date
is highly non-unique, and almost every DAG has a "start", "daily_load" and
"end" task, so task_id isn't unique either. The combination with the dag_id,
though, should be unique and lock only a single row.
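If it helps with diagnosis, the two transactions and the exact index records
they were waiting on should be visible in the "LATEST DETECTED DEADLOCK"
section of SHOW ENGINE INNODB STATUS on the metadata database. Roughly
something like this to pull it out (the connection string is a placeholder):

from sqlalchemy import create_engine

engine = create_engine("mysql://airflow:PASSWORD@mysql-host/airflow")
with engine.connect() as conn:
    # Single row of (Type, Name, Status); the Status column holds the report.
    status = conn.execute("SHOW ENGINE INNODB STATUS").fetchone()[2]

start = status.find("LATEST DETECTED DEADLOCK")
print(status[start:status.find("TRANSACTIONS", start)])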
Has anyone seen this in their setup and resolved it?
Or does anyone have an idea how it should be addressed?
Thanks!
-Daniel