I haven't seen this in Jira, but in case it's a regression of an old issue, or a
corner case caused by us running some patches from 1.9.0 on top of 1.8.0, I
thought I'd ask here first.
I have a dags directory with some 3000+ dags in it. About 2000 of them are 
scheduled to start at the same time. I'm not sure I can change that much.
For a couple of hours after that start time I get an error in the scheduler that
ends with it sending a SIGINT to all of its child processes. About 20 seconds
later DAG processing starts fresh again, so I think the scheduler gets
restarted. The trouble is that sometimes it doesn't get through all the
outstanding dags before the error kills processing again, leading to huge
inter-task gaps on random dags:

One example:
39d08a759c67a50ad55945acea0d7e3eb44b5da6fccd984a62db3cf8d59df749-json.log.2:{"log":"
    
transaction.rollback(_capture_exception=True)\n","stream":"stderr","time":"2017-11-12T18:11:40.266105133Z"}
39d08a759c67a50ad55945acea0d7e3eb44b5da6fccd984a62db3cf8d59df749-json.log.2:{"log":"
  File \"/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py\", 
line 1402, in 
_handle_dbapi_exception\n","stream":"stderr","time":"2017-11-12T18:11:40.267115716Z"}
39d08a759c67a50ad55945acea0d7e3eb44b5da6fccd984a62db3cf8d59df749-json.log.2:{"log":"
    reraise(type(exception), exception, tb=exc_tb, 
cause=cause)\n","stream":"stderr","time":"2017-11-12T18:11:40.267451225Z"}
39d08a759c67a50ad55945acea0d7e3eb44b5da6fccd984a62db3cf8d59df749-json.log.2:{"log":"sqlalchemy.exc.OperationalError:
 (_mysql_exceptions.OperationalError) (1213, 'Deadlock found when trying to get 
lock; try restarting transaction') [SQL: u'UPDATE task_instance SET state=%s, 
queued_dttm=%s WHERE task_instance.task_id = %s AND task_instance.dag_id = %s 
AND task_instance.execution_date = %s'] [parameters: (u'queued', 
datetime.datetime(2017, 11, 12, 18, 11, 40, 152572), 'daily_load', 
'dagen_s3_ods_addon_basic_to_sandbox_v1', datetime.datetime(2017, 11, 11, 16, 
30))]\n","stream":"stderr","time":"2017-11-12T18:11:40.267871114Z"}

Reading that, it looks like SQLAlchemy and MySQL are saying, hey, there's a lock
in the InnoDB table index for that update, why don't you retry this
transaction? And Airflow responds by raising the exception and exiting instead.

I'm not sure why Airflow (models.py?) doesn't retry that.
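
Just to make concrete what I mean by "retry": something like the minimal sketch
below, which retries a commit a few times when MySQLdb reports error 1213. This
is only an illustration, not Airflow's actual code path, and commit_with_retry
is a name I made up:

    # Sketch only: retry a commit when MySQL reports a deadlock (error 1213).
    # "session" is any SQLAlchemy session; names here are illustrative.
    import time
    from sqlalchemy.exc import OperationalError

    MYSQL_DEADLOCK = 1213

    def commit_with_retry(session, retries=3, backoff=0.5):
        for attempt in range(retries):
            try:
                session.commit()
                return
            except OperationalError as e:
                session.rollback()
                # e.orig is the underlying MySQLdb exception; args[0] is the
                # MySQL error code.
                code = getattr(e.orig, "args", (None,))[0]
                if code != MYSQL_DEADLOCK or attempt == retries - 1:
                    raise
                time.sleep(backoff * (attempt + 1))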
I'm also not sure why there's a lock conflict at all, since the UPDATE specifies
exact matches for all three primary key columns. The execution_date is highly
non-unique, and almost every dag has a "start", "daily_load" and "end" task, so
task_id isn't unique either. The combination with the dag_id, though, should be
unique and lock only a single row.
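
In case it helps anyone reproduce or diagnose this, here is a small diagnostic
sketch (the connection details are placeholders for my setup) that dumps the
last deadlock InnoDB recorded, which should show exactly which two statements
and which index locks collided:

    # Diagnostic sketch: print the "LATEST DETECTED DEADLOCK" section from
    # InnoDB's status output. Connection parameters are placeholders.
    import MySQLdb

    conn = MySQLdb.connect(host="airflow-db", user="airflow",
                           passwd="secret", db="airflow")
    cur = conn.cursor()
    cur.execute("SHOW ENGINE INNODB STATUS")
    status = cur.fetchone()[2]  # row is (Type, Name, Status)
    start = status.find("LATEST DETECTED DEADLOCK")
    print(status[start:start + 4000] if start != -1 else "no deadlock recorded")
    conn.close()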

Has anyone seen this in their setup and resolved it?
Or does anyone have an idea how it should be addressed?

Thanks!
-Daniel
