Hi Andrew, Something is indeed keeping a lock on the database and not freeing it. If it crashes can you find out what is exactly being locked by following the guidelines in this post: https://serverfault.com/questions/36260/how-can-i-show-mysql-locks <https://serverfault.com/questions/36260/how-can-i-show-mysql-locks>
As locks typically are very short (ms) I am wondering if you are sure that you are running just *one* scheduler and not trying to start multiple at the same time? Another option is that one of the tasks is misbehaving. Having the process list and associated PIDs (and programs) is key in tracking this down. Bolke > On 25 Oct 2017, at 12:42, Andrew Maguire <[email protected]> wrote: > > Hi, > > I'm not really sure exactly what is causing this by i'm noticing lots of dead > periods on our airflow server. Like this: > > > > And it corresponds with no work being actually done if i look at the database: > > > When i look in supervisor i typically see: > > > So the airflow scheduler seems to keep being restarted by supervisor. > > Then if i look in a recent stderr log from the scheduler i see: > Traceback (most recent call last): > File "/usr/local/bin/airflow", line 27, in <module> > args.func(args) > File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 828, > in scheduler > job.run() > File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 197, in > run > self._execute() > File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 1496, > in _execute > self._execute_helper(processor_manager) > File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 1541, > in _execute_helper > self.reset_state_for_orphaned_tasks(session=session) > File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 50, > in wrapper > result = func(*args, **kwargs) > File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 265, in > reset_state_for_orphaned_tasks > .filter(or_(*filter_for_tis), TI.state.in_(resettable_states)) > File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2679, in > all > File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2831, in > __iter__ > File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2854, in > _execute_and_instances > File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 945, in > execute > File "build/bdist.linux-x86_64/egg/sqlalchemy/sql/elements.py", line 263, > in _execute_on_connection > File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1053, > in _execute_clauseelement > File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1189, > in _execute_context > File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1393, > in _handle_dbapi_exception > File "build/bdist.linux-x86_64/egg/sqlalchemy/util/compat.py", line 203, in > raise_from_cause > File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1182, > in _execute_context > File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/default.py", line 470, > in do_execute > File "/usr/local/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 205, > in execute > self.errorhandler(self, exc, value) > File "/usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py", line > 36, in defaulterrorhandler > raise errorclass, errorvalue > sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1205, > 'Lock wait timeout exceeded; try restarting transaction') [SQL: u'SELECT > task_instance.task_id AS task_instance_task_id, ... > I'm not really sure what might be causing this or some ways to go about > debugging it. > I have been struggling through it by having a crontab do: > # reboot system a few times a day > 45 1,13 * * * /sbin/shutdown -r +5 > So rebooting the system a few times a day and then it typically comes back to > life and sorts itself out (if that's a technical term :) ) > So i'm now going to do the reboot every 4 hours as i've been caught a few > times when it goes into an error state a good while before the new reboot > from cron. > However i know this is a really horrible solution and i really should try > figure out what the root cause is and a better solution. > So i'm wondering if you might have any ideas or suggestions as to what might > be going on? > I'm running airplot 1.9.0alpha0 in a single google compute instance (Ubuntu > 16.04.3 LTS (GNU/Linux 4.4.0-97-generic x86_64) - 24 vCPUs, 100 GB memory) > with LocalExecutor as pretty much all our dags are just sequences of bigquery > operators. > Also using a Google Cloud MySql database. > It looks like some sort of lock when trying to update the database but i'm > not sure if that's just a symptom or more related to a cause. > Any help is much appreciated as i'm a data scientist and not a data engineer > so admittedly a bit (quite) out of my depth here. > Cheers, > Andy
