Yeah, that makes sense - I think in my rush to upgrade to 1.9 I at some stage ran "airflow upgradedb" and it just did nothing (and then I killed it). I think I also ran it while the scheduler was still active, reading and writing to the database. It was also a day after I upgraded with pip that I realised I needed to run upgradedb, so all in all I think I messed up the upgrade.
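
For next time, the sequence I probably should have followed is roughly the below. Just a rough sketch - the supervisor program names, mysql host/user and target version are placeholders for our setup, not anything official:

    # stop anything that talks to the airflow metadata db first (scheduler, webserver, workers)
    # "airflow-scheduler" / "airflow-webserver" are placeholders for whatever the supervisor programs are called
    supervisorctl stop airflow-scheduler airflow-webserver

    # take a backup of the metadata db before touching it (host/user/db name are placeholders)
    mysqldump -h <mysql-host> -u <user> -p airflow > airflow_backup_$(date +%F).sql

    # upgrade the package, then run the schema migration with nothing else connected to the db
    pip install "apache-airflow==<target-version>"
    airflow upgradedb

    # only once upgradedb has finished cleanly, bring things back up
    supervisorctl start airflow-scheduler airflow-webserver
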
I have done an uninstall and reinstall to 1.8.2 and just pointed it at a new database :) seems to be ok now. Think I'll need to be a bit more careful when upgrading, especially when it requires a database change. Will probably stay on 1.8.2 for now as we're hoping to get a data engineer in to help with all this.

Cheers,
Andy

On Wed, Oct 25, 2017 at 8:11 PM Joy Gao <[email protected]> wrote:

> I've had something similar happen to me in the past (`airflow upgradedb`
> stuck forever waiting for a lock on a model). Because some migration
> scripts require querying the db, it got stuck waiting for the lock on a
> table.
>
> For me it had to do with calling a model's class method in a dag file,
> which threw an error after an ORM query, and did not properly release the
> lock on the table. As Bolke mentioned, I was able to track down the exact
> locking table/query by using `show engine innodb status;`.
>
> Hope it helps!
>
> On Wed, Oct 25, 2017 at 4:41 AM, Bolke de Bruin <[email protected]> wrote:
>
> > Hi Andrew,
> >
> > Something is indeed keeping a lock on the database and not freeing it. If
> > it crashes, can you find out what exactly is being locked by following the
> > guidelines in this post:
> > https://serverfault.com/questions/36260/how-can-i-show-mysql-locks
> >
> > As locks typically are very short (ms), I am wondering if you are sure that
> > you are running just *one* scheduler and not trying to start multiple at
> > the same time? Another option is that one of the tasks is misbehaving.
> >
> > Having the process list and associated PIDs (and programs) is key in
> > tracking this down.
> >
> > Bolke
> >
> >
> > > On 25 Oct 2017, at 12:42, Andrew Maguire <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > I'm not really sure exactly what is causing this, but I'm noticing lots
> > > of dead periods on our airflow server. Like this:
> > >
> > > [screenshot not included]
> > >
> > > And it corresponds with no work actually being done if I look at the
> > > database:
> > >
> > > [screenshot not included]
> > >
> > > When I look in supervisor I typically see:
> > >
> > > [screenshot not included]
> > >
> > > So the airflow scheduler seems to keep being restarted by supervisor.
> > >
> > > Then if I look in a recent stderr log from the scheduler I see:
> > >
> > > Traceback (most recent call last):
> > >   File "/usr/local/bin/airflow", line 27, in <module>
> > >     args.func(args)
> > >   File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 828, in scheduler
> > >     job.run()
> > >   File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 197, in run
> > >     self._execute()
> > >   File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 1496, in _execute
> > >     self._execute_helper(processor_manager)
> > >   File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 1541, in _execute_helper
> > >     self.reset_state_for_orphaned_tasks(session=session)
> > >   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 50, in wrapper
> > >     result = func(*args, **kwargs)
> > >   File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 265, in reset_state_for_orphaned_tasks
> > >     .filter(or_(*filter_for_tis), TI.state.in_(resettable_states))
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2679, in all
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2831, in __iter__
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2854, in _execute_and_instances
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 945, in execute
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1189, in _execute_context
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1393, in _handle_dbapi_exception
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/util/compat.py", line 203, in raise_from_cause
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1182, in _execute_context
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/default.py", line 470, in do_execute
> > >   File "/usr/local/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 205, in execute
> > >     self.errorhandler(self, exc, value)
> > >   File "/usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
> > >     raise errorclass, errorvalue
> > > sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: u'SELECT task_instance.task_id AS task_instance_task_id, ...
> > >
> > > I'm not really sure what might be causing this or some ways to go about
> > > debugging it.
> > >
> > > I have been struggling through it by having a crontab do:
> > >
> > >   # reboot system a few times a day
> > >   45 1,13 * * * /sbin/shutdown -r +5
> > >
> > > So rebooting the system a few times a day, and then it typically comes
> > > back to life and sorts itself out (if that's a technical term :) ).
> > >
> > > So I'm now going to do the reboot every 4 hours, as I've been caught a
> > > few times when it goes into an error state a good while before the next
> > > reboot from cron.
> > >
> > > However I know this is a really horrible solution and I really should
> > > try to figure out what the root cause is and a better solution.
> > >
> > > So I'm wondering if you might have any ideas or suggestions as to what
> > > might be going on?
> > >
> > > I'm running airflow 1.9.0alpha0 on a single Google Compute Engine instance
> > > (Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-97-generic x86_64) - 24 vCPUs, 100 GB
> > > memory) with the LocalExecutor, as pretty much all our dags are just
> > > sequences of bigquery operators.
> > >
> > > Also using a Google Cloud SQL (MySQL) database.
> > >
> > > It looks like some sort of lock when trying to update the database, but
> > > I'm not sure if that's just a symptom or more related to a cause.
> > >
> > > Any help is much appreciated as I'm a data scientist and not a data
> > > engineer, so admittedly a bit (quite) out of my depth here.
> > >
> > > Cheers,
> > > Andy
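
P.S. Noting down here for next time the lock inspection Bolke and Joy suggested. Just a rough sketch against the metadata db - the host/user are placeholders for our Cloud SQL instance, and the information_schema tables assume MySQL 5.6/5.7:

    # show currently running queries and their thread ids, to spot anything holding a connection open
    mysql -h <mysql-host> -u <user> -p -e "SHOW FULL PROCESSLIST;"

    # dump the InnoDB status, whose TRANSACTIONS section shows lock waits
    mysql -h <mysql-host> -u <user> -p -e "SHOW ENGINE INNODB STATUS\G"

    # list waiting vs blocking transactions (information_schema tables, MySQL 5.6/5.7)
    mysql -h <mysql-host> -u <user> -p -e "
      SELECT r.trx_mysql_thread_id AS waiting_thread, r.trx_query AS waiting_query,
             b.trx_mysql_thread_id AS blocking_thread, b.trx_query AS blocking_query
      FROM information_schema.innodb_lock_waits w
      JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
      JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;"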
