Hi,

I'm not really sure exactly what is causing this, but I'm noticing lots of
dead periods on our Airflow server, like this:

[image: image.png]

And it corresponds with no work actually being done if I look at the
database:

[image: image.png]
When I look in supervisor I typically see:

[image: image.png]
So the Airflow scheduler seems to keep being restarted by supervisor.

Then if I look in a recent stderr log from the scheduler I see:

Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 27, in <module>
    args.func(args)
  File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 828, in scheduler
    job.run()
  File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 197, in run
    self._execute()
  File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 1496, in _execute
    self._execute_helper(processor_manager)
  File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 1541, in _execute_helper
    self.reset_state_for_orphaned_tasks(session=session)
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 50, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 265, in reset_state_for_orphaned_tasks
    .filter(or_(*filter_for_tis), TI.state.in_(resettable_states))
  File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2679, in all
  File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2831, in __iter__
  File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2854, in _execute_and_instances
  File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 945, in execute
  File "build/bdist.linux-x86_64/egg/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
  File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
  File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1189, in _execute_context
  File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1393, in _handle_dbapi_exception
  File "build/bdist.linux-x86_64/egg/sqlalchemy/util/compat.py", line 203, in raise_from_cause
  File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1182, in _execute_context
  File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/default.py", line 470, in do_execute
  File "/usr/local/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: u'SELECT task_instance.task_id AS task_instance_task_id, ...

I'm not really sure what might be causing this, or how to go about
debugging it.
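
One idea I had (not tested yet) is to poke at the MySQL side while it's
stuck, to see which transaction is actually holding the lock. Roughly
something like the sketch below, using the same MySQLdb driver that
appears in the traceback - the host/user/password/db values are just
placeholders for our Cloud SQL instance:

# Untested sketch: list which transactions are waiting on a lock and
# which transactions are blocking them, via the MySQL 5.x
# information_schema tables. Connection details are placeholders.
import MySQLdb

conn = MySQLdb.connect(host="CLOUD_SQL_IP", user="airflow_user",
                       passwd="PLACEHOLDER", db="airflow")
cur = conn.cursor()
cur.execute("""
    SELECT r.trx_mysql_thread_id AS waiting_thread,
           r.trx_query           AS waiting_query,
           b.trx_mysql_thread_id AS blocking_thread,
           b.trx_query           AS blocking_query,
           b.trx_started         AS blocking_started
    FROM information_schema.innodb_lock_waits w
    JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
    JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
""")
for row in cur.fetchall():
    print(row)
conn.close()

If nothing else, that should show whether it's another Airflow connection
blocking the scheduler's query, or something else entirely - though I'm
not sure how to interpret whatever comes back.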

I have been struggling through it by having a crontab do:

# reboot system a few times a day
45 1,13 * * * /sbin/shutdown -r +5

So I'm rebooting the system twice a day, after which it typically comes
back to life and sorts itself out (if that's a technical term :) ).

I'm now going to move the reboot to every 4 hours, as I've been caught
out a few times when it goes into this error state a good while before
the next reboot from cron.
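
Concretely, I was just planning to swap the crontab entry for something
along these lines (schedule not final):

# reboot system every 4 hours
45 */4 * * * /sbin/shutdown -r +5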

However, I know this is a really horrible solution, and I really should
try to figure out the root cause and find a better fix.

So I'm wondering if you might have any ideas or suggestions as to what
might be going on?

I'm running Airflow 1.9.0alpha0 on a single Google Compute Engine
instance (Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-97-generic x86_64),
24 vCPUs, 100 GB memory) with the LocalExecutor, as pretty much all our
DAGs are just sequences of BigQuery operators.

I'm also using a Google Cloud SQL (MySQL) database as the Airflow metadata database.
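
For reference, the relevant bits of my airflow.cfg look roughly like this
(the connection string here is a placeholder, not the real credentials):

[core]
executor = LocalExecutor
sql_alchemy_conn = mysql://USER:PASSWORD@CLOUD_SQL_IP/airflow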

It looks like some sort of lock contention when the scheduler tries to
update the database, but I'm not sure if that's just a symptom or closer
to the actual cause.

Any help is much appreciated, as I'm a data scientist and not a data
engineer, so I'm admittedly a bit (quite) out of my depth here.

Cheers,

Andy
