Hello Bolke,

Thanks a lot for your answer. We are using Airflow version 1.7.0, so that
might explain the issue; I will try upgrading to a newer version. I looked
at the DB metrics and there never seemed to be more than 20 open
connections, which is under the limit set for the DB, but I will look at
the DB configuration again.
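
For reference, this is more or less what I have been running against
Postgres to compare open connections with the server limit (standard
pg_stat_activity / max_connections, nothing Airflow specific; 'airflow'
below is just what we happened to name the database):

-- open connections to the Airflow database, grouped by state
SELECT state, count(*) FROM pg_stat_activity WHERE datname = 'airflow' GROUP BY state;
-- the connection limit configured on the server
SHOW max_connections;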

I expected your question about the SequentialExecutor :) Our limitations
for running jobs in parallel right now are not Airflow related, but come
from our cluster configuration. We are using Airflow to schedule Spark
jobs, and when multiple jobs are scheduled in parallel our ResourceManager
only assigns resources to one at a time, and eventually hangs and needs
restarting. We are working on this, and once we have it figured out we
will switch to the LocalExecutor. However, I would expect the
SequentialExecutor to have a smaller impact on the number of open
connections to the DB than the LocalExecutor would, or is that not the case?
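
Once that is sorted out, the change on our side would mostly be in
airflow.cfg, roughly like the sketch below. The values are only examples
we would still have to tune, and I am not sure sql_alchemy_pool_size even
exists in 1.7.0, so please treat that line as an assumption:

[core]
executor = LocalExecutor
# keep task parallelism low until the ResourceManager problem is solved (example value)
parallelism = 4
# cap the SQLAlchemy connection pool so the scheduler cannot exhaust Postgres
# (assumption: this key may not be available in older releases)
sql_alchemy_pool_size = 5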

Thanks again for the help and advice,

Cheers,

Tamara

On Thu, Jul 14, 2016 at 7:44 PM, Bolke de Bruin <[email protected]> wrote:

> Hi Tamara,
>
> Please supply version information.
>
> With regards to your issue, a “connection timed out” normally means that
> the database server became unreachable or too busy, so I would look at the
> db first. Are your connections being exhausted? (Older versions of Airflow
> were not too great at closing connections and we still have work to do in
> that area.) If you run something older than 1.7.1.3 you could consider
> using “--num-runs X”, which makes the scheduler quit after X runs; then
> supervisord needs to restart the scheduler, obviously.
>
> Moreover, why are you using the SequentialExecutor? The LocalExecutor will
> allow you to parallelize your tasks and scale vertically; the CeleryExecutor
> will do the same but also allow you to scale horizontally, at the cost of a
> slightly more complex setup.
>
> Regards,
> Bolke
>
> > On 14 Jul 2016, at 11:44, Tamara Mendt <[email protected]> wrote:
> >
> > Hello,
> >
> > Sorry for writing this in the dev list, but as there is no user list yet
> > I decided this is the best place. We are currently running Airflow with
> > a SequentialExecutor and a Postgres DB in the backend. We run the
> > airflow scheduler and webserver using supervisor so that they should be
> > automatically restarted if either fails.
> >
> > Normally this setup works fine. However, we have noticed that sometimes
> > the scheduler stops scheduling jobs and only starts rescheduling them if
> > we manually restart it from supervisor. I could see this message in the
> > airflow scheduler error logs, so the reason the scheduler stops
> > scheduling seems to be related to the connection to the DB:
> >
> > <class 'sqlalchemy.exc.DatabaseError'> (psycopg2.DatabaseError)
> > SSL SYSCALL error: Connection timed out
> > [SQL: 'UPDATE job SET latest_heartbeat=%(latest_heartbeat)s
> > WHERE job.id = %(job_id)s']
> > [parameters: {'latest_heartbeat': datetime.datetime(2016, 7, 11, 10, 26, 7, 44521),
> > 'job_id': 10246}]
> >
> > Also, when I look for the job id in the Airflow DB I can see the
> > following:
> >
> >   id   | dag_id |  state  |   job_type   |         start_date         | end_date |      latest_heartbeat      |   executor_class
> > -------+--------+---------+--------------+----------------------------+----------+----------------------------+--------------------
> >  10246 |        | running | SchedulerJob | 2016-07-08 15:38:06.911346 |          | 2016-07-14 05:30:56.407149 | SequentialExecutor
> > The latest heartbeat corresponds to the moment when the scheduler stopped
> > scheduling jobs. Our supervisor configuration for the scheduler is the
> > following:
> >
> > [program:airflow-scheduler]
> > command= airflow scheduler
> > autostart=true
> > autorestart=true
> > startretries=3
> > stderr_logfile=/var/logs/airflow-logs/airflow-scheduler.err.log
> > stdout_logfile=/var/logs/airflow-logs/airflow-scheduler.out.log
> >
> > I have added these two lines now to the supervisor configuration in case
> > the problem was that supervisor was not detecting that the scheduler had
> > quit:
> >
> > stopsignal=QUIT
> > stopasgroup=true
> >
> > If anyone has had a similar problem, or has any ideas on how we could
> > avoid the need to manually restart the scheduler, and on what could be
> > causing the scheduler to stop in the first place, any pointers would be
> > much appreciated.
> >
> > Cheers,
>
