Hey Teresa,

I'm running Airflow on Kubernetes just to make sure we have process
isolation. Even if one of the processes crashes (for example on
out-of-memory), Kubernetes just restarts it. Maybe something to consider.
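
For illustration, a minimal sketch of what that looks like with the
kubernetes Python client, assuming a placeholder image name and made-up
memory figures; the point is that a hard memory limit means a runaway
scheduler gets OOM-killed and restarted instead of starving the host:

    # Hypothetical sketch using the kubernetes Python client; the image
    # name, memory figures and restart policy are illustrative assumptions.
    from kubernetes import client

    scheduler = client.V1Container(
        name="airflow-scheduler",
        image="my-registry/airflow:latest",  # placeholder image
        command=["airflow", "scheduler"],
        resources=client.V1ResourceRequirements(
            requests={"memory": "1Gi"},  # expected footprint
            limits={"memory": "2Gi"},    # hard cap: exceeding it means an OOM kill
        ),
    )

    pod_spec = client.V1PodSpec(
        containers=[scheduler],
        restart_policy="Always",  # Kubernetes brings the crashed process back up
    )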

On Tue, Dec 27, 2016 at 12:36 PM Teresa Fontanella De Santis <[email protected]> wrote:

> Bolke,
>
> Thanks for the answer!
>
> You are right about some other process taking up all the remaining memory.
> But sometimes it runs out of memory and sometimes it doesn't.
>
> We are running Airflow (the scheduler, the webserver and the DAG execution)
> on an m4.xlarge EC2 instance (we have 4 workers), using the LocalExecutor.
> In this case, would it be better to use the CeleryExecutor?
>
> Thanks again!
>
> 2016-12-26 18:41 GMT-03:00 Bolke de Bruin <[email protected]>:
>
> > We don't handle this kind of error in Airflow, so it becomes a hard error
> > and Airflow bails out.
> >
> > You are most likely running out of memory because some other process is
> > taking up all that remains. Are you running workers on the same machine?
> > Their memory usage will go up and down over time depending on the jobs you
> > launch.
> >
> > This is not related to "restarting the scheduler" (which is kind of
> > outdated anyway).
> >
> > Bolke
> >
> > Sent from my iPhone
> >
> > > On 26 Dec 2016, at 21:47, Teresa Fontanella De Santis <[email protected]> wrote:
> > >
> > > Hi everyone!
> > >
> > > We were running the scheduler without problems for a while. We are using
> > > an m4.xlarge EC2 instance. We were running it with plain airflow scheduler
> > > (no supervisord, no monit, etc.).
> > > Suddenly, the scheduler stopped, showing this message:
> > >
> > > [2016-12-22 21:01:15,038] {jobs.py:574} INFO - Prioritizing 1 queued jobs
> > > [2016-12-22 21:01:15,041] {jobs.py:603} INFO - Pool None has 128 slots, 1 task instances in queue
> > > [2016-12-22 21:01:15,041] {models.py:154} INFO - Filling up the DagBag from /home/ec2-user/analytics/airflow/dags
> > > [2016-12-22 21:01:15,155] {jobs.py:726} INFO - Starting 2 scheduler jobs
> > > [2016-12-22 21:01:15,157] {jobs.py:761} ERROR - [Errno 12] Cannot allocate memory
> > > Traceback (most recent call last):
> > >   File "/usr/local/lib/python3.5/site-packages/airflow/jobs.py", line 728, in _execute
> > >     j.start()
> > >   File "/usr/lib64/python3.5/multiprocessing/process.py", line 105, in start
> > >     self._popen = self._Popen(self)
> > >   File "/usr/lib64/python3.5/multiprocessing/context.py", line 212, in _Popen
> > >     return _default_context.get_context().Process._Popen(process_obj)
> > >   File "/usr/lib64/python3.5/multiprocessing/context.py", line 267, in _Popen
> > >     return Popen(process_obj)
> > >   File "/usr/lib64/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
> > >     self._launch(process_obj)
> > >   File "/usr/lib64/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
> > >     self.pid = os.fork()
> > > OSError: [Errno 12] Cannot allocate memory
> > >
> > > Traceback (most recent call last):
> > >   File "/usr/local/bin/airflow", line 15, in <module>
> > >
> > >
> > > The DAGs which failed didn't show any logs (they weren't stored on the
> > > Airflow instance and there are no remote logs), so we have no idea of what
> > > happened (only that there was not enough memory to fork).
> > > It is well known that it is recommended to restart the scheduler
> > > periodically (according to this
> > > <https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb#.80c6g1n1s>),
> > > but... do you have any idea why this can happen? Is there something we can
> > > do (or some bug we can fix)?
> > >
> > >
> > > Thanks in advance!
> >
>
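
On the LocalExecutor vs. CeleryExecutor question above: a minimal sketch, in
terms of Airflow's AIRFLOW__<SECTION>__<KEY> environment-variable overrides,
of the settings involved in moving task execution off the scheduler host.
The broker URL is a placeholder and the exact option names vary between
Airflow versions:

    # Illustrative sketch only; placeholder values throughout.
    import os

    # Switch the executor so tasks are queued to Celery workers instead of
    # being forked locally by the scheduler process.
    os.environ["AIRFLOW__CORE__EXECUTOR"] = "CeleryExecutor"

    # Point Airflow at a message broker; workers on separate machines are
    # started with `airflow worker` and pick tasks up from this queue.
    os.environ["AIRFLOW__CELERY__BROKER_URL"] = "redis://broker-host:6379/0"  # placeholder
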
-- 
  _/
_/ Alex Van Boxel
