Hey Teresa, I'm running Airflow on Kubernetes just to make sure we have process isolation. Even if one of the processes crashes (e.g. with an out-of-memory error), Kubernetes just restarts it. Maybe something to consider.
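For context, the "[Errno 12] Cannot allocate memory" in your trace is raised when the scheduler forks a child job process via multiprocessing and the kernel cannot allocate memory for the fork; nothing catches the OSError, so the scheduler bails out. A rough sketch of the mechanism (not Airflow's actual code; run_job is just a stand-in for whatever the child process would do):

    import errno
    import multiprocessing

    def run_job():
        pass  # stand-in for the work the child process would execute

    try:
        p = multiprocessing.Process(target=run_job)
        p.start()  # on Linux this calls os.fork(), which is where ENOMEM surfaces
    except OSError as e:
        if e.errno == errno.ENOMEM:
            # Nothing sensible to do in-process except back off or exit;
            # an external supervisor (systemd, supervisord, Kubernetes)
            # is what brings the scheduler back up after it exits.
            raise

On Kubernetes you get that supervision for free, and you can set a memory limit per pod so one runaway worker can't starve the scheduler.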
On Tue, Dec 27, 2016 at 12:36 PM Teresa Fontanella De Santis <[email protected]> wrote:

> Bolke,
>
> Thanks for the answer!
>
> You are right about the process that takes up all the remaining memory. But
> sometimes it runs out of memory and sometimes it doesn't.
>
> We are running Airflow (the scheduler, the webserver and the DAG execution)
> on an EC2 m4.xlarge instance (we have 4 workers), using the LocalExecutor.
> In this case, would it be better to use the CeleryExecutor?
>
> Thanks again!
>
> 2016-12-26 18:41 GMT-03:00 Bolke de Bruin <[email protected]>:
>
> > We don't handle this kind of error in Airflow, so it becomes a hard error
> > and Airflow bails out.
> >
> > You are most likely running out of memory because some other process is
> > taking up all that remains. Are you running workers on the same machine?
> > Their memory usage will go up and down over time depending on the jobs
> > you launch.
> >
> > This is not related to "restarting the scheduler" (which is kind of
> > outdated advice anyway).
> >
> > Bolke
> >
> > Sent from my iPhone
> >
> > > On 26 Dec 2016, at 21:47, Teresa Fontanella De Santis <[email protected]> wrote:
> > >
> > > Hi everyone!
> > >
> > > We had been running the scheduler without problems for a while, on an
> > > EC2 instance (m4.xlarge), with plain `airflow scheduler` (no supervisord,
> > > no monit, etc.).
> > > Suddenly, the scheduler stopped, showing this message:
> > >
> > > [2016-12-22 21:01:15,038] {jobs.py:574} INFO - Prioritizing 1 queued jobs
> > > [2016-12-22 21:01:15,041] {jobs.py:603} INFO - Pool None has 128 slots, 1 task instances in queue
> > > [2016-12-22 21:01:15,041] {models.py:154} INFO - Filling up the DagBag from /home/ec2-user/analytics/airflow/dags
> > > [2016-12-22 21:01:15,155] {jobs.py:726} INFO - Starting 2 scheduler jobs
> > > [2016-12-22 21:01:15,157] {jobs.py:761} ERROR - [Errno 12] Cannot allocate memory
> > > Traceback (most recent call last):
> > >   File "/usr/local/lib/python3.5/site-packages/airflow/jobs.py", line 728, in _execute
> > >     j.start()
> > >   File "/usr/lib64/python3.5/multiprocessing/process.py", line 105, in start
> > >     self._popen = self._Popen(self)
> > >   File "/usr/lib64/python3.5/multiprocessing/context.py", line 212, in _Popen
> > >     return _default_context.get_context().Process._Popen(process_obj)
> > >   File "/usr/lib64/python3.5/multiprocessing/context.py", line 267, in _Popen
> > >     return Popen(process_obj)
> > >   File "/usr/lib64/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
> > >     self._launch(process_obj)
> > >   File "/usr/lib64/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
> > >     self.pid = os.fork()
> > > OSError: [Errno 12] Cannot allocate memory
> > >
> > > Traceback (most recent call last):
> > >   File "/usr/local/bin/airflow", line 15, in <module>
> > >
> > > The DAGs that failed didn't leave any log (nothing was stored on the
> > > Airflow instance and there are no remote logs), so we have no idea what
> > > could have happened, only that there was not enough memory to fork.
> > > It is well known that restarting the scheduler periodically is
> > > recommended (according to this
> > > <https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb#.80c6g1n1s>),
> > > but... do you have any idea why this can happen? Is there something we
> > > can do (or some bug we can fix)?
> > >
> > > Thanks in advance!

--
_/ _/ Alex Van Boxel
