We had been using 1.7 for over a year and never faced this issue. The moment we switched to 1.8, I think we hit this issue. The reason I say "I think" is that I am not sure it is the same issue, but whenever I restart the scheduler, my pipeline proceeds.
Airflow 1.7:
Having said that, in 1.7 I did face a similar issue (fewer than 5 times over a year):
- I saw that there were a lot of processes marked "<defunct>" with the parent process being "scheduler".
- Somebody mentioned it in this JIRA -> https://issues.apache.org/jira/browse/AIRFLOW-401
- Workaround: restart the scheduler.

Airflow 1.8:
The issue in 1.8 may be different than the issue in 1.7, but again the issue gets resolved and the pipeline progresses on a SCHEDULER RESTART.

If it may help, this is the trace in 1.8:

[2017-03-22 19:35:16,332] {models.py:167} INFO - Filling up the DagBag from /usr/local/airflow/pipeline/pipeline.py
[2017-03-22 19:35:22,451] {airflow_configuration.py:40} INFO - loading setup.cfg file
[2017-03-22 19:35:51,041] {timeout.py:37} ERROR - Process timed out
[2017-03-22 19:35:51,041] {models.py:266} ERROR - Failed to import: /usr/local/airflow/pipeline/pipeline.py
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 263, in process_file
    m = imp.load_source(mod_name, filepath)
  File "/usr/local/airflow/pipeline/pipeline.py", line 167, in <module>
    create_tasks(dbguid, version, dag, override_start_date)
  File "/usr/local/airflow/pipeline/pipeline.py", line 104, in create_tasks
    t = create_task(dbguid, dag, taskInfo, version, override_date)
  File "/usr/local/airflow/pipeline/pipeline.py", line 85, in create_task
    retries, 1, depends_on_past, version, override_dag_date)
  File "/usr/local/airflow/pipeline/dags/base_pipeline.py", line 90, in create_python_operator
    depends_on_past=depends_on_past)
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/decorators.py", line 86, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/airflow/operators/python_operator.py", line 65, in __init__
    super(PythonOperator, self).__init__(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/decorators.py", line 70, in wrapper
    sig = signature(func)
  File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 105, in signature
    return Signature.from_function(obj)
  File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 594, in from_function
    __validate_parameters__=False)
  File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 518, in __init__
    for param in parameters))
  File "/usr/lib/python2.7/collections.py", line 52, in __init__
    self.__update(*args, **kwds)
  File "/usr/lib/python2.7/_abcoll.py", line 548, in update
    self[key] = value
  File "/usr/lib/python2.7/collections.py", line 61, in __setitem__
    last[1] = root[0] = self.__map[key] = [last, root, key]
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/timeout.py", line 38, in handle_timeout
    raise AirflowTaskTimeout(self.error_message)
AirflowTaskTimeout: Timeout
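
The trace shows DagBag.process_file hitting the DAG import timeout while loading pipeline.py (the default dagbag_import_timeout under [core] is 30 seconds, which lines up with the timestamps above). As a rough check, something like the sketch below (the module name is arbitrary, and it assumes pipeline.py can be imported standalone on the scheduler host) shows how long the file really takes to import, without the timeout alarm:

    # time_dag_import.py - rough diagnostic sketch, Python 2.7 on the Airflow host
    import imp
    import time

    DAG_FILE = '/usr/local/airflow/pipeline/pipeline.py'

    start = time.time()
    # Import the file the same way DagBag.process_file does (imp.load_source),
    # but without the dagbag_import_timeout alarm, to see the real import time.
    imp.load_source('pipeline_timing_check', DAG_FILE)
    print('import took %.1fs' % (time.time() - start))

If the import regularly comes close to 30 seconds, raising dagbag_import_timeout in airflow.cfg or trimming the work create_tasks does at import time may be worth doing independently of the scheduler hang itself.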
On Fri, Mar 24, 2017 at 5:45 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> We are running *without* num runs for over a year (and never have). It is
> a very elusive issue which has not been reproducible.
>
> I like more info on this but it needs to be very elaborate even to the
> point of access to the system exposing the behavior.
>
> Bolke
>
> Sent from my iPhone
>
> > On 24 Mar 2017, at 16:04, Vijay Ramesh <vi...@change.org> wrote:
> >
> > We literally have a cron job that restarts the scheduler every 30 min. Num
> > runs didn't work consistently in rc4, sometimes it would restart itself and
> > sometimes we'd end up with a few zombie scheduler processes and things
> > would get stuck. Also running locally, without celery.
> >
> >> On Mar 24, 2017 16:02, <lro...@quartethealth.com> wrote:
> >>
> >> We have max runs set and still hit this. Our solution is dumber:
> >> monitoring log output, and kill the scheduler if it stops emitting. Works
> >> like a charm.
> >>
> >>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu <fhakan.ko...@gmail.com> wrote:
> >>>
> >>> Some solutions to this problem is restarting the scheduler frequently or
> >>> some sort of monitoring on the scheduler. We have set up a dag that pings
> >>> cronitor <https://cronitor.io/> (a dead man's snitch type of service) every
> >>> 10 minutes and the snitch pages you when the scheduler dies and does not
> >>> send a ping to it.
> >>>
> >>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphill...@qrmedia.com> wrote:
> >>>
> >>>> We use celery and run into it from time to time.
> >>>>
> >>>> Bang goes my theory ;-) At least, assuming it's the same underlying
> >>>> cause...
> >>>>
> >>>> Regards
> >>>>
> >>>> ap
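
For reference, a minimal sketch of the "dead man's snitch" DAG F. Hakan describes might look like the following (the dag id, schedule and snitch URL are placeholders, and it assumes the requests library is available where the task runs):

    # scheduler_heartbeat.py - rough sketch of a scheduler dead man's snitch DAG
    from datetime import datetime, timedelta

    import requests
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # Placeholder: substitute the ping URL your monitoring service gives you.
    SNITCH_URL = 'https://example.invalid/scheduler-heartbeat'

    def ping_snitch():
        # If the scheduler stops scheduling, this ping stops arriving and
        # the monitoring service pages you.
        requests.get(SNITCH_URL, timeout=10)

    dag = DAG(
        dag_id='scheduler_heartbeat',
        start_date=datetime(2017, 3, 1),
        schedule_interval=timedelta(minutes=10),
        catchup=False,  # available in 1.8; avoids backfilling every 10-minute interval
    )

    PythonOperator(
        task_id='ping_snitch',
        python_callable=ping_snitch,
        dag=dag,
    )

As long as the scheduler keeps scheduling, the ping arrives every 10 minutes; when it stops, the alert fires, which gives the same effect as the log-output monitoring mentioned above.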