Hi Harish,

The below does *not* indicate a scheduler hang; it is a valid exception, as mentioned earlier.
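For context: the trace you pasted is the dag file import timeout firing, not the scheduler wedging. If a DAG file takes longer to parse than the dagbag_import_timeout configured in airflow.cfg (30 seconds by default, if I remember correctly), the scheduler abandons that file and raises AirflowTaskTimeout. Roughly, the mechanism looks like the sketch below; this is only an illustration of the SIGALRM-based timeout, not Airflow's actual code:

    import signal

    class ImportTimeout(Exception):
        pass

    class timeout(object):
        """Raise ImportTimeout if the wrapped block runs longer than `seconds`."""

        def __init__(self, seconds, error_message='Timeout'):
            self.seconds = int(seconds)
            self.error_message = error_message

        def handle_timeout(self, signum, frame):
            raise ImportTimeout(self.error_message)

        def __enter__(self):
            # Arrange for SIGALRM to fire after `seconds` and abort the block.
            signal.signal(signal.SIGALRM, self.handle_timeout)
            signal.alarm(self.seconds)

        def __exit__(self, exc_type, exc_value, tb):
            signal.alarm(0)  # cancel the pending alarm

    # The scheduler wraps DAG file parsing in something along these lines:
    # with timeout(dagbag_import_timeout):
    #     m = imp.load_source(mod_name, filepath)

So the fix is to make pipeline.py cheaper to import (it seems to do a fair amount of work at module level) or to raise dagbag_import_timeout, rather than restarting the scheduler.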
Bolke.

> On 24 Mar 2017, at 19:07, harish singh <harish.sing...@gmail.com> wrote:
>
> We have been using (1.7) over a year and never faced this issue.
> The moment we switched to 1.8, I think we have hit this issue.
> The reason why I say "I think" is because I am not sure if it is the same
> issue. But whenever I restart, my pipeline proceeds.
>
> Airflow 1.7:
> Having said that, in 1.7 I did face a similar issue (less than 5 times over
> a year): I saw that there were a lot of processes marked "<defunct>" with
> the parent process being "scheduler".
>
> Somebody mentioned it in this jira ->
> https://issues.apache.org/jira/browse/AIRFLOW-401
> Workaround: restart the scheduler.
>
> Airflow 1.8:
> Now the issue in 1.8 may be different than the issue in 1.7, but again the
> issue gets solved and the pipeline progresses on a SCHEDULER RESTART.
>
> If it may help, this is the trace in 1.8:
>
> [2017-03-22 19:35:16,332] {models.py:167} INFO - Filling up the DagBag from /usr/local/airflow/pipeline/pipeline.py
> [2017-03-22 19:35:22,451] {airflow_configuration.py:40} INFO - loading setup.cfg file
> [2017-03-22 19:35:51,041] {timeout.py:37} ERROR - Process timed out
> [2017-03-22 19:35:51,041] {models.py:266} ERROR - Failed to import: /usr/local/airflow/pipeline/pipeline.py
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 263, in process_file
>     m = imp.load_source(mod_name, filepath)
>   File "/usr/local/airflow/pipeline/pipeline.py", line 167, in <module>
>     create_tasks(dbguid, version, dag, override_start_date)
>   File "/usr/local/airflow/pipeline/pipeline.py", line 104, in create_tasks
>     t = create_task(dbguid, dag, taskInfo, version, override_date)
>   File "/usr/local/airflow/pipeline/pipeline.py", line 85, in create_task
>     retries, 1, depends_on_past, version, override_dag_date)
>   File "/usr/local/airflow/pipeline/dags/base_pipeline.py", line 90, in create_python_operator
>     depends_on_past=depends_on_past)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/decorators.py", line 86, in wrapper
>     result = func(*args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/operators/python_operator.py", line 65, in __init__
>     super(PythonOperator, self).__init__(*args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/decorators.py", line 70, in wrapper
>     sig = signature(func)
>   File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 105, in signature
>     return Signature.from_function(obj)
>   File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 594, in from_function
>     __validate_parameters__=False)
>   File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 518, in __init__
>     for param in parameters))
>   File "/usr/lib/python2.7/collections.py", line 52, in __init__
>     self.__update(*args, **kwds)
>   File "/usr/lib/python2.7/_abcoll.py", line 548, in update
>     self[key] = value
>   File "/usr/lib/python2.7/collections.py", line 61, in __setitem__
>     last[1] = root[0] = self.__map[key] = [last, root, key]
>   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/timeout.py", line 38, in handle_timeout
>     raise AirflowTaskTimeout(self.error_message)
> AirflowTaskTimeout: Timeout
>
> On Fri, Mar 24, 2017 at 5:45 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
>> We are running *without* num runs for over a year (and never have). It is
>> a very elusive issue which has not been reproducible.
>>
>> I'd like more info on this, but it needs to be very elaborate, even to
>> the point of access to the system exposing the behavior.
>>
>> Bolke
>>
>> Sent from my iPhone
>>
>>> On 24 Mar 2017, at 16:04, Vijay Ramesh <vi...@change.org> wrote:
>>>
>>> We literally have a cron job that restarts the scheduler every 30 min.
>>> Num runs didn't work consistently in rc4; sometimes it would restart
>>> itself and sometimes we'd end up with a few zombie scheduler processes
>>> and things would get stuck. Also running locally, without celery.
>>>
>>>> On Mar 24, 2017 16:02, <lro...@quartethealth.com> wrote:
>>>>
>>>> We have max runs set and still hit this. Our solution is dumber:
>>>> monitoring log output, and killing the scheduler if it stops emitting.
>>>> Works like a charm.
>>>>
>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu <fhakan.ko...@gmail.com> wrote:
>>>>>
>>>>> Some solutions to this problem are restarting the scheduler frequently
>>>>> or some sort of monitoring on the scheduler. We have set up a dag that
>>>>> pings cronitor <https://cronitor.io/> (a dead man's snitch type of
>>>>> service) every 10 minutes, and the snitch pages you when the scheduler
>>>>> dies and does not send a ping to it.
>>>>>
>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphill...@qrmedia.com>
>>>>> wrote:
>>>>>
>>>>>>> We use celery and run into it from time to time.
>>>>>>
>>>>>> Bang goes my theory ;-) At least, assuming it's the same underlying
>>>>>> cause...
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> ap
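For what it's worth, a canary dag along the lines Hakan describes can be as small as the sketch below. The ping URL, dag_id, owner and start_date are placeholders, and it assumes the 1.8-style PythonOperator import path; adapt it to whatever dead man's snitch service you use:

    from datetime import datetime

    import requests

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # Placeholder: use the ping URL your monitoring service gives you.
    PING_URL = 'https://cronitor.link/YOUR-MONITOR-CODE/run'

    def ping_monitor():
        # If the scheduler stops scheduling, these pings stop and the
        # monitoring service pages you.
        requests.get(PING_URL, timeout=10)

    dag = DAG(
        dag_id='scheduler_canary',
        start_date=datetime(2017, 3, 1),
        schedule_interval='*/10 * * * *',  # every 10 minutes
        default_args={'owner': 'airflow', 'retries': 0},
    )

    PythonOperator(
        task_id='ping_monitor',
        python_callable=ping_monitor,
        dag=dag,
    )

The nice property is that it exercises the whole scheduling path, so it also catches the "scheduler process alive but not scheduling" case that a plain process check misses.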
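The log-watchdog approach is even simpler to sketch: check how long ago the scheduler log was last written, and restart the scheduler if it has gone quiet. The log path and restart command below are assumptions and will differ per deployment; run something like this from cron every few minutes:

    import os
    import subprocess
    import time

    SCHEDULER_LOG = '/var/log/airflow/scheduler.log'                  # placeholder path
    STALE_AFTER_SECONDS = 300                                         # 5 minutes of silence
    RESTART_CMD = ['supervisorctl', 'restart', 'airflow-scheduler']   # placeholder command

    def log_is_stale():
        # The mtime only advances while the scheduler is still writing output.
        return time.time() - os.path.getmtime(SCHEDULER_LOG) > STALE_AFTER_SECONDS

    if __name__ == '__main__':
        if log_is_stale():
            subprocess.check_call(RESTART_CMD)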