For 1.8 and the issue you are seeing, you might want to try increasing:

dagbag_import_timeout under [core], which defaults to 30 (seconds).
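
In airflow.cfg that would look something like this (the value 120 is only an
example; set it to whatever your slowest DAG file needs):

    [core]
    # default import timeout is 30 seconds
    dagbag_import_timeout = 120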

This reminds me that timeouts implemented this way (via signals) cannot be
used in child processes, which might explain the defunct processes, so please
test whether that works.
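
For context, the import timeout is signal based, roughly like the sketch
below (a simplification in the spirit of airflow/utils/timeout.py, not the
exact code):

    import signal

    class timeout(object):
        # Arm SIGALRM on entry, raise on expiry, disarm on exit.
        # Signal handlers can only be installed in the main thread of the
        # main interpreter, which is the limitation mentioned above.
        def __init__(self, seconds=1, error_message='Timeout'):
            self.seconds = seconds
            self.error_message = error_message

        def handle_timeout(self, signum, frame):
            raise RuntimeError(self.error_message)

        def __enter__(self):
            signal.signal(signal.SIGALRM, self.handle_timeout)
            signal.alarm(self.seconds)

        def __exit__(self, exc_type, exc_value, tb):
            signal.alarm(0)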

Bolke 

Sent from my iPhone

> On 24 Mar 2017, at 19:07, harish singh <harish.sing...@gmail.com> wrote:
> 
> We have been using (1.7) over a year and never faced this issue.
> The moment we switched to 1.8, I think we have hit this issue.
> The reason why I say "I think" is that I am not sure if it is the same
> issue. But whenever I restart, my pipeline proceeds.
> 
> 
> 
> Airflow 1.7:
> Having said that, in 1.7 I did face a similar issue (fewer than 5 times
> over a year): I saw that there were a lot of processes marked "<defunct>"
> with the parent process being "scheduler".
> 
> Somebody mentioned it in this jira ->
> https://issues.apache.org/jira/browse/AIRFLOW-401
> Workaround: restart the scheduler.
> 
> 
> 
> 
> Airflow 1.8:
> Now the issue in 1.8 may be different from the issue in 1.7, but again the
> issue gets solved and the pipeline progresses on a SCHEDULER RESTART.
> 
> In case it helps, this is the trace in 1.8:
> [2017-03-22 19:35:16,332] {models.py:167} INFO - Filling up the DagBag from /usr/local/airflow/pipeline/pipeline.py
> [2017-03-22 19:35:22,451] {airflow_configuration.py:40} INFO - loading setup.cfg file
> [2017-03-22 19:35:51,041] {timeout.py:37} ERROR - Process timed out
> [2017-03-22 19:35:51,041] {models.py:266} ERROR - Failed to import: /usr/local/airflow/pipeline/pipeline.py
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 263, in process_file
>     m = imp.load_source(mod_name, filepath)
>   File "/usr/local/airflow/pipeline/pipeline.py", line 167, in <module>
>     create_tasks(dbguid, version, dag, override_start_date)
>   File "/usr/local/airflow/pipeline/pipeline.py", line 104, in create_tasks
>     t = create_task(dbguid, dag, taskInfo, version, override_date)
>   File "/usr/local/airflow/pipeline/pipeline.py", line 85, in create_task
>     retries, 1, depends_on_past, version, override_dag_date)
>   File "/usr/local/airflow/pipeline/dags/base_pipeline.py", line 90, in create_python_operator
>     depends_on_past=depends_on_past)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/decorators.py", line 86, in wrapper
>     result = func(*args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/operators/python_operator.py", line 65, in __init__
>     super(PythonOperator, self).__init__(*args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/decorators.py", line 70, in wrapper
>     sig = signature(func)
>   File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 105, in signature
>     return Signature.from_function(obj)
>   File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 594, in from_function
>     __validate_parameters__=False)
>   File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 518, in __init__
>     for param in parameters))
>   File "/usr/lib/python2.7/collections.py", line 52, in __init__
>     self.__update(*args, **kwds)
>   File "/usr/lib/python2.7/_abcoll.py", line 548, in update
>     self[key] = value
>   File "/usr/lib/python2.7/collections.py", line 61, in __setitem__
>     last[1] = root[0] = self.__map[key] = [last, root, key]
>   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/timeout.py", line 38, in handle_timeout
>     raise AirflowTaskTimeout(self.error_message)
> AirflowTaskTimeout: Timeout
> 
> 
> 
> 
>> On Fri, Mar 24, 2017 at 5:45 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>> 
>> We have been running *without* num runs for over a year (in fact we have
>> never used it). It is a very elusive issue that has not been reproducible.
>> 
>> I would like more info on this, but it needs to be very elaborate, even to
>> the point of access to the system exposing the behavior.
>> 
>> Bolke
>> 
>> Sent from my iPhone
>> 
>>> On 24 Mar 2017, at 16:04, Vijay Ramesh <vi...@change.org> wrote:
>>> 
>>> We literally have a cron job that restarts the scheduler every 30 min.
>>> Num runs didn't work consistently in rc4; sometimes it would restart
>>> itself and sometimes we'd end up with a few zombie scheduler processes
>>> and things would get stuck. Also running locally, without celery.
>>> 
>>>> On Mar 24, 2017 16:02, <lro...@quartethealth.com> wrote:
>>>> 
>>>> We have max runs set and still hit this. Our solution is dumber:
>>>> monitor the log output and kill the scheduler if it stops emitting.
>>>> Works like a charm.
>>>> 
>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu <fhakan.ko...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>> Some solutions to this problem are restarting the scheduler frequently
>>>>> or setting up some sort of monitoring on the scheduler. We have set up
>>>>> a dag that pings cronitor <https://cronitor.io/> (a dead man's snitch
>>>>> type of service) every 10 minutes, and the snitch pages you when the
>>>>> scheduler dies and stops sending pings.
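>>>>> 
>>>>> Roughly, such a canary dag can be as small as this sketch (the monitor
>>>>> URL is a placeholder you would get from cronitor):
>>>>> 
>>>>>    from datetime import datetime, timedelta
>>>>> 
>>>>>    from airflow import DAG
>>>>>    from airflow.operators.bash_operator import BashOperator
>>>>> 
>>>>>    # runs every 10 minutes; if the scheduler dies, the pings stop
>>>>>    # and cronitor alerts you
>>>>>    dag = DAG(
>>>>>        'scheduler_canary',
>>>>>        start_date=datetime(2017, 1, 1),
>>>>>        schedule_interval=timedelta(minutes=10),
>>>>>    )
>>>>> 
>>>>>    ping = BashOperator(
>>>>>        task_id='ping_cronitor',
>>>>>        bash_command='curl -fsS https://cronitor.link/MONITOR_ID/complete',
>>>>>        dag=dag,
>>>>>    )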
>>>>> 
>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphill...@qrmedia.com>
>>>>> wrote:
>>>>> 
>>>>>>> We use celery and run into it from time to time.
>>>>>> 
>>>>>> Bang goes my theory ;-) At least, assuming it's the same underlying
>>>>>> cause...
>>>>>> 
>>>>>> Regards
>>>>>> 
>>>>>> ap
>>>>>> 
>>>> 
>> 
