Any more info from grepping that log file?

G
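As a concrete illustration of the grep suggested further down this thread, here is what it looks like against a fabricated sample of scheduler output. The sample path and log lines below are made up; the real file is wherever the console output of "airflow scheduler" was redirected.

```shell
# Fabricated sample of scheduler output -- the real file is wherever
# your "airflow scheduler" console output goes.
cat > /tmp/scheduler_sample.log <<'EOF'
LocalWorker running airflow run example_dag task_a 2017-03-27
failed to execute task Command 'airflow run example_dag task_b ...' returned non-zero exit status 1:
EOF

# Count the failures the LocalWorker exception handler would have logged.
grep -c "failed to execute task" /tmp/scheduler_sample.log
```

A count of zero here means the workers died without ever reaching that exception handler, which points at the stdout/stderr question raised below.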
On Mon, Mar 27, 2017 at 9:26 PM, Nicholas Hodgkinson <
nik.hodgkin...@collectivehealth.com> wrote:

> from airflow.cfg:
>
> [core]
> ...
> executor = LocalExecutor
> parallelism = 32
> dag_concurrency = 16
> dags_are_paused_at_creation = True
> non_pooled_task_slot_count = 128
> max_active_runs_per_dag = 16
> ...
>
> Pretty much the defaults; I've never tweaked these values.
>
> -N
> nik.hodgkin...@collectivehealth.com
>
> On Mon, Mar 27, 2017 at 12:12 PM, Gerard Toonstra <gtoons...@gmail.com>
> wrote:
>
> > So it looks like the LocalWorkers are dying. Airflow does not recover
> > from that.
> >
> > In SchedulerJob (jobs.py), you can see the "_execute_helper" function.
> > This calls "executor.start()", which is implemented in
> > local_executor.py in your case.
> >
> > The LocalExecutor is thus an object owned by the SchedulerJob. This
> > executor creates x (parallelism) LocalWorkers, which derive from a
> > multiprocessing.Process class. So the "extra" processes you see on the
> > scheduler are those LocalWorkers as child processes. The LocalWorkers
> > create additional processes through a shell ("subprocess.check_call"
> > with shell=True), which are the things doing the actual work.
> >
> > Before that, on my 'master' here, the LocalWorker issues a
> > self.logger.info("{} running {}"), which you can find in the general
> > output of the scheduler log file. When starting the scheduler with
> > "airflow scheduler", it's what gets printed on the console and starts
> > with "Starting the scheduler". That is the file you want to
> > investigate.
> >
> > If anything bad happens with general processing, then it prints a:
> >
> >     self.logger.error("failed to execute task {}:".format(str(e)))
> >
> > in the exception handler. I'd grep for that "failed to execute task"
> > in the scheduler log file I mentioned.
> >
> > I'm not sure where stdout/stderr go for these workers.
> > If the call basically succeeded, but there were issues with the queue
> > handling, then I'd expect this to go to stderr instead. I'm not 100%
> > sure if that gets sent to the same scheduler log file or whether it
> > goes nowhere because of it being a child process (they're probably
> > inherited?).
> >
> > One further question: what's your parallelism set to? I see 22
> > zombies left behind. Is that your setting?
> >
> > Let us know!
> >
> > Rgds,
> >
> > Gerard
> >
> > On Mon, Mar 27, 2017 at 8:13 PM, harish singh <harish.sing...@gmail.com>
> > wrote:
> >
> > > 1.8: increasing DAGBAG_IMPORT_TIMEOUT helps. I don't see the issue.
> > > (Although I'm not sure why task progress has become slow? But that's
> > > not the issue we are discussing here, so I am ignoring it.)
> > >
> > > 1.7: our prod is running 1.7 and we haven't seen the "defunct
> > > process" issue for more than a week now. But we saw something very
> > > close to what Nicholas provided (LocalExecutor; we do not use
> > > --num-runs). Not sure if a cpu/memory limit may lead to this issue.
> > > Often when we hit this issue (which stalled the pipeline), we either
> > > increased the memory and/or moved airflow to a bulkier (cpu)
> > > instance.
> > >
> > > Sorry for a late reply. Was out of town over the weekend.
> > >
> > > On Mon, Mar 27, 2017 at 10:47 AM, Nicholas Hodgkinson <
> > > nik.hodgkin...@collectivehealth.com> wrote:
> > >
> > > > 1.7.1.3, however it seems this is still an issue in 1.8 according
> > > > to other posters. I'll upgrade today.
> > > > Yes, LocalExecutor.
> > > > Will remove -n 10
> > > >
> > > > -N
> > > > nik.hodgkin...@collectivehealth.com
> > > >
> > > > On Mon, Mar 27, 2017 at 10:40 AM, Bolke de Bruin <bdbr...@gmail.com>
> > > > wrote:
> > > >
> > > > > Is this:
> > > > >
> > > > > 1. On 1.8.0? 1.7.1 is not supported anymore.
> > > > > 2. LocalExecutor?
> > > > >
> > > > > You are running with N=10; can you try running without it?
> > > > >
> > > > > B.
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > > On 27 Mar 2017, at 10:28, Nicholas Hodgkinson <nik.hodgkinson@
> > > > > > collectivehealth.com> wrote:
> > > > > >
> > > > > > Ok, I'm not sure how helpful this is and I'm working on
> > > > > > getting some more information, but here's some preliminary
> > > > > > data:
> > > > > >
> > > > > > Process tree (`ps axjf`):
> > > > > >    1 2391 2391 2391 ? -1 Ssl 999 0:13 /usr/bin/python /usr/local/bin/airflow scheduler -n 10
> > > > > > 2391 2435 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2436 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2437 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2438 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2439 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2440 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2441 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2442 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2443 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2444 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2454 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2456 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2457 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2458 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2459 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2460 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2461 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2462 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2463 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2464 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2465 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391 2466 2391 2391 ? -1 Z   999 0:00  \_ [/usr/bin/python] <defunct>
> > > > > >
> > > > > > # gdb python 2391
> > > > > > Reading symbols from python...Reading symbols from
> > > > > > /usr/lib/debug//usr/bin/python2.7...done.
> > > > > > done.
> > > > > > Attaching to program: /usr/bin/python, process 2391
> > > > > > Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading
> > > > > > symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
> > > > > > done.
> > > > > > Loaded symbols for /lib64/ld-linux-x86-64.so.2
> > > > > > 0x00007f0c1bbb9670 in ?? ()
> > > > > > (gdb) bt
> > > > > > #0  0x00007f0c1bbb9670 in ?? ()
> > > > > > #1  0x00007f0c1bf1a000 in ?? ()
> > > > > > #2  0x00007f0c12099b45 in ?? ()
> > > > > > #3  0x00000000032dbe00 in ?? ()
> > > > > > #4  0x0000000000000000 in ?? ()
> > > > > > (gdb) py-bt
> > > > > > (gdb) py-list
> > > > > > Unable to locate python frame
> > > > > >
> > > > > > I know that's not super helpful, but it's information; I've
> > > > > > also tried pyrasite, but got nothing from it of any use. This
> > > > > > problem occurs for me very often and I'm happy to provide a
> > > > > > modified environment in which to capture info if anyone has a
> > > > > > suggestion. For now I need to restart my process and get my
> > > > > > jobs running again.
> > > > > >
> > > > > > -N
> > > > > > nik.hodgkin...@collectivehealth.com
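The LocalWorker pattern Gerard describes higher up in the thread (a multiprocessing.Process that pulls shell commands off a queue and runs them via subprocess.check_call with shell=True) can be sketched roughly as follows. This is a simplified illustration, not Airflow's actual code; the queue handling, result strings, and "poison pill" shutdown are assumptions made for the demo.

```python
# Simplified sketch of the LocalWorker pattern described in the thread.
# NOT Airflow's actual code: queue handling and result strings are
# invented for illustration. Requires POSIX shell commands (true/false).
import multiprocessing
import subprocess


class LocalWorker(multiprocessing.Process):
    """Pulls shell commands off a queue and runs them in a shell."""

    def __init__(self, task_queue, result_queue):
        super(LocalWorker, self).__init__()
        self.task_queue = task_queue
        self.result_queue = result_queue

    def run(self):
        while True:
            command = self.task_queue.get()
            if command is None:  # poison pill: shut this worker down
                self.task_queue.task_done()
                break
            try:
                subprocess.check_call(command, shell=True)
                self.result_queue.put("success")
            except subprocess.CalledProcessError:
                # analogous to the "failed to execute task" error path
                self.result_queue.put("failed")
            self.task_queue.task_done()


def run_demo():
    tasks = multiprocessing.JoinableQueue()
    results = multiprocessing.Queue()
    workers = [LocalWorker(tasks, results) for _ in range(2)]
    for w in workers:
        w.start()
    tasks.put("true")   # a command that succeeds
    tasks.put("false")  # exits non-zero, triggering the error path
    for _ in workers:
        tasks.put(None)
    tasks.join()
    outcomes = sorted(results.get() for _ in range(2))
    # Reap the workers: dead, unreaped children are exactly what shows
    # up as [/usr/bin/python] <defunct> in the ps listing above.
    for w in workers:
        w.join()
    return outcomes


if __name__ == "__main__":
    print(run_demo())  # -> ['failed', 'success']
```

The `<defunct>` entries in the process tree above are workers that exited but were never reaped by the scheduler process, which matches Gerard's observation that the scheduler does not recover once its LocalWorkers die.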