Any more info from grepping that log file?

G>
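In case it helps, a minimal sketch of pulling those lines out of the log programmatically (the path and the sample log contents here are placeholders, not real Airflow output -- point it at wherever your scheduler actually writes its log):

```python
# Scan a scheduler log for the "failed to execute task" error line.
# The path and log contents below are made-up sample data so this
# sketch is self-contained; substitute your real scheduler log file.
log_path = "scheduler.log"

# Write a tiny fake log so the example runs end to end.
with open(log_path, "w") as f:
    f.write("Starting the scheduler\n")
    f.write("LocalWorker running some task command\n")
    f.write("failed to execute task <example error>:\n")

with open(log_path) as f:
    hits = [line.rstrip() for line in f if "failed to execute task" in line]

for hit in hits:
    print(hit)
```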

On Mon, Mar 27, 2017 at 9:26 PM, Nicholas Hodgkinson <
nik.hodgkin...@collectivehealth.com> wrote:

> from airflow.cfg:
>
> [core]
> ...
> executor = LocalExecutor
> parallelism = 32
> dag_concurrency = 16
> dags_are_paused_at_creation = True
> non_pooled_task_slot_count = 128
> max_active_runs_per_dag = 16
> ...
>
> Pretty much the defaults; I've never tweaked these values.
>
>
>
> -N
> nik.hodgkin...@collectivehealth.com
>
> On Mon, Mar 27, 2017 at 12:12 PM, Gerard Toonstra <gtoons...@gmail.com>
> wrote:
>
> > So it looks like the LocalWorkers are dying. Airflow does not recover
> > from that.
> >
> >
> > In SchedulerJob (jobs.py), you can see the "_execute_helper" function.
> > This calls "executor.start()", which in your case is implemented in
> > local_executor.py.
> >
> > The LocalExecutor is thus an object owned by the SchedulerJob. This
> > executor creates x (parallelism) LocalWorkers, which derive from the
> > multiprocessing.Process class. So the "extra" processes you see on the
> > scheduler are those LocalWorkers running as child processes. The
> > LocalWorkers create additional processes through a shell
> > ("subprocess.check_call" with shell=True), which are the things doing
> > the actual work.
> >
> >
> > Before that, on my 'master' here, the LocalWorker issues a
> > self.logger.info("{} running {}"), which you can find in the general
> > output of the scheduler log file. When starting the scheduler with
> > "airflow scheduler", that is what gets printed on the console and
> > starts with "Starting the scheduler". That is the file you want to
> > investigate.
> >
> > If anything bad happens with general processing, then it prints a:
> >
> >     self.logger.error("failed to execute task {}:".format(str(e)))
> >
> > in the exception handler. I'd grep for that "failed to execute task"
> > in the scheduler log file I mentioned.
> >
> >
> > I'm not sure where stdout/stderr go for these workers. If the call
> > basically succeeded, but there were issues with the queue handling,
> > then I'd expect this to go to stderr instead. I'm not 100% sure
> > whether that gets sent to the same scheduler log file or goes nowhere
> > because it's a child process (the descriptors are probably inherited?).
> >
> >
> > One further question: what's your parallelism set to? I see 22 zombies
> > left behind. Does that match your setting?
> >
> > Let us know!
> >
> > Rgds,
> >
> > Gerard
> >
> >
> >
> > On Mon, Mar 27, 2017 at 8:13 PM, harish singh <harish.sing...@gmail.com>
> > wrote:
> >
> > > 1.8:  increasing DAGBAG_IMPORT_TIMEOUT helps. I don't see the issue
> > > (although I'm not sure why task progress has become slow? But that's
> > > not the issue we are discussing here, so I am ignoring it.)
> > >
> > > 1.7:  our prod is running 1.7 and we haven't seen the "defunct
> > > process" issue for more than a week now. But we saw something very
> > > close to what Nicholas provided (LocalExecutor; we do not use
> > > --num-runs). Not sure if a cpu/memory limit may lead to this issue.
> > > Often when we hit this issue (which stalled the pipeline), we either
> > > increased the memory and/or moved airflow to a bulkier (cpu) instance.
> > >
> > > Sorry for a late reply. Was out of town over the weekend.
> > >
> > >
> > >
> > > On Mon, Mar 27, 2017 at 10:47 AM, Nicholas Hodgkinson <
> > > nik.hodgkin...@collectivehealth.com> wrote:
> > >
> > > > 1.7.1.3, however it seems this is still an issue in 1.8 according
> > > > to other posters. I'll upgrade today.
> > > > Yes, LocalExecutor.
> > > > Will remove -n 10.
> > > >
> > > > -N
> > > > nik.hodgkin...@collectivehealth.com
> > > >
> > > >
> > > > On Mon, Mar 27, 2017 at 10:40 AM, Bolke de Bruin <bdbr...@gmail.com>
> > > > wrote:
> > > >
> > > > > Is this:
> > > > >
> > > > > 1. On 1.8.0? 1.7.1 is not supported anymore.
> > > > > 2. LocalExecutor?
> > > > >
> > > > > You are running with -n 10; can you try running without it?
> > > > >
> > > > > B.
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > > On 27 Mar 2017, at 10:28, Nicholas Hodgkinson <nik.hodgkinson@
> > > > > collectivehealth.com> wrote:
> > > > > >
> > > > > > Ok, I'm not sure how helpful this is and I'm working on
> > > > > > getting some more information, but here's some preliminary data:
> > > > > >
> > > > > > Process tree (`ps axjf`):
> > > > > >    1  2391  2391  2391 ?           -1 Ssl    999   0:13 /usr/bin/python usr/local/bin/airflow scheduler -n 10
> > > > > > 2391  2435  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2436  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2437  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2438  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2439  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2440  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2441  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2442  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2443  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2444  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2454  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2456  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2457  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2458  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2459  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2460  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2461  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2462  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2463  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2464  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2465  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > > 2391  2466  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > >
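An aside on the `<defunct>` entries in that tree: those are zombie processes -- children that have exited but whose parent never called wait() to reap them. A tiny self-contained demonstration of how one appears (Linux-only, since it reads /proc):

```python
# Demonstrate how a <defunct> (zombie) process arises: the child exits,
# but until the parent calls waitpid() the kernel keeps the exit status
# around and the process shows up in state Z. Linux-only (uses /proc).
import os
import time

pid = os.fork()
if pid == 0:
    os._exit(0)  # child exits immediately

time.sleep(0.5)  # parent deliberately has not reaped the child yet

with open("/proc/{}/status".format(pid)) as f:
    for line in f:
        if line.startswith("State:"):
            state = line.split()[1]  # 'Z' while the child is a zombie
            break

print("child state before reaping:", state)
os.waitpid(pid, 0)  # reap the child; the zombie entry disappears
```

This matches the symptom in the thread: if the scheduler process is wedged and never reaps its LocalWorker children, they linger in state Z exactly as shown by `ps axjf`.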
> > > > > > # gdb python 2391
> > > > > > Reading symbols from python...Reading symbols from
> > > > > > /usr/lib/debug//usr/bin/python2.7...done.
> > > > > > done.
> > > > > > Attaching to program: /usr/bin/python, process 2391
> > > > > > Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
> > > > > > done.
> > > > > > Loaded symbols for /lib64/ld-linux-x86-64.so.2
> > > > > > 0x00007f0c1bbb9670 in ?? ()
> > > > > > (gdb) bt
> > > > > > #0  0x00007f0c1bbb9670 in ?? ()
> > > > > > #1  0x00007f0c1bf1a000 in ?? ()
> > > > > > #2  0x00007f0c12099b45 in ?? ()
> > > > > > #3  0x00000000032dbe00 in ?? ()
> > > > > > #4  0x0000000000000000 in ?? ()
> > > > > > (gdb) py-bt
> > > > > > (gdb) py-list
> > > > > > Unable to locate python frame
> > > > > >
> > > > > > I know that's not super helpful, but it's information; I've
> > > > > > also tried pyrasite, but got nothing of any use from it. This
> > > > > > problem occurs for me very often and I'm happy to provide a
> > > > > > modified environment in which to capture info if anyone has a
> > > > > > suggestion. For now I need to restart my process and get my
> > > > > > jobs running again.
> > > > > >
> > > > > > -N
> > > > > > nik.hodgkin...@collectivehealth.com
>
>
