Sorry Marc, I must be tired; I responded to the wrong thread. I was thinking of Airflow CPU usage in the browser.
Apologies.

Best,
Arthur

On Fri, Aug 4, 2017 at 2:26 PM, Marc Weil <[email protected]> wrote:

This particular behavior I was reporting isn't related to the browser. The
tasks remain in the incorrect state in the database as well.

On Fri, Aug 4, 2017 at 5:15 PM, Arthur Wiedmer <[email protected]> wrote:

I have noticed the issue more prominently with Chrome after v54, and I was
wondering whether that was related at all. Somehow, Airflow seems kinder on
Firefox. Does anyone else have a similar experience?

Best,
Arthur

On Fri, Aug 4, 2017 at 12:02 PM, Marc Weil <[email protected]> wrote:

This is on version 1.8.0. I don't recall seeing this behavior when I was
running 1.7.x.

On Fri, Aug 4, 2017 at 2:58 PM, Bolke de Bruin <[email protected]> wrote:

What version of Airflow? Off the top of my head, for 1.8.x:

1) we do use DB locking
2) we check the state after we get the lock
3) I don't think the task sets a state if it finds out it is running
   somewhere else

Maybe we do something at the executor/scheduler level. I need to
investigate that, if this issue is on a recent version.

Bolke

On 4 Aug 2017, at 19:24, George Leslie-Waksman <[email protected]> wrote:

Pretty sure (not 100%) what is happening is:

1. Scheduler bugs result in the task getting scheduled twice
2. Worker 1 grabs the task
3. Worker 2 grabs the task
4. Worker 1 starts the task
5. Worker 2 starts the task
6. Worker 2 sees that Worker 1 has started and plans to abort
7. Worker 1 finishes and marks the task as done
8. Worker 2 finishes aborting and marks the task as not done
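George's sequence plus Bolke's points (1) and (2) describe a lost-update race
and its usual guard: take a row-level lock on the task instance, then re-check
its state before writing. Below is a minimal, self-contained sketch of that
pattern, assuming SQLAlchemy and a deliberately simplified stand-in for the
task_instance table; this is illustrative only, not Airflow's actual code.

    # Minimal sketch of "lock the row, then re-check state before writing".
    # Not Airflow's actual code; the model below is a simplified stand-in
    # for the task_instance table.
    from sqlalchemy import Column, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class TaskInstance(Base):
        __tablename__ = "task_instance"      # illustrative only
        task_id = Column(String, primary_key=True)
        dag_id = Column(String, primary_key=True)
        execution_date = Column(String, primary_key=True)
        state = Column(String)               # queued / running / success ...

    engine = create_engine("sqlite://")      # FOR UPDATE is a no-op on SQLite;
    Base.metadata.create_all(engine)         # use Postgres/MySQL for real locks
    Session = sessionmaker(bind=engine)

    def try_to_claim(session, task_id, dag_id, execution_date):
        """Return True only if this worker won the right to run the task."""
        ti = (
            session.query(TaskInstance)
            .filter_by(task_id=task_id, dag_id=dag_id,
                       execution_date=execution_date)
            .with_for_update()               # a second worker blocks here
            .one()
        )
        if ti.state not in (None, "queued", "up_for_retry"):
            session.commit()                 # someone else already claimed it;
            return False                     # do not overwrite their state
        ti.state = "running"
        session.commit()                     # commit releases the row lock
        return True

    session = Session()
    session.add(TaskInstance(task_id="t1", dag_id="d1",
                             execution_date="2017-07-28", state="queued"))
    session.commit()
    print(try_to_claim(session, "t1", "d1", "2017-07-28"))   # True
    print(try_to_claim(session, "t1", "d1", "2017-07-28"))   # False, taken

The same re-check-before-write idea, applied again when Worker 2 aborts at
step 8, would keep it from overwriting the success Worker 1 already recorded.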
On Fri, Jul 28, 2017 at 3:50 PM, Marc Weil <[email protected]> wrote:

Hey Max,

Thanks for the suggestions. I believe it was a retry (I'm using remote
logging, so I can only check after the task completes), but the UI never
reported it as such. The latest_heartbeat column is in the jobs table, and
interestingly I do see some running jobs that haven't heartbeated for ~22
minutes. They are LocalTaskJob instances with CeleryExecutor properly listed
as the executor class. I can't really correlate these to a specific task
instance, however, as there doesn't appear to be any key written to the jobs
table (the dag_id column is all null, and there's no task_id column or
anything).

Any ideas on what could be making these tasks stop heartbeating regularly?
That could explain why eventually (after an overnight period of time)
everything is marked as finished in the Airflow UI: eventually these tasks do
heartbeat again, but quite a long time after they have finished running.

Thanks again!

On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <[email protected]> wrote:

Are you sure there hasn't been a retry at that point? One of the expected
behaviors is the one I described: if a task finishes without reporting its
success (or failure), it will stay marked as RUNNING but will fail to emit a
heartbeat (a timestamp updated in the DB, latest_heartbeat or something like
that). The scheduler monitors for RUNNING tasks without a heartbeat and
eventually handles the failure (sends emails, calls on_failure_callback, and
so on).

Looking for the heartbeat in the DB might give you some clues as to what is
going on.

Also, there have been versions where we'd occasionally see double triggering
and double firing, which can be confusing. Then you can have different
processes reporting their status, and debugging those issues can be
problematic. I think there's good prevention against that now, using database
transactions as the task instance sets itself as RUNNING. I'm not sure 1.8.0
is 100% clean in that regard.

Max
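For anyone who wants to repeat the check Max suggests and Marc describes, here
is a rough diagnostic that lists running jobs whose heartbeat has gone stale.
The table and column names (job, job_type, state, latest_heartbeat) follow
what the thread reports for 1.8 and may differ in other versions, and the
connection URL is a placeholder; treat this as a sketch only.

    # Rough diagnostic, not an official tool: list running jobs whose
    # heartbeat has gone stale. Table/column names are taken from the thread
    # and may differ across Airflow versions.
    from datetime import datetime, timedelta
    from sqlalchemy import create_engine, text

    # placeholder URL: point this at your own Airflow metadata database
    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")
    cutoff = datetime.utcnow() - timedelta(minutes=5)

    stale_jobs = text("""
        SELECT id, job_type, state, latest_heartbeat
        FROM job
        WHERE state = 'running'
          AND latest_heartbeat < :cutoff
        ORDER BY latest_heartbeat
    """)

    with engine.connect() as conn:
        for row in conn.execute(stale_jobs, {"cutoff": cutoff}):
            print(row)

With the default heartbeat interval of a few seconds, this should normally
come back empty; jobs that sit here for tens of minutes match the ~22-minute
gap Marc observed.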
> > > > >>>>>> ᐧ > > > > >>>>>> > > > > >>>>>> -- > > > > >>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and > > > > >>>> Engagement > > > > >>>>> | > > > > >>>>>> New Relic > > > > >>>>>> > > > > >>>>>> On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson < > > > > >> [email protected]> > > > > >>>>>> wrote: > > > > >>>>>> > > > > >>>>>>> We have the exact same problem. In our case, it's a bash > > operator > > > > >>>>>> starting > > > > >>>>>>> a docker container. The container and process it ran exit, > but > > > > >> the > > > > >>>>>> 'docker > > > > >>>>>>> run' command is still showing up in the process table, > waiting > > > > >> for > > > > >>> an > > > > >>>>>>> event. > > > > >>>>>>> I'm trying to switch to LocalExecutor to see if that will > help. > > > > >>>>>>> > > > > >>>>>>> _jonas > > > > >>>>>>> > > > > >>>>>>> > > > > >>>>>>> On Thu, Jul 27, 2017 at 4:28 PM Marc Weil < > [email protected]> > > > > >>>> wrote: > > > > >>>>>>> > > > > >>>>>>>> Hello, > > > > >>>>>>>> > > > > >>>>>>>> Has anyone seen the behavior when using CeleryExecutor where > > > > >>>> workers > > > > >>>>>> will > > > > >>>>>>>> finish their tasks ("INFO - Task exited with return code 0" > > > > >> shows > > > > >>>> in > > > > >>>>>> the > > > > >>>>>>>> logs) but are never marked as complete in the airflow DB or > > UI? > > > > >>>>>>> Effectively > > > > >>>>>>>> this causes tasks to hang even though they are complete, and > > > > >> the > > > > >>>> DAG > > > > >>>>>> will > > > > >>>>>>>> not continue. > > > > >>>>>>>> > > > > >>>>>>>> This is happening on 1.8.0. Anyone else seen this or perhaps > > > > >>> have a > > > > >>>>>>>> workaround? > > > > >>>>>>>> > > > > >>>>>>>> Thanks! > > > > >>>>>>>> > > > > >>>>>>>> -- > > > > >>>>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, > and > > > > >>>>>> Engagement > > > > >>>>>>> | > > > > >>>>>>>> New Relic > > > > >>>>>>>> ᐧ > > > > >>>>>>>> > > > > >>>>>>> > > > > >>>>>> > > > > >>>>> > > > > >>>> > > > > >>> > > > > >> > > > > > > > -- > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and > Engagement > > | > > > New Relic > > > > > > -- > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | > New Relic >
