This is on version 1.8.0. I don't recall seeing this behavior when I was
running 1.7.x.
On Fri, Aug 4, 2017 at 2:58 PM Bolke de Bruin <[email protected]> wrote:

> What version of Airflow? Off the top of my head, for 1.8.x:
>
> 1) we do use db locking
> 2) we check the state after we get the lock
> 3) I don't think the task sets a state if it finds out it is running
> somewhere else
>
> Maybe we do something at the executor/scheduler level. I need to
> investigate that if this issue is on a recent version.
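
For reference on points 1 and 2, here is a minimal sketch of the "take a row
lock, then re-check state" pattern, written with plain SQLAlchemy rather than
Airflow's actual internals; the model, connection string, and state values
below are illustrative only:

    from sqlalchemy import Column, String, create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class DemoTaskInstance(Base):
        # Illustrative stand-in, not Airflow's real task_instance model.
        __tablename__ = 'task_instance_demo'
        task_id = Column(String, primary_key=True)
        state = Column(String)

    engine = create_engine('postgresql:///airflow')  # placeholder DSN
    Session = sessionmaker(bind=engine)

    def try_to_claim(task_id):
        """Claim a task only if no other worker is already running it."""
        session = Session()
        try:
            # 1) take a row-level lock (SELECT ... FOR UPDATE)
            ti = (session.query(DemoTaskInstance)
                  .filter_by(task_id=task_id)
                  .with_for_update()
                  .one())
            # 2) re-check the state only after the lock is held
            if ti.state == 'running':
                session.rollback()
                return False  # another worker already owns this task
            ti.state = 'running'
            session.commit()
            return True
        finally:
            session.close()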
>
> Bolke
>
> Sent from my iPhone
>
> > On 4 Aug 2017, at 19:24, George Leslie-Waksman <[email protected]> wrote:
> >
> > Pretty sure (not 100%) what is happening is:
> >
> >   1. Scheduler bugs result in task getting scheduled twice
> >   2. Worker 1 grabs task
> >   3. Worker 2 grabs task
> >   4. Worker 1 starts task
> >   5. Worker 2 starts task
> >   6. Worker 2 sees that Worker 1 has started and plans to abort
> >   7. Worker 1 finishes and marks task as done
> >   8. Worker 2 finishes aborting and marks task as not done
> >
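
If that sequence is right, the lost update in steps 7 and 8 is the kind of
thing a conditional (compare-and-set) write guards against: the late abort
only sticks if the row is still in the state the aborting worker expects. A
minimal sketch of the idea (not Airflow's actual code; table, columns, and
connection string are illustrative):

    from sqlalchemy import create_engine, text

    engine = create_engine('postgresql:///airflow')  # placeholder DSN

    def abort_without_clobbering(task_id):
        # Record the abort only if the row is still 'running'; if another
        # worker already marked it 'success', the UPDATE matches zero rows
        # and the late abort becomes a no-op instead of undoing the result.
        with engine.begin() as conn:
            result = conn.execute(
                text(
                    "UPDATE task_instance_demo "
                    "SET state = 'failed' "
                    "WHERE task_id = :task_id AND state = 'running'"
                ),
                {"task_id": task_id},
            )
            return result.rowcount == 1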
> >
> >> On Fri, Jul 28, 2017 at 3:50 PM Marc Weil <[email protected]> wrote:
> >>
> >> Hey Max,
> >>
> >> Thanks for the suggestions. I believe it was a retry (I'm using remote
> >> logging so I can only check after the task completes), but the UI never
> >> reported it as such. The latest_heartbeat column is in the jobs table, and
> >> interestingly I do see some running jobs that haven't heartbeated for ~22
> >> minutes. They are LocalTaskJob instances with CeleryExecutor properly
> >> listed as the executor class. I can't really correlate these to a specific
> >> task instance, however, as there doesn't appear to be any key written to
> >> the jobs table (the dag_id column is all null, and there's no task_id
> >> column or anything).
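
To dig into those stale rows, something like the query below can list running
LocalTaskJob entries that have stopped heartbeating. It assumes the 1.8-era
schema (a job table with job_type, state, hostname, and latest_heartbeat
columns) and a placeholder connection string, so adjust to your setup:

    from datetime import datetime, timedelta
    from sqlalchemy import create_engine, text

    engine = create_engine('postgresql:///airflow')  # placeholder DSN

    # Running LocalTaskJob rows whose heartbeat is older than five minutes.
    cutoff = datetime.utcnow() - timedelta(minutes=5)
    with engine.connect() as conn:
        rows = conn.execute(
            text(
                "SELECT id, job_type, state, hostname, latest_heartbeat "
                "FROM job "
                "WHERE state = 'running' "
                "  AND job_type = 'LocalTaskJob' "
                "  AND latest_heartbeat < :cutoff"
            ),
            {"cutoff": cutoff},
        )
        for row in rows:
            print(row)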
> >>
> >> Any ideas on what could be making these tasks stop heartbeating regularly?
> >> That could explain why eventually (after an overnight period of time)
> >> everything is marked as finished in the Airflow UI: eventually these tasks
> >> do heartbeat again, but quite long after they are finished running.
> >>
> >> Thanks again!
> >>
> >> --
> >> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> >> New Relic
> >>
> >> On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <[email protected]> wrote:
> >>
> >>> Are you sure there hasn't been a retry at that point? [One of] the
> >>> expected behaviors is the one I described, where if a task finishes
> >>> without reporting its success [or failure], it will stay marked as
> >>> RUNNING but will fail to emit a heartbeat (which is a timestamp updated
> >>> in the task_instance table as last_heartbeat or something). The scheduler
> >>> monitors for RUNNING tasks without a heartbeat and eventually handles the
> >>> failure (sends emails, calls on_failure_callback, ...).
> >>>
> >>> Looking for heartbeat in the DB might give you some clues as to what is
> >>> going on.
> >>>
> >>> Also, there have been versions where we'd occasionally see double
> >>> triggering and double firing, which can be confusing. Then you can have
> >>> different processes reporting their status, and debugging those issues
> >>> can be problematic. I think there's good prevention against that now,
> >>> using database transactions as the task instance sets itself as RUNNING.
> >>> I'm not sure 1.8.0 is 100% clean in that regard.
> >>>
> >>> Max
> >>>
> >>>> On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil <[email protected]> wrote:
> >>>>
> >>>> It happens mostly when the scheduler is catching up. More specifically,
> >>>> when I load a brand new DAG with a start date in the past. Usually I
> >>>> have it set to run 5 DAG runs at the same time, and up to 16 tasks at
> >>>> the same time.
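
For context, those two throttles correspond to DAG-level settings; a minimal
1.8-era sketch, with the dag_id, schedule, and start date as placeholders:

    from datetime import datetime
    from airflow import DAG

    # At most 5 concurrent DAG runs and at most 16 running tasks for this
    # DAG while the scheduler catches up on the past start date.
    dag = DAG(
        dag_id='example_backfill_heavy_dag',  # placeholder
        start_date=datetime(2017, 1, 1),      # a start date in the past
        schedule_interval='@daily',
        max_active_runs=5,
        concurrency=16,
    )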
> >>>>
> >>>> What I've also noticed is that the tasks will sit completed in reality
> >>>> but uncompleted in the Airflow DB for many hours, yet if I just leave
> >>>> them all sitting there overnight they all tend to be marked complete the
> >>>> next morning. Perhaps this points to some sort of Celery timeout or
> >>>> connection retry interval?
> >>>>
> >>>> --
> >>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> >>>> New Relic
> >>>>
> >>>> On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <[email protected]> wrote:
> >>>>
> >>>>> By the time "INFO - Task exited with return code 0" gets logged, the
> >>>>> task should have been marked as successful by the subprocess. I have
> >>>>> no specific intuition as to what the issue may be.
> >>>>>
> >>>>> I'm guessing at that point the job stops emitting heartbeat and
> >>>>> eventually the scheduler will handle it as a failure?
> >>>>>
> >>>>> How often does that happen?
> >>>>>
> >>>>> Max
> >>>>>
> >>>>> On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <[email protected]> wrote:
> >>>>>
> >>>>>> From what I can tell, it only affects CeleryExecutor. I've never seen
> >>>>>> this behavior with LocalExecutor before.
> >>>>>>
> >>>>>> Max, do you know anything about this type of failure mode?
> >>>>>>
> >>>>>> --
> >>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> >>>>>> New Relic
> >>>>>>
> >>>>>> On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <[email protected]> wrote:
> >>>>>>
> >>>>>>> We have the exact same problem. In our case, it's a bash operator
> >>>>>>> starting a docker container. The container and process it ran exit,
> >>>>>>> but the 'docker run' command is still showing up in the process
> >>>>>>> table, waiting for an event.
> >>>>>>> I'm trying to switch to LocalExecutor to see if that will help.
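
For context, a minimal sketch of the kind of task being described, using the
1.8-era BashOperator import path; the image, command, and DAG wiring are
placeholders:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id='docker_via_bash_demo',  # placeholder
        start_date=datetime(2017, 1, 1),
        schedule_interval=None,
    )

    # The BashOperator blocks until 'docker run' exits; if that child
    # process lingers after the container finishes, the Celery worker slot
    # stays occupied even though the work is done.
    run_in_container = BashOperator(
        task_id='run_in_container',
        bash_command='docker run --rm my-image:latest ./do_work.sh',  # placeholder
        dag=dag,
    )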
> >>>>>>>
> >>>>>>> _jonas
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> Has anyone seen the behavior when using CeleryExecutor where
> >>>>>>>> workers will finish their tasks ("INFO - Task exited with return
> >>>>>>>> code 0" shows in the logs) but are never marked as complete in the
> >>>>>>>> airflow DB or UI? Effectively this causes tasks to hang even though
> >>>>>>>> they are complete, and the DAG will not continue.
> >>>>>>>>
> >>>>>>>> This is happening on 1.8.0. Anyone else seen this or perhaps have a
> >>>>>>>> workaround?
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> >>>>>>>> New Relic
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
-- 
Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
New Relic
