This is on version 1.8.0. I don't recall seeing this behavior when I was running 1.7.x.

On Fri, Aug 4, 2017 at 2:58 PM Bolke de Bruin <[email protected]> wrote:
> What version of airflow? From the top of my mind 1.8.X
>
> 1) we do use db locking
> 2) we check the state after we get the lock
> 3) I don't think the task sets a state if it finds out it is running somewhere else
>
> Maybe we do something at the executor/scheduler level. That I need to investigate if this issue is on a recent version.
>
> Bolke
>
> Sent from my iPhone
>
>> On 4 Aug 2017, at 19:24, George Leslie-Waksman <[email protected]> wrote:
>>
>> Pretty sure (not 100%) what is happening is:
>>
>> 1. Scheduler bugs result in the task getting scheduled twice
>> 2. Worker 1 grabs the task
>> 3. Worker 2 grabs the task
>> 4. Worker 1 starts the task
>> 5. Worker 2 starts the task
>> 6. Worker 2 sees that Worker 1 has started and plans to abort
>> 7. Worker 1 finishes and marks the task as done
>> 8. Worker 2 finishes aborting and marks the task as not done
>>
>>> On Fri, Jul 28, 2017 at 3:50 PM Marc Weil <[email protected]> wrote:
>>>
>>> Hey Max,
>>>
>>> Thanks for the suggestions. I believe it was a retry (I'm using remote logging, so I can only check after the task completes), but the UI never reported it as such. The latest_heartbeat column is in the jobs table, and interestingly I do see some running jobs that haven't heartbeated for ~22 minutes. They are LocalTaskJob instances with CeleryExecutor properly listed as the executor class. I can't really correlate these to a specific task instance, however, as there doesn't appear to be any key written to the jobs table (the dag_id column is all null, and there's no task_id column or anything).
>>>
>>> Any ideas on what could be making these tasks stop heartbeating regularly? That could explain why eventually (after an overnight period of time) everything is marked as finished in the Airflow UI: eventually these tasks do heartbeat again, but quite long after they have finished running.
>>>
>>> Thanks again!
>>>
>>> --
>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
>>>
>>>> On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <[email protected]> wrote:
>>>>
>>>> Are you sure there hasn't been a retry at that point? [One of] the expected behaviors is the one I described: if a task finishes without reporting its success [or failure], it will stay marked as RUNNING but will fail to emit a heartbeat (which is a timestamp updated in the task_instance table as last_heartbeat or something). The scheduler monitors for RUNNING tasks without a heartbeat and eventually handles the failure (sends emails, calls on_failure_callback, ...).
>>>>
>>>> Looking for the heartbeat in the DB might give you some clues as to what is going on.
>>>>
>>>> Also, there have been versions where we'd occasionally see double triggering and double firing, which can be confusing. Then you can have different processes reporting their status, and debugging those issues can be problematic. I think there's good prevention against that now, using database transactions as the task instance sets itself as RUNNING. I'm not sure if 1.8.0 is 100% clean in that regard.
>>>>
>>>> Max
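The guard Bolke and Max are alluding to above is essentially "take a row lock, recheck the state, and only then write". The sketch below is a hand-rolled illustration of that pattern, not Airflow's actual code; the task_instance table/column names, state strings, and connection string are assumptions for illustration only.

    # A minimal sketch of the "lock, then recheck state" pattern discussed above.
    # NOT Airflow's implementation; table/column names are illustrative assumptions.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow@localhost/airflow")  # hypothetical DSN

    def try_claim_task(dag_id, task_id, execution_date):
        """Return True if this worker may run the task, False if another worker owns it."""
        with engine.begin() as conn:  # one transaction: lock -> recheck -> update
            row = conn.execute(
                text(
                    "SELECT state FROM task_instance "
                    "WHERE dag_id = :d AND task_id = :t AND execution_date = :e "
                    "FOR UPDATE"  # row lock; a second worker blocks here until we commit
                ),
                {"d": dag_id, "t": task_id, "e": execution_date},
            ).fetchone()

            # Recheck the state AFTER the lock is held. If another worker already
            # claimed or finished the task, back off without writing any state --
            # that is the step that would stop Worker 2 from overwriting Worker 1's
            # SUCCESS in the race George describes.
            if row is None or row[0] in ("running", "success"):
                return False

            conn.execute(
                text(
                    "UPDATE task_instance SET state = 'running' "
                    "WHERE dag_id = :d AND task_id = :t AND execution_date = :e"
                ),
                {"d": dag_id, "t": task_id, "e": execution_date},
            )
            return True

The key design point is that the state check and the state write happen inside the same transaction while the row lock is held; checking first and writing in a separate transaction would leave the same race open.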
>>>>> On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil <[email protected]> wrote:
>>>>>
>>>>> It happens mostly when the scheduler is catching up. More specifically, when I load a brand new DAG with a start date in the past. Usually I have it set to run 5 DAG runs at the same time, and up to 16 tasks at the same time.
>>>>>
>>>>> What I've also noticed is that the tasks will sit completed in reality but uncompleted in the Airflow DB for many hours, but if I just leave them all sitting there overnight they all tend to be marked complete the next morning. Perhaps this points to some sort of Celery timeout or connection retry interval?
>>>>>
>>>>> --
>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
>>>>>
>>>>>> On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <[email protected]> wrote:
>>>>>>
>>>>>> By the time "INFO - Task exited with return code 0" gets logged, the task should have been marked as successful by the subprocess. I have no specific intuition as to what the issue may be.
>>>>>>
>>>>>> I'm guessing at that point the job stops emitting a heartbeat and eventually the scheduler will handle it as a failure?
>>>>>>
>>>>>> How often does that happen?
>>>>>>
>>>>>> Max
>>>>>>
>>>>>>> On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <[email protected]> wrote:
>>>>>>>
>>>>>>> From what I can tell, it only affects CeleryExecutor. I've never seen this behavior with LocalExecutor before.
>>>>>>>
>>>>>>> Max, do you know anything about this type of failure mode?
>>>>>>>
>>>>>>> --
>>>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
>>>>>>>
>>>>>>>> On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <[email protected]> wrote:
>>>>>>>>
>>>>>>>> We have the exact same problem. In our case, it's a bash operator starting a docker container. The container and the process it ran exit, but the 'docker run' command is still showing up in the process table, waiting for an event.
>>>>>>>> I'm trying to switch to LocalExecutor to see if that will help.
>>>>>>>>
>>>>>>>> _jonas
>>>>>>>>
>>>>>>>>> On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Has anyone seen the behavior when using CeleryExecutor where workers will finish their tasks ("INFO - Task exited with return code 0" shows in the logs) but are never marked as complete in the airflow DB or UI? Effectively this causes tasks to hang even though they are complete, and the DAG will not continue.
>>>>>>>>>
>>>>>>>>> This is happening on 1.8.0. Anyone else seen this or perhaps have a workaround?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
>
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
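For anyone wanting to reproduce the check Marc describes (running jobs that have stopped heartbeating), something along these lines works against the metadata database. The jobs table and column names here follow what the thread reports (state, job_type, latest_heartbeat) but should be treated as assumptions about the schema; adjust them to your installation.

    # A quick look for "stopped heartbeating" jobs: running jobs whose
    # latest_heartbeat is older than some threshold. Schema names are
    # assumptions based on the thread, not a guaranteed layout.
    from datetime import datetime, timedelta
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow@localhost/airflow")  # hypothetical DSN

    cutoff = datetime.utcnow() - timedelta(minutes=5)  # "no heartbeat for 5 minutes"
    with engine.connect() as conn:
        stale = conn.execute(
            text(
                "SELECT id, job_type, state, latest_heartbeat FROM job "
                "WHERE state = 'running' AND latest_heartbeat < :cutoff "
                "ORDER BY latest_heartbeat"
            ),
            {"cutoff": cutoff},
        ).fetchall()

    for job_row in stale:
        print(job_row)  # e.g. LocalTaskJob rows that have not heartbeated recently

As noted in the thread, these job rows may not carry a dag_id/task_id key, so correlating a stale job back to a specific task instance may still require matching on timestamps or hostnames.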
