Sorry Marc, I must be tired; I responded to the wrong thread. I was thinking of Airflow CPU usage in the browser.
Apologies.

Best,
Arthur

On Fri, Aug 4, 2017 at 2:26 PM, Marc Weil <[email protected]> wrote:

This particular behavior I was reporting isn't related to the browser. The
tasks remain in the incorrect state in the database as well.

On Fri, Aug 4, 2017 at 5:15 PM, Arthur Wiedmer <[email protected]> wrote:

I have noticed the issue more prominently with Chrome after v54, and I was
wondering whether that was related at all. Somehow, Airflow seems kinder on
Firefox. Does anyone else have a similar experience?

Best,
Arthur

On Fri, Aug 4, 2017 at 12:02 PM, Marc Weil <[email protected]> wrote:

This is on version 1.8.0. I don't recall seeing this behavior when I was
running 1.7.x.

On Fri, Aug 4, 2017 at 2:58 PM, Bolke de Bruin <[email protected]> wrote:

What version of Airflow? Off the top of my head, for 1.8.x:

1) we do use DB locking
2) we check the state after we get the lock
3) I don't think the task sets a state if it finds out it is running
   somewhere else

Maybe we do something at the executor/scheduler level. I need to
investigate that, if this issue is on a recent version.

Bolke

On 4 Aug 2017, at 19:24, George Leslie-Waksman <[email protected]> wrote:

Pretty sure (not 100%) what is happening is:

1. Scheduler bugs result in the task getting scheduled twice
2. Worker 1 grabs the task
3. Worker 2 grabs the task
4. Worker 1 starts the task
5. Worker 2 starts the task
6. Worker 2 sees that Worker 1 has started and plans to abort
7. Worker 1 finishes and marks the task as done
8. Worker 2 finishes aborting and marks the task as not done
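George's sequence plus Bolke's points (1) and (2) describe a lost-update race
and its usual guard: take a row-level lock on the task instance, then re-check
its state before writing. Below is a minimal, self-contained sketch of that
pattern, assuming SQLAlchemy and a deliberately simplified stand-in for the
task_instance table; this is illustrative only, not Airflow's actual code.

    # Minimal sketch of "lock the row, then re-check state before writing".
    # Not Airflow's actual code; the model below is a simplified stand-in
    # for the task_instance table.
    from sqlalchemy import Column, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class TaskInstance(Base):
        __tablename__ = "task_instance"      # illustrative only
        task_id = Column(String, primary_key=True)
        dag_id = Column(String, primary_key=True)
        execution_date = Column(String, primary_key=True)
        state = Column(String)               # queued / running / success ...

    engine = create_engine("sqlite://")      # FOR UPDATE is a no-op on SQLite;
    Base.metadata.create_all(engine)         # use Postgres/MySQL for real locks
    Session = sessionmaker(bind=engine)

    def try_to_claim(session, task_id, dag_id, execution_date):
        """Return True only if this worker won the right to run the task."""
        ti = (
            session.query(TaskInstance)
            .filter_by(task_id=task_id, dag_id=dag_id,
                       execution_date=execution_date)
            .with_for_update()               # a second worker blocks here
            .one()
        )
        if ti.state not in (None, "queued", "up_for_retry"):
            session.commit()                 # someone else already claimed it;
            return False                     # do not overwrite their state
        ti.state = "running"
        session.commit()                     # commit releases the row lock
        return True

    session = Session()
    session.add(TaskInstance(task_id="t1", dag_id="d1",
                             execution_date="2017-07-28", state="queued"))
    session.commit()
    print(try_to_claim(session, "t1", "d1", "2017-07-28"))   # True
    print(try_to_claim(session, "t1", "d1", "2017-07-28"))   # False, taken

The same re-check-before-write idea, applied again when Worker 2 aborts at
step 8, would keep it from overwriting the success Worker 1 already recorded.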
On Fri, Jul 28, 2017 at 3:50 PM, Marc Weil <[email protected]> wrote:

Hey Max,

Thanks for the suggestions. I believe it was a retry (I'm using remote
logging, so I can only check after the task completes), but the UI never
reported it as such. The latest_heartbeat column is in the jobs table, and
interestingly I do see some running jobs that haven't heartbeated for ~22
minutes. They are LocalTaskJob instances with CeleryExecutor properly listed
as the executor class. I can't really correlate these to a specific task
instance, however, as there doesn't appear to be any key written to the jobs
table (the dag_id column is all null, and there's no task_id column or
anything).

Any ideas on what could be making these tasks stop heartbeating regularly?
That could explain why eventually (after an overnight period of time)
everything is marked as finished in the Airflow UI: eventually these tasks do
heartbeat again, but quite a long time after they have finished running.

Thanks again!

On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <[email protected]> wrote:

Are you sure there hasn't been a retry at that point? One of the expected
behaviors is the one I described: if a task finishes without reporting its
success (or failure), it will stay marked as RUNNING but will fail to emit a
heartbeat (a timestamp updated in the DB, latest_heartbeat or something like
that). The scheduler monitors for RUNNING tasks without a heartbeat and
eventually handles the failure (sends emails, calls on_failure_callback, and
so on).

Looking for the heartbeat in the DB might give you some clues as to what is
going on.

Also, there have been versions where we'd occasionally see double triggering
and double firing, which can be confusing. Then you can have different
processes reporting their status, and debugging those issues can be
problematic. I think there's good prevention against that now, using database
transactions as the task instance sets itself as RUNNING. I'm not sure 1.8.0
is 100% clean in that regard.

Max
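For anyone who wants to repeat the check Max suggests and Marc describes, here
is a rough diagnostic that lists running jobs whose heartbeat has gone stale.
The table and column names (job, job_type, state, latest_heartbeat) follow
what the thread reports for 1.8 and may differ in other versions, and the
connection URL is a placeholder; treat this as a sketch only.

    # Rough diagnostic, not an official tool: list running jobs whose
    # heartbeat has gone stale. Table/column names are taken from the thread
    # and may differ across Airflow versions.
    from datetime import datetime, timedelta
    from sqlalchemy import create_engine, text

    # placeholder URL: point this at your own Airflow metadata database
    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")
    cutoff = datetime.utcnow() - timedelta(minutes=5)

    stale_jobs = text("""
        SELECT id, job_type, state, latest_heartbeat
        FROM job
        WHERE state = 'running'
          AND latest_heartbeat < :cutoff
        ORDER BY latest_heartbeat
    """)

    with engine.connect() as conn:
        for row in conn.execute(stale_jobs, {"cutoff": cutoff}):
            print(row)

With the default heartbeat interval of a few seconds, this should normally
come back empty; jobs that sit here for tens of minutes match the ~22-minute
gap Marc observed.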
> > > > >>>>>> ᐧ > > > > >>>>>> > > > > >>>>>> -- > > > > >>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and > > > > >>>> Engagement > > > > >>>>> | > > > > >>>>>> New Relic > > > > >>>>>> > > > > >>>>>> On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson < > > > > >> [email protected]> > > > > >>>>>> wrote: > > > > >>>>>> > > > > >>>>>>> We have the exact same problem. In our case, it's a bash > > operator > > > > >>>>>> starting > > > > >>>>>>> a docker container. The container and process it ran exit, > but > > > > >> the > > > > >>>>>> 'docker > > > > >>>>>>> run' command is still showing up in the process table, > waiting > > > > >> for > > > > >>> an > > > > >>>>>>> event. > > > > >>>>>>> I'm trying to switch to LocalExecutor to see if that will > help. > > > > >>>>>>> > > > > >>>>>>> _jonas > > > > >>>>>>> > > > > >>>>>>> > > > > >>>>>>> On Thu, Jul 27, 2017 at 4:28 PM Marc Weil < > [email protected]> > > > > >>>> wrote: > > > > >>>>>>> > > > > >>>>>>>> Hello, > > > > >>>>>>>> > > > > >>>>>>>> Has anyone seen the behavior when using CeleryExecutor where > > > > >>>> workers > > > > >>>>>> will > > > > >>>>>>>> finish their tasks ("INFO - Task exited with return code 0" > > > > >> shows > > > > >>>> in > > > > >>>>>> the > > > > >>>>>>>> logs) but are never marked as complete in the airflow DB or > > UI? > > > > >>>>>>> Effectively > > > > >>>>>>>> this causes tasks to hang even though they are complete, and > > > > >> the > > > > >>>> DAG > > > > >>>>>> will > > > > >>>>>>>> not continue. > > > > >>>>>>>> > > > > >>>>>>>> This is happening on 1.8.0. Anyone else seen this or perhaps > > > > >>> have a > > > > >>>>>>>> workaround? > > > > >>>>>>>> > > > > >>>>>>>> Thanks! > > > > >>>>>>>> > > > > >>>>>>>> -- > > > > >>>>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, > and > > > > >>>>>> Engagement > > > > >>>>>>> | > > > > >>>>>>>> New Relic > > > > >>>>>>>> ᐧ > > > > >>>>>>>> > > > > >>>>>>> > > > > >>>>>> > > > > >>>>> > > > > >>>> > > > > >>> > > > > >> > > > > > > > -- > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and > Engagement > > | > > > New Relic > > > > > > -- > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | > New Relic >
