What version of Airflow? Off the top of my head, for 1.8.x:

1) We do use DB locking.
2) We check the state after we get the lock.
3) I don't think the task sets a state if it finds out it is running somewhere else.

Maybe we do something at the executor/scheduler level; that I need to investigate, and whether this issue is still present on a recent version.

Bolke

Sent from my iPhone
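To make points 1) and 2) concrete, here is a minimal sketch of the lock-then-recheck pattern against a hypothetical task_instance table, using SQLAlchemy's SELECT ... FOR UPDATE. It illustrates the general technique only; the model, column names, and connection string are assumptions for the example, not Airflow's actual code.

    # Illustrative sketch of "acquire a row lock, then re-check state before
    # setting RUNNING". Schema and names are assumptions, not Airflow's own.
    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class TaskInstance(Base):
        __tablename__ = "task_instance"   # hypothetical table for the example
        id = Column(Integer, primary_key=True)
        state = Column(String(20))

    engine = create_engine("postgresql+psycopg2://airflow@localhost/airflow")  # assumed DSN
    Session = sessionmaker(bind=engine)

    def try_mark_running(session, ti_id):
        """Return True if this worker won the race and set state to 'running'."""
        # SELECT ... FOR UPDATE holds a row lock until the transaction ends,
        # so two workers cannot both pass this point for the same row.
        ti = (
            session.query(TaskInstance)
            .filter(TaskInstance.id == ti_id)
            .with_for_update()
            .one()
        )
        # Re-check the state only after the lock is held.
        if ti.state == "running":
            session.rollback()  # another worker got here first; leave its state alone
            return False
        ti.state = "running"
        session.commit()
        return True

With the lock held, the second worker in the race George describes below would block at with_for_update() and then see state == "running" instead of starting the task a second time.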
> On 4 Aug 2017, at 19:24, George Leslie-Waksman <[email protected]> wrote:
>
> Pretty sure (not 100%) what is happening is:
>
> 1. Scheduler bugs result in the task getting scheduled twice
> 2. Worker 1 grabs the task
> 3. Worker 2 grabs the task
> 4. Worker 1 starts the task
> 5. Worker 2 starts the task
> 6. Worker 2 sees that Worker 1 has started and plans to abort
> 7. Worker 1 finishes and marks the task as done
> 8. Worker 2 finishes aborting and marks the task as not done
>
>> On Fri, Jul 28, 2017 at 3:50 PM Marc Weil <[email protected]> wrote:
>>
>> Hey Max,
>>
>> Thanks for the suggestions. I believe it was a retry (I'm using remote logging, so I can only check after the task completes), but the UI never reported it as such. The latest_heartbeat column is in the jobs table, and interestingly I do see some running jobs that haven't heartbeated for ~22 minutes. They are LocalTaskJob instances with CeleryExecutor properly listed as the executor class. I can't really correlate these to a specific task instance, however, as there doesn't appear to be any key written to the jobs table (the dag_id column is all null, and there's no task_id column or anything).
>>
>> Any ideas on what could be making these tasks stop heartbeating regularly? That could explain why eventually (after an overnight period of time) everything is marked as finished in the Airflow UI: eventually these tasks do heartbeat again, but quite long after they are finished running.
>>
>> Thanks again!
>>
>> --
>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
>>
>> On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <[email protected]> wrote:
>>
>>> Are you sure there hasn't been a retry at that point? [One of] the expected behaviors is the one I described: if a task finishes without reporting its success [or failure], it will stay marked as RUNNING but will fail to emit a heartbeat (which is a timestamp updated in the task_instance table as last_heartbeat or something). The scheduler monitors for RUNNING tasks without a heartbeat and eventually handles the failure (sends emails, calls on_failure_callback, ...).
>>>
>>> Looking for heartbeats in the DB might give you some clues as to what is going on.
>>>
>>> Also, there have been versions where we'd occasionally see double triggering and double firing, which can be confusing. Then you can have different processes reporting their status, and debugging those issues can be problematic. I think there's good prevention against that now, using database transactions as the task instance sets itself as RUNNING. I'm not sure 1.8.0 is 100% clean in that regard.
>>>
>>> Max
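As one way to do the "looking for heartbeat in the DB" that Max suggests, below is a sketch of a query for running jobs whose latest_heartbeat has gone stale, along the lines of what Marc reports seeing. The table and column names (job, state, job_type, executor_class, latest_heartbeat) and the five-minute cutoff are assumptions based on this thread; check them against your own metadata database and scheduler settings.

    # Hedged example: list "running" jobs that have not heartbeated recently.
    # Table/column names and the cutoff are assumptions; verify against your DB.
    from datetime import datetime, timedelta

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg2://airflow@localhost/airflow")  # assumed DSN
    cutoff = datetime.utcnow() - timedelta(minutes=5)  # assumed staleness threshold

    stale_jobs_sql = text(
        """
        SELECT id, job_type, executor_class, latest_heartbeat
        FROM job
        WHERE state = 'running'
          AND latest_heartbeat < :cutoff
        ORDER BY latest_heartbeat
        """
    )

    with engine.connect() as conn:
        for row in conn.execute(stale_jobs_sql, {"cutoff": cutoff}):
            print(row)

Jobs that linger in this result long after their tasks have finished would match Marc's observation of LocalTaskJob rows that stop heartbeating for ~22 minutes, and they are the candidates for the scheduler-side failure handling Max describes.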
>>>> On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil <[email protected]> wrote:
>>>>
>>>> It happens mostly when the scheduler is catching up. More specifically, it happens when I load a brand-new DAG with a start date in the past. Usually I have it set to run 5 DAG runs at the same time, and up to 16 tasks at the same time.
>>>>
>>>> What I've also noticed is that the tasks will sit completed in reality but uncompleted in the Airflow DB for many hours, but if I just leave them all sitting there overnight they all tend to be marked complete the next morning. Perhaps this points to some sort of Celery timeout or connection retry interval?
>>>>
>>>> --
>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
>>>>
>>>> On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <[email protected]> wrote:
>>>>
>>>>> By the time "INFO - Task exited with return code 0" gets logged, the task should have been marked as successful by the subprocess. I have no specific intuition as to what the issue may be.
>>>>>
>>>>> I'm guessing at that point the job stops emitting heartbeats and eventually the scheduler will handle it as a failure?
>>>>>
>>>>> How often does that happen?
>>>>>
>>>>> Max
>>>>>
>>>>> On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <[email protected]> wrote:
>>>>>
>>>>>> From what I can tell, it only affects CeleryExecutor. I've never seen this behavior with LocalExecutor before.
>>>>>>
>>>>>> Max, do you know anything about this type of failure mode?
>>>>>>
>>>>>> --
>>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
>>>>>>
>>>>>> On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <[email protected]> wrote:
>>>>>>
>>>>>>> We have the exact same problem. In our case, it's a bash operator starting a Docker container. The container and the process it ran exit, but the 'docker run' command is still showing up in the process table, waiting for an event.
>>>>>>>
>>>>>>> I'm trying to switch to LocalExecutor to see if that will help.
>>>>>>>
>>>>>>> _jonas
>>>>>>>
>>>>>>> On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Has anyone seen the behavior when using CeleryExecutor where workers finish their tasks ("INFO - Task exited with return code 0" shows in the logs) but the tasks are never marked as complete in the Airflow DB or UI? Effectively this causes tasks to hang even though they are complete, and the DAG will not continue.
>>>>>>>>
>>>>>>>> This is happening on 1.8.0. Anyone else seen this or perhaps have a workaround?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
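On Marc's question above about a Celery timeout: one setting worth ruling out with CeleryExecutor on a Redis or SQS broker is Celery's visibility_timeout. A task that is not acknowledged within that window can be redelivered to a second worker, which looks a lot like the double triggering discussed earlier in the thread. Below is a sketch of raising it on a plain Celery app; how (and whether) this is wired through airflow.cfg depends on the Airflow version, and the broker URL and timeout value are assumptions for the example.

    # Illustrative Celery broker settings, not Airflow's shipped defaults.
    # With Redis or SQS brokers, an unacknowledged task can be redelivered to
    # another worker once visibility_timeout expires.
    from celery import Celery

    app = Celery(
        "airflow_celery_demo",              # hypothetical app name for this sketch
        broker="redis://localhost:6379/0",  # assumed broker URL
    )

    app.conf.update(
        broker_transport_options={
            # Set this comfortably above the longest expected task runtime (seconds).
            "visibility_timeout": 12 * 60 * 60,
        },
    )

This only applies to brokers that emulate acknowledgements (Redis, SQS); with RabbitMQ the redelivery semantics are different, so it would not explain the behavior there.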
