Pretty sure (not 100%) what is happening is:

1. Scheduler bugs result in the task getting scheduled twice
2. Worker 1 grabs the task
3. Worker 2 grabs the task
4. Worker 1 starts the task
5. Worker 2 starts the task
6. Worker 2 sees that Worker 1 has started and plans to abort
7. Worker 1 finishes and marks the task as done
8. Worker 2 finishes aborting and marks the task as not done
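A minimal sketch of the kind of compare-and-set guard that would break this sequence at step 5: the worker claims the task inside a single database transaction, so the second worker sees the row already in RUNNING and backs off. This is only an illustration, not Airflow's actual locking code; the table and column names (task_instance, state, dag_id, task_id, execution_date) follow the usual Airflow schema but should be treated as assumptions here, and the connection string is a placeholder.

    # Illustration only -- not Airflow's implementation.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")  # placeholder DSN

    def try_claim(dag_id, task_id, execution_date):
        """Return True only if this worker wins the race to run the task."""
        with engine.begin() as conn:  # one transaction: lock, check, update
            row = conn.execute(
                text(
                    "SELECT state FROM task_instance "
                    "WHERE dag_id = :d AND task_id = :t AND execution_date = :e "
                    "FOR UPDATE"  # a second worker blocks here until we commit
                ),
                {"d": dag_id, "t": task_id, "e": execution_date},
            ).fetchone()
            if row is None or row[0] in ("running", "success"):
                return False  # another worker already claimed or finished it
            conn.execute(
                text(
                    "UPDATE task_instance SET state = 'running' "
                    "WHERE dag_id = :d AND task_id = :t AND execution_date = :e"
                ),
                {"d": dag_id, "t": task_id, "e": execution_date},
            )
            return True

With a guard like this, Worker 2 at step 5 gets False and stops, instead of running the task and later flipping its state back to not done.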
On Fri, Jul 28, 2017 at 3:50 PM Marc Weil <[email protected]> wrote:

> Hey Max,
>
> Thanks for the suggestions. I believe it was a retry (I'm using remote logging so I can only check after the task completes), but the UI never reported it as such. The latest_heartbeat column is in the jobs table, and interestingly I do see some running jobs that haven't heartbeated for ~22 minutes. They are LocalTaskJob instances with CeleryExecutor properly listed as the executor class. I can't really correlate these to a specific task instance, however, as there doesn't appear to be any key written to the jobs table (the dag_id column is all null, and there's no task_id column or anything).
>
> Any ideas on what could be making these tasks stop heartbeating regularly? That could explain why eventually (after an overnight period of time) everything is marked as finished in the Airflow UI: eventually these tasks do heartbeat again, but quite long after they are finished running.
>
> Thanks again!
>
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
>
> On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <[email protected]> wrote:
>
> > Are you sure there hasn't been a retry at that point? [One of] the expected behaviors is the one I described, where if a task finished without reporting its success [or failure], it will stay marked as RUNNING but will fail to emit a heartbeat (which is a timestamp updated in the task_instance table as last_heartbeat or something). The scheduler monitors for RUNNING tasks without a heartbeat and eventually will handle the failure (send emails, call on_failure_callback, ...).
> >
> > Looking for the heartbeat in the DB might give you some clues as to what is going on.
> >
> > Also, there have been versions where we'd occasionally see double triggering and double firing, which can be confusing. Then you can have different processes reporting their status, and debugging those issues can be problematic. I think there's good prevention against that now, using database transactions as the task instance sets itself as RUNNING. I'm not sure if 1.8.0 is 100% clean in that regard.
> >
> > Max
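Grounded in the details above, a rough query for spotting those stale heartbeats. Marc calls it the jobs table; in the Airflow schemas I've seen the table is named job, with state, job_type, executor_class, and latest_heartbeat columns as mentioned in the thread -- worth verifying on 1.8.0. The connection string and the five-minute threshold are placeholders.

    # Rough diagnostic sketch, not part of Airflow itself.
    from datetime import datetime, timedelta
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")  # placeholder DSN
    cutoff = datetime.utcnow() - timedelta(minutes=5)  # "stale" threshold is arbitrary

    with engine.connect() as conn:
        rows = conn.execute(
            text(
                "SELECT id, job_type, executor_class, latest_heartbeat "
                "FROM job "  # Marc's "jobs table"; named job in the schemas I've seen
                "WHERE state = 'running' AND latest_heartbeat < :cutoff "
                "ORDER BY latest_heartbeat"
            ),
            {"cutoff": cutoff},
        )
        for r in rows:
            print(r)

Jobs that linger in this result set while their task logs already show "Task exited with return code 0" would match the behavior reported further down the thread.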
> > On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil <[email protected]> wrote:
> >
> > > It happens mostly when the scheduler is catching up. More specifically, when I load a brand new DAG with a start date in the past. Usually I have it set to run 5 DAG runs at the same time, and up to 16 tasks at the same time.
> > >
> > > What I've also noticed is that the tasks will sit completed in reality but uncompleted in the Airflow DB for many hours, but if I just leave them all sitting there overnight they all tend to be marked complete the next morning. Perhaps this points to some sort of Celery timeout or connection retry interval?
> > >
> > > --
> > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
> > >
> > > On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <[email protected]> wrote:
> > >
> > > > By the time "INFO - Task exited with return code 0" gets logged, the task should have been marked as successful by the subprocess. I have no specific intuition as to what the issue may be.
> > > >
> > > > I'm guessing at that point the job stops emitting heartbeats and eventually the scheduler will handle it as a failure?
> > > >
> > > > How often does that happen?
> > > >
> > > > Max
> > > >
> > > > On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <[email protected]> wrote:
> > > >
> > > > > From what I can tell, it only affects CeleryExecutor. I've never seen this behavior with LocalExecutor before.
> > > > >
> > > > > Max, do you know anything about this type of failure mode?
> > > > >
> > > > > --
> > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
> > > > >
> > > > > On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <[email protected]> wrote:
> > > > >
> > > > > > We have the exact same problem. In our case, it's a bash operator starting a docker container. The container and the process it ran exit, but the 'docker run' command is still showing up in the process table, waiting for an event.
> > > > > >
> > > > > > I'm trying to switch to LocalExecutor to see if that will help.
> > > > > >
> > > > > > _jonas
> > > > > >
> > > > > > On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <[email protected]> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > Has anyone seen the behavior when using CeleryExecutor where workers will finish their tasks ("INFO - Task exited with return code 0" shows in the logs) but are never marked as complete in the Airflow DB or UI? Effectively this causes tasks to hang even though they are complete, and the DAG will not continue.
> > > > > > >
> > > > > > > This is happening on 1.8.0. Anyone else seen this or perhaps have a workaround?
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > --
> > > > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
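One way to tie the silent jobs back to specific task instances, given the correlation problem Marc raised earlier: in the Airflow schemas I've seen, the task_instance table carries a job_id column that points at job.id, though that is an assumption worth verifying on 1.8.0 before relying on it. A rough sketch, again with a placeholder connection string:

    # Rough diagnostic sketch; assumes task_instance.job_id references job.id.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")  # placeholder DSN

    with engine.connect() as conn:
        rows = conn.execute(text(
            "SELECT ti.dag_id, ti.task_id, ti.execution_date, "
            "       j.job_type, j.state AS job_state, j.latest_heartbeat "
            "FROM task_instance ti "
            "LEFT JOIN job j ON j.id = ti.job_id "
            "WHERE ti.state = 'running' "
            "ORDER BY j.latest_heartbeat"
        ))
        for r in rows:
            print(r)

Task instances that stay in this result set long after their logs report "Task exited with return code 0", with a latest_heartbeat that has stopped advancing, would match the original report in this thread.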
