Are you sure there hasn't been a retry at that point? One of the expected behaviors is the one I described: if a task finishes without reporting its success or failure, it will stay marked as RUNNING but will stop emitting a heartbeat (a timestamp updated in the task_instance table, named last_heartbeat or something similar). The scheduler monitors for RUNNING tasks without a recent heartbeat and will eventually handle the failure (send emails, call on_failure_callback, ...).
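If you want to poke at this directly, something along these lines should show which task instances are stuck in RUNNING. This is only a sketch: the connection string is a placeholder for your sql_alchemy_conn, and the exact heartbeat column/table name varies by version, so verify against your metadata DB schema first.

    # Sketch: list task instances stuck in RUNNING in the Airflow metadata DB.
    # The connection string below is a placeholder; use your sql_alchemy_conn.
    # The heartbeat timestamp itself may live on task_instance or on the job
    # table (e.g. a latest_heartbeat column) depending on version -- check
    # your schema before relying on any column name.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")

    with engine.connect() as conn:
        rows = conn.execute(text(
            "SELECT dag_id, task_id, execution_date, state "
            "FROM task_instance WHERE state = 'running'"
        ))
        for row in rows:
            print(row)

For each task that shows up here, compare its heartbeat timestamp to the current time; a RUNNING task whose heartbeat stopped advancing is the situation described above.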
Looking at the heartbeat in the DB, as above, might give you some clues as to what is going on. Also, there have been versions where we'd occasionally see double triggering and double firing, which can be confusing: you end up with different processes reporting their status for the same task instance, and debugging those issues can be problematic. I think there's good prevention against that now, using a database transaction when the task instance sets itself to RUNNING (a rough sketch of that pattern is at the bottom of this mail), but I'm not sure 1.8.0 is 100% clean in that regard.

Max

On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil <[email protected]> wrote:

> It happens mostly when the scheduler is catching up. More specifically,
> when I load a brand new DAG with a start date in the past. Usually I have
> it set to run 5 DAG runs at the same time, and up to 16 tasks at the same
> time.
>
> What I've also noticed is that the tasks will sit completed in reality but
> uncompleted in the Airflow DB for many hours, but if I just leave them all
> sitting there over night they all tend to be marked complete the next
> morning. Perhaps this points to some sort of Celery timeout or connection
> retry interval?
>
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> New Relic
>
> On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <
> [email protected]> wrote:
>
> > By the time "INFO - Task exited with return code 0" gets logged, the
> > task should have been marked as successful by the subprocess. I have no
> > specific intuition as to what the issue may be.
> >
> > I'm guessing at that point the job stops emitting heartbeat and
> > eventually the scheduler will handle it as a failure?
> >
> > How often does that happen?
> >
> > Max
> >
> > On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <[email protected]> wrote:
> >
> > > From what I can tell, it only affects CeleryExecutor. I've never seen
> > > this behavior with LocalExecutor before.
> > >
> > > Max, do you know anything about this type of failure mode?
> > >
> > > --
> > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > > Engagement | New Relic
> > >
> > > On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <[email protected]>
> > > wrote:
> > >
> > > > We have the exact same problem. In our case, it's a bash operator
> > > > starting a docker container. The container and process it ran exit,
> > > > but the 'docker run' command is still showing up in the process
> > > > table, waiting for an event.
> > > > I'm trying to switch to LocalExecutor to see if that will help.
> > > >
> > > > _jonas
> > > >
> > > > On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <[email protected]> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Has anyone seen the behavior when using CeleryExecutor where
> > > > > workers will finish their tasks ("INFO - Task exited with return
> > > > > code 0" shows in the logs) but are never marked as complete in the
> > > > > airflow DB or UI? Effectively this causes tasks to hang even
> > > > > though they are complete, and the DAG will not continue.
> > > > >
> > > > > This is happening on 1.8.0. Anyone else seen this or perhaps have
> > > > > a workaround?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > --
> > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > > > > Engagement | New Relic
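For the transaction-based prevention mentioned above, the general idea is something like the following. This is only a rough sketch of the pattern, not Airflow's actual implementation; the table, columns, function name, and connection string are all placeholders.

    # Rough sketch of "lock the row before setting RUNNING" -- illustrative
    # only, not Airflow's real code. Table/column names are placeholders.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")

    def try_mark_running(dag_id, task_id, execution_date):
        with engine.begin() as conn:  # single transaction, commits on exit
            # SELECT ... FOR UPDATE locks the row, so a second worker that
            # picked up the same task instance blocks here until we commit.
            row = conn.execute(text(
                "SELECT state FROM task_instance "
                "WHERE dag_id = :d AND task_id = :t AND execution_date = :e "
                "FOR UPDATE"
            ), {"d": dag_id, "t": task_id, "e": execution_date}).fetchone()
            if row is None or row[0] == "running":
                return False  # another process already claimed this task
            conn.execute(text(
                "UPDATE task_instance SET state = 'running' "
                "WHERE dag_id = :d AND task_id = :t AND execution_date = :e"
            ), {"d": dag_id, "t": task_id, "e": execution_date})
            return True

Whichever process commits first wins; the other one sees state = 'running' and backs off instead of double-reporting.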
