Hey Max,

Thanks for the suggestions. I believe it was a retry (I'm using remote logging, so I can only check after the task completes), but the UI never reported it as such. The latest_heartbeat column is in the jobs table, and interestingly I do see some running jobs that haven't heartbeated for ~22 minutes. They are LocalTaskJob instances with CeleryExecutor correctly listed as the executor class. I can't really correlate these to a specific task instance, however, as there doesn't appear to be any identifying key written to the jobs table (the dag_id column is all null, and there's no task_id column or anything).
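In case it helps, here's roughly how I'm pulling those stale jobs out of the metadata DB. This is a rough sketch written from memory against the 1.8.0 schema, so the BaseJob model/column names and the 5-minute cutoff are assumptions on my part; adjust as needed:

    # Sketch: list LocalTaskJob rows that are still "running" but have a stale heartbeat.
    # Assumes the Airflow 1.8.0 ORM (BaseJob in airflow.jobs) and its job_type, state,
    # latest_heartbeat, and executor_class columns; adjust if your schema differs.
    from datetime import datetime, timedelta

    from airflow import settings
    from airflow.jobs import BaseJob

    session = settings.Session()
    cutoff = datetime.utcnow() - timedelta(minutes=5)  # heartbeats are normally seconds apart

    stale_jobs = (
        session.query(BaseJob)
        .filter(BaseJob.job_type == 'LocalTaskJob')
        .filter(BaseJob.state == 'running')
        .filter(BaseJob.latest_heartbeat < cutoff)
        .all()
    )

    for job in stale_jobs:
        # dag_id is null on these rows and there's no task_id column,
        # which is why mapping them back to a task instance is hard.
        print(job.id, job.state, job.latest_heartbeat, job.executor_class)

    session.close()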
Any ideas on what could be making these tasks stop heartbeating regularly? That could explain why everything eventually (after an overnight wait) gets marked as finished in the Airflow UI: these tasks do heartbeat again eventually, but long after they have finished running.

Thanks again!

--
Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic

On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <[email protected]> wrote:

> Are you sure there hasn't been a retry at that point? [One of] the expected
> behaviors is the one I described: if a task finishes without reporting
> its success [or failure], it will stay marked as RUNNING, but will fail to
> emit a heartbeat (which is a timestamp updated in the task_instance table
> as last_heartbeat or something). The scheduler monitors for RUNNING tasks
> without a heartbeat and eventually will handle the failure (send emails,
> call on_failure_callback, ...).
>
> Looking for the heartbeat in the DB might give you some clues as to what is
> going on.
>
> Also, there have been versions where we'd occasionally see double
> triggering and double firing, which can be confusing. Then you can have
> different processes reporting their status, and debugging those issues can
> be problematic. I think there's good prevention against that now, using
> database transactions as the task instance sets itself as RUNNING. I'm not
> sure if 1.8.0 is 100% clean in that regard.
>
> Max
>
> On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil <[email protected]> wrote:
>
> > It happens mostly when the scheduler is catching up. More specifically,
> > when I load a brand new DAG with a start date in the past. Usually I have
> > it set to run 5 DAG runs at the same time, and up to 16 tasks at the same
> > time.
> >
> > What I've also noticed is that the tasks will sit completed in reality but
> > uncompleted in the Airflow DB for many hours, yet if I just leave them all
> > sitting there overnight they all tend to be marked complete the next
> > morning. Perhaps this points to some sort of Celery timeout or connection
> > retry interval?
> >
> > --
> > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> > New Relic
> >
> > On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <
> > [email protected]> wrote:
> >
> > > By the time "INFO - Task exited with return code 0" gets logged, the
> > > task should have been marked as successful by the subprocess. I have no
> > > specific intuition as to what the issue may be.
> > >
> > > I'm guessing at that point the job stops emitting heartbeats and
> > > eventually the scheduler will handle it as a failure?
> > >
> > > How often does that happen?
> > >
> > > Max
> > >
> > > On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <[email protected]> wrote:
> > >
> > > > From what I can tell, it only affects CeleryExecutor. I've never seen
> > > > this behavior with LocalExecutor before.
> > > >
> > > > Max, do you know anything about this type of failure mode?
> > > >
> > > > --
> > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > > > Engagement | New Relic
> > > >
> > > > On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <[email protected]>
> > > > wrote:
> > > >
> > > > > We have the exact same problem. In our case, it's a bash operator
> > > > > starting a docker container. The container and the process it ran
> > > > > exit, but the 'docker run' command is still showing up in the
> > > > > process table, waiting for an event.
> > > > > I'm trying to switch to LocalExecutor to see if that will help.
> > > > >
> > > > > _jonas
> > > > >
> > > > > On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Has anyone seen the behavior when using CeleryExecutor where
> > > > > > workers will finish their tasks ("INFO - Task exited with return
> > > > > > code 0" shows in the logs) but are never marked as complete in the
> > > > > > Airflow DB or UI? Effectively this causes tasks to hang even though
> > > > > > they are complete, and the DAG will not continue.
> > > > > >
> > > > > > This is happening on 1.8.0. Anyone else seen this or perhaps have a
> > > > > > workaround?
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > --
> > > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > > > > > Engagement | New Relic
