Hi Russell, Individual task instances connect to the database at an interval specified in your configuration file (30 secs by default) to emit heartbeats. In recent versions, at each heartbeat, if the task's entry in the task_instance table has been set to "shutdown", has been deleted, or is otherwise in a state other than "running", the process will shut itself down properly. Shutting down properly means running the operator's `on_kill` method if defined, running the on_failure/on_retry callback if specified, and bumping the retry number by one (though that may happen on task restart, I forget) when marking the task_instance entry as failed (from memory). To make this possible, all tasks run in a subprocess so that the parent process can handle the logic described above.
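To make the heartbeat rules concrete, here's a rough sketch in Python of the decision the task process makes on each heartbeat. This is illustrative only, not the actual Airflow source; the `TaskInstanceRow` type and `should_self_terminate` function are names I made up for the example:

```python
# Illustrative sketch (not actual Airflow code): at each heartbeat the
# task process re-reads its own row from the task_instance table and
# decides whether to shut itself down.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskInstanceRow:
    # e.g. "running", "shutdown", "failed", "up_for_retry", ...
    state: str

def should_self_terminate(row: Optional[TaskInstanceRow]) -> bool:
    """Return True if the task process should shut itself down.

    Mirrors the rules described above: terminate when the row was
    deleted (row is None), was explicitly set to "shutdown", or is
    in any state other than "running".
    """
    if row is None:                 # row deleted from task_instance
        return True
    if row.state == "shutdown":     # explicit shutdown request
        return True
    return row.state != "running"   # any other non-running state
```

So clearing the row in the database, as you did, is one of the conditions that makes the next heartbeat trigger a clean shutdown.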
Similarly, if a task fails to emit heartbeats for a certain period of time while its state is still set to "running", the scheduler will handle the failure itself. If retries are allowed for that task, it will be re-triggered on the following scheduler cycle. We used to have problems with "zombies" and "undead", but from my understanding the vast majority of these have been addressed. At scale pretty much anything can and will happen, and you may in rare cases have to kill some zombies on worker boxes where, say, both the parent and subprocess are held up for some odd reason. Please share and try to get to the bottom of it if that does happen in your environment. If it's somewhat minimal or very sporadic, I'd advise automating a distributed unix command that kills old processes, targeting your specific identified issue. Max On Tue, Feb 14, 2017 at 5:52 PM, Russell Jurney <[email protected]> wrote: > Ok, I deleted all references to the dag_id of this task in dag_run, jobs > and task_instance. > > The database doesn't seem to control this. What does? > > --- > Russell Jurney @rjurney <http://twitter.com/rjurney> > [email protected] LI <http://linkedin.com/in/russelljurney> FB > <http://facebook.com/jurney> datasyndrome.com > > On Tue, Feb 14, 2017 at 5:26 PM, Russell Jurney <[email protected]> > wrote: > > > I had a backfill operation that failed, and now I can't stop it from > > running! I have tried many times to clear the tasks, but this has no > > effect. I have tried stopping, clearing and restarting the scheduler, but > > this has no effect. > > > > I have opened the sqlite DB and want to remove the record that is causing > > the job to run, but I don't know which table (there are lots!)? Is it > just > > the database, or is there a file some place that I need to edit? 
> > > > Please help, because I run one thread on SQLite and so I can't get any > > other tasks to run until I clear this one :( > > --- > > Russell Jurney @rjurney <http://twitter.com/rjurney> > > [email protected] LI <http://linkedin.com/in/russelljurney> FB > > <http://facebook.com/jurney> datasyndrome.com > > >
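P.S. As an example of the kind of "distributed unix command" I mean, here's a hedged sketch: find long-running `airflow run` processes on a worker box and kill them. The `airflow run` pattern and the one-day cutoff are assumptions; tailor both to the specific zombie signature you actually identify in your environment, and run with `echo` before wiring in `kill`:

```shell
#!/bin/sh
# Sketch only: print the PIDs of "airflow run" processes older than
# $1 seconds, reading "pid etimes args" lines from stdin, i.e. the
# output of:  ps -eo pid=,etimes=,args=
stale_task_pids() {
    max_age="$1"
    while read -r pid etimes args; do
        case "$args" in
            *"airflow run"*)
                # etimes is the process's elapsed time in seconds
                [ "$etimes" -gt "$max_age" ] && echo "$pid"
                ;;
        esac
    done
}

# Usage on a worker box (verify the PID list before enabling kill):
# ps -eo pid=,etimes=,args= | stale_task_pids 86400 | xargs -r kill
```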
