I know that with a scheduler restart, tasks may still report as running even though they are not.
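For what it's worth, a rough sketch of how such task instances could be spotted after a restart, by querying the metadata DB directly. This is not something from this thread: the connection string and the two-hour cutoff are placeholders, and it only assumes the standard Airflow metadata schema (the task_instance table) on Postgres.

    # Rough sketch only: list task instances that still claim to be running
    # long after they started. Connection string and cutoff are placeholders.
    from sqlalchemy import create_engine, text

    SQL_ALCHEMY_CONN = "postgresql+psycopg2://airflow:***@metadata-db/airflow"

    engine = create_engine(SQL_ALCHEMY_CONN)
    stale_running = text("""
        SELECT dag_id, task_id, execution_date, start_date
        FROM task_instance
        WHERE state = 'running'
          AND start_date < now() - interval '2 hours'
        ORDER BY start_date
    """)

    with engine.connect() as conn:
        for row in conn.execute(stale_running):
            print(row)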
On Wed, Aug 9, 2017 at 6:07 PM, David Klosowski <dav...@thinknear.com> wrote:

> Hi Gerard,
>
> The interesting thing is that we didn't see this issue in 1.7.1.3, but we
> did when upgrading to 1.8.0.
>
> To be quite honest, we aren't seeing any timeout on the task in question.
> The state of the task never changes, and we have reasonable timeouts on
> our tasks that would notify us. The task is in fact "stuck" without
> reporting any status. There are other cases where tasks do fail and then
> go into the retry state, which we see normally (this happens quite a bit
> for us on deploys). There is clearly some edge case here where the
> failure -> retry transition does not happen and the dagrun never updates.
>
> What we do see is timeouts on Sensors that depend on those tasks, and
> we've added SLAs to some of our important tasks to catch issues earlier.
>
> Does anyone know where this code lives? Is that a function of
> dagrun_timeout?
>
> Thanks.
>
> Regards,
> David
>
>
> On Mon, Aug 7, 2017 at 1:30 PM, Gerard Toonstra <gtoons...@gmail.com> wrote:
>
> > Hi David,
> >
> > When tasks are put on the MQ, they are out of the control of the
> > scheduler. The scheduler puts the state of that task instance in "queued".
> >
> > What happens next:
> >
> > 1. A worker picks up the task to run and tries to run it.
> > 2. It first executes a couple of checks against the DB prior to
> >    executing it. These are final checks to see whether the task instance
> >    should still run at the moment the worker is about to pick it up
> >    (another worker could have processed it, started processing it, etc.).
> > 3. The worker puts the state of the TI in "running".
> > 4. The worker does the work as described in the operator.
> > 5. The worker then updates the database with fail or success.
> >
> > If you kill the docker container doing the execution before it has
> > updated the state to success or fail, you get into a situation where a
> > timeout must occur before airflow can see whether the task failed or not.
> > This is because the worker is claiming to be processing the message, but
> > that worker/task got killed.
> >
> > It is actually the task instance that updates the database, so if you
> > leave that container running, it will possibly finish and update the db.
> >
> > The task results are also communicated back to the executors, and
> > there's a check to see if the results agree.
> >
> > You can find this code in models.py / TaskInstance / run() and in
> > whichever executor you are using under airflow/executors.
> >
> > The reason why this happens, I think, is that docker doesn't really care
> > what's running at the moment; it assumes 'services', where interruptions
> > are tolerated because services are retried all the time anyway. In an
> > environment like airflow, there's a persistent backend database that
> > doesn't automatically retry, because everything is driven through the
> > scheduler, which only sees a "RUNNING" record in the database.
> >
> > How to deal with this depends on your situation. If you run only
> > short-running tasks (up to 5 mins), you could drain the task queue by
> > stopping the scheduler first. This means no new messages are sent to the
> > queue, so after 10 mins you should have no tasks running on any workers.
> >
> > Another way is to update the database in between, but I'd personally
> > avoid that as much as you can.
> >
> > Not sure if anyone wants to chime in here on how to best deal with this
> > in docker?
> >
> > Rgds,
> >
> > Gerard
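Gerard's drain-the-queue suggestion could be scripted roughly like this: stop the scheduler first, then poll the metadata DB until nothing is left queued or running before replacing the worker containers. Only a sketch; the connection string, poll interval, and overall timeout below are placeholders, not anything from this thread.

    # Sketch of "drain before deploy": run after stopping the scheduler and
    # before replacing worker containers. All values are placeholders.
    import time

    from sqlalchemy import create_engine, text

    SQL_ALCHEMY_CONN = "postgresql+psycopg2://airflow:***@metadata-db/airflow"
    POLL_SECONDS = 30
    DRAIN_TIMEOUT = 15 * 60  # give up after 15 minutes

    engine = create_engine(SQL_ALCHEMY_CONN)
    in_flight = text(
        "SELECT count(*) FROM task_instance WHERE state IN ('queued', 'running')"
    )

    deadline = time.time() + DRAIN_TIMEOUT
    while time.time() < deadline:
        with engine.connect() as conn:
            remaining = conn.execute(in_flight).scalar()
        if remaining == 0:
            print("Queue drained; safe to replace the workers.")
            break
        print("%d task instances still queued/running, waiting..." % remaining)
        time.sleep(POLL_SECONDS)
    else:
        print("Drain timed out; some task instances are still queued/running.")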
> > On Mon, Aug 7, 2017 at 8:21 PM, David Klosowski <dav...@thinknear.com>
> > wrote:
> >
> > > Hi Airflow Dev List:
> > >
> > > Has anyone had cases where tasks get "stuck"? What I mean by "stuck"
> > > is that tasks show as running through the Airflow UI but never
> > > actually run (and dependent tasks will eventually time out).
> > >
> > > This only happens during our deployments, when we replace all the
> > > hosts in our stack (3 workers and 1 host with the scheduler +
> > > webserver + flower) with a dockerized deployment. We've been deploying
> > > to the worker hosts after the scheduler + webserver + flower host.
> > >
> > > It also doesn't occur all the time, which is a bit frustrating to try
> > > to debug.
> > >
> > > We have the following settings:
> > >
> > >   celery_result_backend = Postgres
> > >   sql_alchemy_conn = Postgres
> > >   broker_url = Redis
> > >   executor = CeleryExecutor
> > >
> > > Any thoughts from anyone regarding known issues or observed problems?
> > > I haven't seen a jira on this after looking through the Airflow jira.
> > >
> > > Thanks.
> > >
> > > Regards,
> > > David
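On the SLA / dagrun_timeout point above, setting these at the task and DAG level looks roughly like the sketch below, so stuck tasks surface as SLA misses or timeouts instead of sitting in "running" forever. The DAG name, operator choice, and durations are made up for illustration, and exact timeout behaviour can differ between Airflow versions.

    # Illustrative only: surface "stuck" work sooner via execution_timeout,
    # sla and dagrun_timeout. Names and durations are made up.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        'owner': 'airflow',
        'start_date': datetime(2017, 8, 1),
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    dag = DAG(
        dag_id='example_timeouts',
        default_args=default_args,
        schedule_interval='@hourly',
        # How long a DagRun may run before the scheduler times it out.
        dagrun_timeout=timedelta(hours=2),
    )

    task = BashOperator(
        task_id='do_work',
        bash_command='sleep 10',
        # Fail the task if it runs longer than this.
        execution_timeout=timedelta(minutes=30),
        # Record an SLA miss (and notify) if it hasn't finished by then.
        sla=timedelta(minutes=45),
        dag=dag,
    )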