I know that with a scheduler restart, tasks may still report as running
even though they are not.

On Wed, Aug 9, 2017 at 6:07 PM, David Klosowski <dav...@thinknear.com>
wrote:

> Hi Gerard,
>
> The interesting thing is that we didn't see this issue in 1.7.1.3 but we
> did when upgrading to 1.8.0.
>
> We aren't seeing any timeout on the task in question, to be quite honest.
> The state of the task never changes, and we have reasonable timeouts on our
> tasks that would notify us.  The task is in fact "stuck" without reporting
> any status.  There are other cases where tasks do fail and then go into the
> retry state, which we see normally (this happens quite a bit for us on
> deploys).  There is clearly some edge case here where the failure -> retry
> transition does not happen and the dagrun never updates.
>
> What we do see are timeouts on Sensors that depend on those tasks, and
> we've added SLAs to some of our important tasks to surface issues earlier.
>
> Does anyone know where this code lives?  Is that a function of the
> dagrun_timeout?
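>
> For context, here's roughly how we attach those timeouts and SLAs (a
> simplified sketch, not our real DAG; the dag/task names and durations
> below are made up):
>
> from datetime import datetime, timedelta
> from airflow import DAG
> from airflow.operators.bash_operator import BashOperator
>
> # dagrun_timeout marks a long-running DagRun as failed; it does not kill
> # the individual task instances.
> dag = DAG(
>     dag_id='example_dag',                      # hypothetical name
>     start_date=datetime(2017, 8, 1),
>     schedule_interval='@daily',
>     dagrun_timeout=timedelta(hours=2),
> )
>
> # execution_timeout fails the task if it runs longer than the limit;
> # sla records/notifies an SLA miss if it finishes later than schedule + sla.
> t1 = BashOperator(
>     task_id='important_task',                  # hypothetical name
>     bash_command='echo run',
>     execution_timeout=timedelta(minutes=30),
>     sla=timedelta(hours=1),
>     dag=dag,
> )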
>
> Thanks.
>
> Regards,
> David
>
> On Mon, Aug 7, 2017 at 1:30 PM, Gerard Toonstra <gtoons...@gmail.com>
> wrote:
>
> > Hi David,
> >
> > When tasks are put on the MQ, they are out of the control of the
> > scheduler. The scheduler puts the state of that task instance in "queued".
> >
> > What happens next (a simplified sketch follows below the list):
> >
> > 1. A worker picks up the task and tries to run it.
> > 2. Before executing, it runs a few final checks against the DB to see
> >    whether the task instance should still run at the moment the worker
> >    picks it up (another worker could have processed it, started
> >    processing it, etc.).
> > 3. The worker puts the state of the TI in "running".
> > 4. The worker does the work as described in the operator.
> > 5. The worker then updates the database with success or failed.
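> >
> > Purely as illustration (this is not the actual Airflow code, just a
> > sketch of steps 2-5; the function and names here are made up):
> >
> > # Illustrative pseudo-implementation of the worker-side flow above.
> > def run_task_instance(ti, session):
> >     # 2. Final checks against the DB: should this TI still run?
> >     ti.refresh_from_db(session=session)
> >     if ti.state not in ('queued', None):
> >         return                      # someone else ran it / is running it
> >
> >     # 3. Claim it: mark the TI as running.
> >     ti.state = 'running'
> >     session.merge(ti)
> >     session.commit()
> >
> >     try:
> >         # 4. Do the actual work described by the operator.
> >         ti.task.execute(context={})
> >         ti.state = 'success'
> >     except Exception:
> >         ti.state = 'failed'         # or up_for_retry, depending on retries
> >     finally:
> >         # 5. Record the outcome. If the container dies before this commit,
> >         #    the row stays "running" forever -- the "stuck" case below.
> >         session.merge(ti)
> >         session.commit()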
> >
> > If you kill the docker container doing the execution before it has
> > updated the state to success or failed, you end up in a situation where a
> > timeout must occur before airflow can tell whether the task failed or not.
> > This is because the worker is claiming to be processing the message, but
> > that worker/task got killed.
> >
> > It is actually the task instance that updates the database, so if you
> > leave that container running, it will possibly finish and update the db.
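> >
> > If you want to see whether that row ever gets updated, you can query the
> > metadata DB directly (a sketch; assumes airflow is importable on the host
> > and you're on 1.8-style models):
> >
> > from airflow import settings
> > from airflow.models import TaskInstance
> > from airflow.utils.state import State
> >
> > session = settings.Session()
> > # Task instances still marked running; if the worker that owned them is
> > # gone, these are the candidates for being "stuck".
> > stuck = (session.query(TaskInstance)
> >                 .filter(TaskInstance.state == State.RUNNING)
> >                 .all())
> > for ti in stuck:
> >     print('%s %s %s %s' % (ti.dag_id, ti.task_id,
> >                            ti.execution_date, ti.hostname))
> > session.close()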
> >
> >
> > The task results are also communicated back to the executors and there's
> > a check to see if the results agree.
> >
> > You can find this code in models.py / TaskInstance.run() and in whichever
> > Executor you are using (under airflow/executors).
> >
> >
> > The reason this happens, I think, is that docker doesn't really care
> > what's running at the moment; it assumes 'services', where interruptions
> > are acceptable because services are retried all the time anyway. In an
> > environment like airflow, there's a persistent backend database that
> > doesn't automatically retry, because everything is driven through the
> > scheduler, which only sees a "RUNNING" record in the database.
> >
> > How to deal with this depends on your situation. If you run only short
> > running tasks (up to 5 mins), you could drain the task queue by stopping
> > the scheduler first. That way no new messages are sent to the queue, so
> > after 10 mins or so you should have no tasks running on any workers.
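> >
> > If you go that route, a rough drain check before replacing the workers
> > could look like this (a sketch, not battle-tested; assumes airflow is
> > importable where you run it):
> >
> > import time
> > from airflow import settings
> > from airflow.models import TaskInstance
> > from airflow.utils.state import State
> >
> > def wait_for_drain(timeout_secs=600, poll_secs=30):
> >     # Block until no task instances are queued or running, or give up.
> >     session = settings.Session()
> >     deadline = time.time() + timeout_secs
> >     try:
> >         while time.time() < deadline:
> >             busy = (session.query(TaskInstance)
> >                            .filter(TaskInstance.state.in_(
> >                                [State.QUEUED, State.RUNNING]))
> >                            .count())
> >             if busy == 0:
> >                 return True
> >             time.sleep(poll_secs)
> >         return False
> >     finally:
> >         session.close()
> >
> > # Stop the scheduler first, then replace the workers only once
> > # wait_for_drain() returns True.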
> >
> > Another way is to update the database in between, but I'd personally
> > avoid that as much as you can.
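> >
> > For completeness, the "update the database" option would look roughly
> > like this -- last resort only, with the scheduler stopped and after
> > you've confirmed the worker really is gone (a sketch, use at your own
> > risk; the dag/task ids are placeholders):
> >
> > from airflow import settings
> > from airflow.models import TaskInstance
> > from airflow.utils.state import State
> >
> > session = settings.Session()
> > ti = (session.query(TaskInstance)
> >              .filter(TaskInstance.dag_id == 'example_dag',      # placeholder
> >                      TaskInstance.task_id == 'important_task',  # placeholder
> >                      TaskInstance.state == State.RUNNING)
> >              .first())
> > if ti:
> >     ti.state = None    # NULL state, so the scheduler treats it as not run
> >     session.merge(ti)
> >     session.commit()
> > session.close()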
> >
> >
> > Not sure if anyone wants to chime in here on how to best deal with this
> > in docker?
> >
> > Rgds,
> >
> > Gerard
> >
> >
> > On Mon, Aug 7, 2017 at 8:21 PM, David Klosowski <dav...@thinknear.com>
> > wrote:
> >
> > > Hi Airflow Dev List:
> > >
> > > Has anyone had cases where tasks get "stuck"?  What I mean by "stuck" is
> > > that tasks show as running in the Airflow UI but never actually run
> > > (and dependent tasks will eventually time out).
> > >
> > > This only happens during our deployments, when we replace all the hosts
> > > in our stack (3 workers and 1 host with the scheduler + webserver +
> > > flower) with a dockerized deployment.  We've been deploying to the worker
> > > hosts after the scheduler + webserver + flower host.
> > >
> > > It also doesn't occur all the time, which is a bit frustrating to try
> > > to debug.
> > >
> > > We have the following settings:
> > >
> > > > celery_result_backend = Postgres
> > > > sql_alchemy_conn = Postgres
> > > > broker_url = Redis
> > > > executor = CeleryExecutor
> > >
> > > Any thoughts from anyone regarding known issues or observed problems?  I
> > > haven't seen a JIRA on this after looking through the Airflow JIRA.
> > >
> > > Thanks.
> > >
> > > Regards,
> > > David
> > >
> >
>
