The Airflow scheduler, flower, and webserver are in one StatefulSet; workers are 
in another StatefulSet.  DAGs are shared between the scheduler and workers via 
NFS; we use the CeleryExecutor.


The issue happens quite often.  There are hundreds of DAGs that run every day.  
Every DAG has failed at least once; most fail at least once every 3 days.  On 
average about 1/3 to 1/2 of all DAG runs fail in a given day.  There is no 
pattern of failure that we can see, other than it looks like Celery loses track 
of the task or the task details in the database get corrupted.
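For reference, this is roughly how we spot the affected runs.  A minimal sketch 
(the column names match Airflow 1.9's task_instance table; the row dicts below 
are stand-ins for real query results, and `looks_lost` is a hypothetical helper, 
not anything from Airflow itself):

```python
# Sketch: flag task_instance rows that look like Airflow "lost track" of them.
# "Lost" here means: marked failed, but the fields the executor normally fills
# in (operator, hostname, job_id) were erased rather than populated.

def looks_lost(row):
    """row: a dict of task_instance columns (stand-in for a DB query result)."""
    return (
        row.get("state") == "failed"
        and row.get("hostname") in (None, "")
        and row.get("operator") in (None, "")
        and row.get("job_id") is None
    )

rows = [
    # a normal failure: the task details stay intact
    {"task_id": "load", "state": "failed", "hostname": "worker-3",
     "operator": "BashOperator", "job_id": 101},
    # the pattern described above: details erased, then marked failed
    {"task_id": "extract", "state": "failed", "hostname": None,
     "operator": None, "job_id": None},
]

lost = [r["task_id"] for r in rows if looks_lost(r)]
print(lost)  # -> ['extract']
```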


There is no obvious error in the output for any of the services: postgres, 
redis, scheduler, flower, or workers.  When we can find the worker logs 
(sometimes we cannot), they usually indicate that the called script runs to 
completion and succeeds.


Not sure where the issue might be.




- John K

________________________________
From: Maxime Beauchemin <[email protected]>
Sent: Monday, March 5, 2018 11:57:16 AM
To: [email protected]
Subject: Re: Airflow loses track of Dag Tasks


Are you using the Kubernetes executor or running Airflow worker(s) inside
persistent pod?

How often does that happen? Does it randomly occur on any task, any pattern
there?

Max

On Fri, Mar 2, 2018 at 7:09 AM, Kamenik, John <[email protected]> wrote:

> I have an Airflow 1.9 cluster set up on Kubernetes, and I have an issue
> where a random DAG task shows as failed because it appears that Airflow has
> lost track of it.  The cluster consists of a database, a redis store, a
> scheduler, and 14 workers.
>
>
> What happens is the task starts as normal, runs, and exits, but instead of
> the status being written, the Operator, Start Date, Job ID, and Hostname are
> erased.  Shortly thereafter an end time is added and the state is set to
> failed.
>
>
> Given that the hostname is erased, I have to brute-force a search to find
> the logs of the worker that executed the task.  If I can find the task logs,
> they indicate the command (BashOperator) ran to completion and exited
> cleanly.  I don't see any errors in the airflow scheduler or any workers
> that would indicate any issues.  I am not sure what else to debug.
>
>
>
>
> - John K
>
