In my case the DAG goes into a failed state while one of its tasks is still in the running state.
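A minimal way to confirm that pattern, assuming direct access to the Airflow metadata database (the connection URL below is a placeholder), is to query for DAG runs marked failed that still have running task instances. Table and column names follow the Airflow 1.9 metadata schema:

from sqlalchemy import create_engine, text

# Placeholder connection URL -- point this at your Airflow metadata DB.
engine = create_engine("postgresql://airflow:airflow@localhost:5432/airflow")

QUERY = text("""
    SELECT dr.dag_id, dr.execution_date, ti.task_id, ti.state
    FROM dag_run dr
    JOIN task_instance ti
      ON ti.dag_id = dr.dag_id
     AND ti.execution_date = dr.execution_date
    WHERE dr.state = 'failed'
      AND ti.state = 'running'
""")

with engine.connect() as conn:
    for row in conn.execute(QUERY):
        print(row)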
On 2018/03/07 14:29:31, "Kamenik, John" <jkame...@fourv.com> wrote:
> Nothing specific that I can see.
>
> I saw online that the docker image pins Celery to 4.0.2.
> https://github.com/puckel/docker-airflow.
>
> With the upgrade to Airflow 1.9 we upgraded all packages, including
> Celery; it was 4.1. I have downgraded to Celery 4.0.2 and things appear
> to be more stable.
>
> With Celery 4.0.2 in place I have run 30 copies of the failing DAGs 30
> times each, all in parallel, and there were 2 failures total. So I
> wouldn't say the issue is fixed completely, but it is certainly less bad
> than with Celery 4.1: down from a 33% failure rate to a 0.22% failure
> rate.
>
> - John K
>
> ________________________________
> From: Maxime Beauchemin <maximebeauche...@gmail.com>
> Sent: Wednesday, March 7, 2018 1:20:05 AM
> To: dev@airflow.incubator.apache.org
> Subject: Re: Airflow loses track of Dag Tasks
>
> Notice: This message originated outside of SRC.
>
> Anything else specific? I heard SubDags can have issues under certain
> conditions; is this happening inside SubDags? Has anybody else in the
> community experienced anything like this on 1.9?
>
> On Mon, Mar 5, 2018 at 9:13 AM, Kamenik, John <jkame...@fourv.com> wrote:
>
> > The Airflow scheduler, flower, and webserver are within a statefulset.
> > Workers are in another statefulset. DAGs are shared between the
> > scheduler and workers via NFS; we use the CeleryExecutor.
> >
> > The issue happens quite often. There are hundreds of DAGs that run
> > every day. Every DAG has failed at least once; most fail at least once
> > every 3 days. On average about 1/3 to 1/2 of all DAG runs fail in a
> > given day. There is no pattern of failure that we can see, other than
> > it looks like Celery loses track of the task or the task details in
> > the database get corrupted.
> >
> > There is no obvious error in the output of any of the services:
> > postgres, redis, scheduler, flower, or workers. If we do find the
> > worker logs (sometimes we cannot), they usually indicate that the
> > called script runs to completion and succeeds.
> >
> > Not sure where the issue might be.
> >
> > - John K
> >
> > ________________________________
> > From: Maxime Beauchemin <maximebeauche...@gmail.com>
> > Sent: Monday, March 5, 2018 11:57:16 AM
> > To: dev@airflow.incubator.apache.org
> > Subject: Re: Airflow loses track of Dag Tasks
> >
> > Notice: This message originated outside of SRC.
> >
> > Are you using the Kubernetes executor, or running Airflow worker(s)
> > inside a persistent pod?
> >
> > How often does that happen? Does it randomly occur on any task? Any
> > pattern there?
> >
> > Max
> >
> > On Fri, Mar 2, 2018 at 7:09 AM, Kamenik, John <jkame...@fourv.com> wrote:
> >
> > > I have an Airflow 1.9 cluster set up on Kubernetes, and I have an
> > > issue where a random DAG task shows as failed because it appears
> > > that Airflow has lost track of it. The cluster consists of a
> > > database, a redis store, a scheduler, and 14 workers.
> > >
> > > What happens is the task starts as normal, runs, and exits, but
> > > instead of its status being written, the Operator, Start Date, Job
> > > ID, and Hostname are erased. Shortly thereafter an end time is added
> > > and the state is set to failed.
> > >
> > > Given the hostname is erased, I have to brute-force find the logs of
> > > the worker that executed the task (see the sketch below). If I can
> > > find the task logs, they indicate that the command (a BashOperator)
> > > ran to completion and exited cleanly.
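A rough sketch of that brute-force log hunt, assuming the Airflow 1.9 default log layout ({base_log_folder}/{dag_id}/{task_id}/{execution_date}/{try_number}.log); the base path, DAG ID, and task ID below are placeholders, and it would need to run on each worker (or point at a shared mount) since the hostname column is empty:

import os

BASE_LOG_FOLDER = "/usr/local/airflow/logs"  # placeholder; match your base_log_folder

def find_task_logs(dag_id, task_id, execution_date):
    # Assumes the Airflow 1.9 default layout:
    # {base_log_folder}/{dag_id}/{task_id}/{execution_date}/{try_number}.log
    task_dir = os.path.join(BASE_LOG_FOLDER, dag_id, task_id, execution_date)
    if not os.path.isdir(task_dir):
        return []
    return [os.path.join(task_dir, name) for name in sorted(os.listdir(task_dir))]

for path in find_task_logs("my_dag", "my_task", "2018-03-02T00:00:00"):
    print(path)
    with open(path) as fh:
        # The last few lines show whether the command exited cleanly.
        print("".join(fh.readlines()[-5:]))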
> > > I don't see any errors in the Airflow scheduler or any workers that
> > > would indicate any issues. I am not sure what else to debug.
> > >
> > > - John K
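For the Celery downgrade discussed up-thread, a quick sanity check (a sketch, assuming you want to verify the pin took effect on every node) is to print and assert the Celery version on each scheduler and worker:

import celery

EXPECTED = "4.0.2"  # the version the puckel/docker-airflow image pins
print("celery", celery.__version__)
assert celery.__version__ == EXPECTED, (
    "node is running celery %s, expected %s" % (celery.__version__, EXPECTED)
)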