Charlie created AIRFLOW-6194:
--------------------------------
Summary: Task instances aren't running after meeting dependencies
Key: AIRFLOW-6194
URL: https://issues.apache.org/jira/browse/AIRFLOW-6194
Project: Apache Airflow
Issue Type: Bug
Components: DagRun, executors, scheduler, worker
Affects Versions: 1.10.6
Reporter: Charlie
We recently had an issue arise with our Airflow instance which caused the
scheduler to enter what appears to be a deadlocked state in the middle of
operation. In this state, all DAG runs were listed as 'scheduled' and nothing
at all appeared to be happening.
Initially, I thought this might be an issue with our configuration, but I
couldn't quite track down why this issue wouldn't have arisen earlier and,
looking at the logs, I've been seeing some strange behavior that I can't quite
explain.
The most notable thing is that, for whatever reason, the Executor Class listed
under all of our jobs is now 'NoneType'; previously it was 'LocalExecutor'.
Looking at our logs, this change first happened when we updated our instance,
two days prior to the initial deadlock. However, I have since cleared the
database altogether and find that, even starting from scratch, 'NoneType' still
appears.
In these same logs, I can see jobs continuously running for this DAG run;
however, the start and end times are less than a second apart. At the same time,
all task instances are listed as either 'success' or 'scheduled', so I'm not
entirely sure what the running jobs are.
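For anyone trying to reproduce the check described above, here is a minimal
sketch of how job records could be filtered for the two symptoms (an executor
class of 'NoneType', or a start and end time less than a second apart). It
works on plain dictionaries rather than a live Airflow database; the helper
name `suspicious_jobs`, the field names `executor_class`, `start_date` and
`end_date`, and the sample rows are all assumptions made for illustration and
are not taken from our instance.

```python
from datetime import datetime, timedelta


def suspicious_jobs(jobs, max_duration=timedelta(seconds=1)):
    """Return jobs showing either symptom from the report.

    A job is flagged if its executor class is the string 'NoneType', or if
    it has both a start and an end time that are closer than max_duration.
    Field names are assumed for illustration.
    """
    flagged = []
    for job in jobs:
        start = job.get("start_date")
        end = job.get("end_date")
        too_short = (
            start is not None
            and end is not None
            and (end - start) < max_duration
        )
        if job.get("executor_class") == "NoneType" or too_short:
            flagged.append(job)
    return flagged


# Hypothetical rows, made up purely to exercise the helper.
t0 = datetime(2019, 12, 1, 12, 0, 0)
sample_jobs = [
    {"id": 1, "executor_class": "LocalExecutor",
     "start_date": t0, "end_date": t0 + timedelta(minutes=5)},
    {"id": 2, "executor_class": "NoneType",
     "start_date": t0, "end_date": t0 + timedelta(milliseconds=400)},
    {"id": 3, "executor_class": "LocalExecutor",
     "start_date": t0, "end_date": None},  # still running
    {"id": 4, "executor_class": "NoneType",
     "start_date": t0, "end_date": t0 + timedelta(minutes=2)},
]

flagged_ids = [job["id"] for job in suspicious_jobs(sample_jobs)]
print(flagged_ids)
```

With the made-up rows above, jobs 2 and 4 are flagged. Against a real instance,
the same fields would have to be read from the metadata database first.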
If I look in the Task Instance Details for any of these scheduled tasks, I see:
{code:java}
All dependencies are met but the task instance is not running. In most cases
this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
If this task instance does not start soon please contact your Airflow
administrator for assistance.{code}
Upon viewing the scheduler logs in Airflow, nothing seems awry.
So to summarize, the scheduler seems to be doing its job, as DAG runs are
properly scheduled and set as 'running'; however, the task instances themselves
are not completing properly. Because the jobs list 'NoneType' instead of
'LocalExecutor', my theory is that there is some issue with the LocalExecutor
that is preventing it from properly executing jobs. Again, clearing the
database didn't seem to help, and I now run into this deadlock almost
immediately with a test DAG I'm running.
If I can provide any additional information, please let me know. I'd love to
get this resolved or figured out, as we're currently unable to use Airflow
because of this.
Thanks!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)