tomrutter opened a new issue, #32091:
URL: https://github.com/apache/airflow/issues/32091
### Apache Airflow version
2.6.2
### What happened
We are running a DAG with many deferrable tasks using a custom trigger that
waits for an Azure Batch task to complete. When many tasks have been deferred,
we see an intermittent error in the triggerer. The logged error message is the
following:
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 457, in run
    asyncio.run(self.arun())
  File "/usr/local/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 470, in arun
    await self.create_triggers()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 492, in create_triggers
    dag_id = task_instance.dag_id
AttributeError: 'NoneType' object has no attribute 'dag_id'
After this error occurs, the triggerer still reports as healthy, but no trigger
events are fired. Restarting the triggerer fixes the problem.
### What you think should happen instead
The specific error in the triggerer should be addressed to prevent the
triggerer async thread from crashing.
The triggerer should also stop sending heartbeat updates once the async
triggerer thread has crashed.
### How to reproduce
This occurs intermittently and seems to be the result of running more than
one triggerer. With many deferred tasks running, the error eventually
occurs.
### Operating System
Linux (standard Airflow slim images extended with custom code, running on
Kubernetes)
### Versions of Apache Airflow Providers
postgres,celery,redis,ssh,statsd,papermill,pandas,github_enterprise
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
Azure Kubernetes and Helm chart 1.9.0.
2 replicas of both triggerer and scheduler.
### Anything else
It seems that as triggers fire, the link between the trigger row and the
associated task_instance for the trigger is removed before the trigger row is
removed. This leaves a small amount of time where the trigger exists without an
associated task_instance. The database updates are performed in a synchronous
loop inside the triggerer, so with one triggerer, this is not a problem.
However, it can be a problem with more than one triggerer.
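To illustrate the race described above, here is a minimal sketch of the kind of guard that would avoid the `AttributeError`. It is not the actual Airflow code and not necessarily what the associated PR does: `TriggerRowStub`, `TaskInstanceStub`, and `runnable_triggers` are hypothetical names standing in for the trigger row, its task instance, and the filtering step. The idea is simply to skip any trigger whose task instance link is already gone instead of dereferencing it.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

log = logging.getLogger(__name__)


@dataclass
class TaskInstanceStub:
    """Hypothetical stand-in for a task instance row."""
    dag_id: str
    task_id: str


@dataclass
class TriggerRowStub:
    """Hypothetical stand-in for a trigger row; the link may already be None."""
    id: int
    task_instance: Optional[TaskInstanceStub]


def runnable_triggers(rows: List[TriggerRowStub]) -> List[TriggerRowStub]:
    """Return only the trigger rows that still have a task instance attached."""
    ready = []
    for row in rows:
        if row.task_instance is None:
            # Another triggerer has fired this trigger and unlinked the task
            # instance, but the trigger row has not been deleted yet.
            log.warning("Trigger %s has no task instance; skipping", row.id)
            continue
        ready.append(row)
    return ready


rows = [
    TriggerRowStub(1, TaskInstanceStub("example_dag", "wait_for_batch")),
    TriggerRowStub(2, None),  # orphaned trigger row
]
ready_ids = [row.id for row in runnable_triggers(rows)]
```

With this input, only trigger 1 is kept; the orphaned row is logged and skipped rather than crashing the loop.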
Also, once the triggerer async loop (that handles the trigger code) fails,
the triggers no longer fire. However, the heartbeat is handled by the
synchronous loop so the job still reports as healthy.
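The heartbeat problem can be sketched in the same spirit. This is an assumption about how such a check could look, not the triggerer's real implementation: `HeartbeatGate` is a hypothetical class, and the only library call relied on is the standard `threading.Thread.is_alive()`. The synchronous loop checks that the async worker thread is still alive before recording a heartbeat, so a dead worker surfaces as a failure instead of a healthy job.

```python
import threading


class HeartbeatGate:
    """Hypothetical helper: only heartbeat while the worker thread is alive."""

    def __init__(self, worker: threading.Thread) -> None:
        self.worker = worker
        self.beats = 0

    def heartbeat(self) -> None:
        if not self.worker.is_alive():
            # The async thread has exited (for example after an unhandled
            # exception), so reporting as healthy would be misleading.
            raise RuntimeError("async trigger thread has died; not heartbeating")
        self.beats += 1


# A worker that blocks until released, simulating a healthy async thread.
stop = threading.Event()
worker = threading.Thread(target=stop.wait)
worker.start()

gate = HeartbeatGate(worker)
gate.heartbeat()  # worker alive: heartbeat is recorded

stop.set()
worker.join()  # worker has now exited

try:
    gate.heartbeat()
    refused = False
except RuntimeError:
    refused = True
```

Raising from the synchronous loop is one possible design choice; it lets the process exit so that an orchestrator such as Kubernetes can restart the triggerer, which matches the manual workaround of restarting it.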
I have included an associated PR to resolve these issues.
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)