tomrutter opened a new issue, #32091:
URL: https://github.com/apache/airflow/issues/32091

   ### Apache Airflow version
   
   2.6.2
   
   ### What happened
   
   We are running a DAG with many deferrable tasks using a custom trigger that 
waits for an Azure Batch task to complete. When many tasks have been deferred, 
we see an intermittent error in the triggerer. The logged error message is the 
following:
   
   Exception in thread Thread-2:
   Traceback (most recent call last):
     File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner
       self.run()
     File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 457, in run
       asyncio.run(self.arun())
     File "/usr/local/lib/python3.9/asyncio/runners.py", line 44, in run
       return loop.run_until_complete(main)
     File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
       return future.result()
     File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 470, in arun
       await self.create_triggers()
     File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 492, in create_triggers
       dag_id = task_instance.dag_id
   AttributeError: 'NoneType' object has no attribute 'dag_id'
   
   After this error occurs, the triggerer still reports as healthy, but no 
trigger events fire. Restarting the triggerer fixes the problem.
   
   ### What you think should happen instead
   
   The specific error in the triggerer should be handled so that the 
triggerer async thread does not crash.
   
   The triggerer should not perform heartbeat updates when the async triggerer 
thread has crashed.
   
   ### How to reproduce
   
   This occurs intermittently and seems to be the result of running more than 
one triggerer. With many deferred tasks running, the error eventually 
occurs.
   
   
   
   ### Operating System
   
   linux (standard airflow slim images extended with custom code running on 
kubernetes)
   
   ### Versions of Apache Airflow Providers
   
   postgres,celery,redis,ssh,statsd,papermill,pandas,github_enterprise
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Azure Kubernetes and helm chart 1.9.0.
   2 replicas of both triggerer and scheduler.
   
   ### Anything else
   
   It seems that as triggers fire, the link between the trigger row and the 
associated task_instance for the trigger is removed before the trigger row is 
removed. This leaves a small amount of time where the trigger exists without an 
associated task_instance. The database updates are performed in a synchronous 
loop inside the triggerer, so with one triggerer, this is not a problem. 
However, it can be a problem with more than one triggerer.
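
As a rough illustration of the race described above, here is a minimal sketch of a guard that skips trigger rows whose task instance link has already been cleared. It does not use Airflow's real models: `TriggerRow`, `TaskInstanceRow`, and `select_startable_triggers` are hypothetical stand-ins, and the actual change in the PR may look different.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TaskInstanceRow:
    """Hypothetical stand-in for a task instance record."""
    dag_id: str
    task_id: str


@dataclass
class TriggerRow:
    """Hypothetical stand-in for a trigger record; the link may be None."""
    id: int
    task_instance: Optional[TaskInstanceRow]


def select_startable_triggers(
    rows: List[TriggerRow],
) -> Tuple[List[Tuple[int, str]], List[int]]:
    """Split rows into (id, dag_id) pairs that are safe to start and orphaned ids."""
    startable: List[Tuple[int, str]] = []
    orphaned: List[int] = []
    for row in rows:
        # Another triggerer may have cleared the link just before deleting the
        # row, so a missing task instance is treated as "skip", not as an error.
        if row.task_instance is None:
            orphaned.append(row.id)
            continue
        startable.append((row.id, row.task_instance.dag_id))
    return startable, orphaned


rows = [
    TriggerRow(1, TaskInstanceRow("batch_dag", "wait_for_batch")),
    TriggerRow(2, None),  # link already removed by another triggerer
]
startable, orphaned = select_startable_triggers(rows)
```

With a guard of this kind, the orphaned row is ignored until it is deleted, instead of raising `AttributeError` inside the async thread.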
   
   Also, once the triggerer async loop (that handles the trigger code) fails, 
the triggers no longer fire. However, the heartbeat is handled by the 
synchronous loop so the job still reports as healthy.
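
The heartbeat problem can be sketched with plain `threading`: the synchronous loop checks that the worker thread is still alive before each heartbeat and stops if it is not. This is only an illustration under assumed names (`AsyncRunner`, `supervise`, and the `heartbeat` callback are hypothetical), not the triggerer's actual code.

```python
import threading
import time
from typing import Callable, Optional


class AsyncRunner(threading.Thread):
    """Hypothetical worker thread that records the exception that stopped it."""

    def __init__(self, work: Callable[[], None]) -> None:
        super().__init__(daemon=True)
        self.work = work
        self.error: Optional[BaseException] = None

    def run(self) -> None:
        try:
            self.work()
        except Exception as exc:
            # Keep the failure so the supervising loop can report it.
            self.error = exc


def supervise(
    runner: AsyncRunner,
    heartbeat: Callable[[], None],
    max_beats: int,
    interval: float = 0.01,
) -> int:
    """Send heartbeats only while the worker thread is alive."""
    beats = 0
    while beats < max_beats:
        if not runner.is_alive():
            # Fail loudly instead of continuing to report a healthy job.
            raise RuntimeError("async thread has stopped") from runner.error
        heartbeat()
        beats += 1
        time.sleep(interval)
    return beats
```

In this sketch a crashed worker makes `supervise` raise on the next iteration, so the process can exit and be restarted instead of reporting as healthy with no triggers firing.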
   
   I have included an associated PR to resolve these issues.
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
