alex-astronomer opened a new issue #21072:
URL: https://github.com/apache/airflow/issues/21072


   ### Apache Airflow version
   
   2.1.4
   
   ### What happened
   
   SLA miss handling is firing notifications (Slack messages, as defined by 
the sla_miss_callback), but every call to sla_miss_callback sends notifications 
for the same set of tasks.  It appears that the notification_sent flag in the 
database is never set to true.  This happens when a large number of SLA misses 
need to be processed at the same time.
   
   The use case is backfilling a frequently scheduled DAG whose start date is 
roughly one month in the past.  This produces around 14k SLA misses that all 
need to be processed at the same time.
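   
   To give a sense of scale, here is a rough estimate of how a backfill window 
turns into SLA misses.  The schedule interval is not stated in this issue; the 
3-minute interval and the single SLA-bearing task per run below are assumptions 
chosen only because they land in the same order of magnitude as the ~14k figure.

```python
from datetime import timedelta


def expected_sla_misses(backfill_window, schedule_interval, sla_tasks_per_run=1):
    """Estimate SLA misses if every backfilled run misses its SLA.

    All three parameters are illustrative; none of the values used below
    come from the original report.
    """
    # timedelta // timedelta yields the whole number of runs in the window.
    runs = backfill_window // schedule_interval
    return runs * sla_tasks_per_run


# Assumed: 30-day backfill, one run every 3 minutes, one task with an SLA.
estimate = expected_sla_misses(timedelta(days=30), timedelta(minutes=3))
print(estimate)
```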
   
   ### What you expected to happen
   
   We expected sla_miss_callback to be called and, by the end of SLA 
management, for the affected SLA misses to no longer need processing.  Each SLA 
miss should be handled once and not picked up again on later passes.
   
   We found the root cause of this issue.  The DAGFileProcessor times out 
before the transaction that sets notification_sent = True for the SLA misses is 
committed to the database.  This is a somewhat unusual "in-between" case: the 
timeout is long enough for sla_miss_callback to run, but not long enough for 
all of the flags to be updated in the database.  As a result, the same SLA 
misses are processed over and over again every time SLAs are managed.
   
   The offending line in the code base is the commit call at the end of 
manage_slas.  When we try to commit the changes to all 14k records, the 
DAGFileProcessor times out in the middle of that call.
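   
   The failure mode described above can be sketched outside Airflow.  The 
following is a minimal simulation using the standard-library sqlite3 module, 
not Airflow's actual code: the `sla_miss` table is reduced to two columns, the 
`ProcessorTimeout` exception and the `budget` parameter are hypothetical 
stand-ins for the processor being killed, and committing in batches is shown 
only as one possible mitigation, not as the project's fix.

```python
import sqlite3


class ProcessorTimeout(Exception):
    """Hypothetical stand-in for the file processor hitting its timeout."""


def make_db(n):
    """Create an in-memory table with n unsent SLA misses (simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sla_miss ("
        "id INTEGER PRIMARY KEY, "
        "notification_sent INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany("INSERT INTO sla_miss (id) VALUES (?)", [(i,) for i in range(n)])
    conn.commit()
    return conn


def mark_sent(conn, batch_size, budget):
    """Flag rows as notified, committing after every batch_size rows.

    budget is the number of row updates allowed before the simulated timeout.
    """
    ids = [r[0] for r in conn.execute(
        "SELECT id FROM sla_miss WHERE notification_sent = 0 ORDER BY id")]
    done = 0
    try:
        for start in range(0, len(ids), batch_size):
            for row_id in ids[start:start + batch_size]:
                if done >= budget:
                    raise ProcessorTimeout
                conn.execute(
                    "UPDATE sla_miss SET notification_sent = 1 WHERE id = ?",
                    (row_id,))
                done += 1
            conn.commit()  # progress up to here survives a later timeout
    except ProcessorTimeout:
        conn.rollback()  # uncommitted updates are lost, as when a process dies


def pending(conn):
    """Count SLA misses that would be picked up again on the next pass."""
    return conn.execute(
        "SELECT COUNT(*) FROM sla_miss WHERE notification_sent = 0").fetchone()[0]


# One commit at the very end: the timeout discards every update.
single = make_db(14000)
mark_sent(single, batch_size=14000, budget=10000)

# Commit every 500 rows: only the unfinished tail remains pending.
batched = make_db(14000)
mark_sent(batched, batch_size=500, budget=10000)

print(pending(single), pending(batched))
```

   With a single commit at the end, every flag is still unset after the 
simulated timeout, so the next pass would notify for the same rows again.  With 
batched commits, each pass makes durable progress even if it is cut short.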
   
   ### How to reproduce
   
   Generate many SLA misses all at once.  This can be done by setting the start 
date of a DAG in the past and scheduling it to run frequently.  Then, once 
manage_slas is called, all of the SLA misses are processed at the same time, 
causing a pile-up in the system.
   
   Next, the timeout has to be tuned just right so that sla_miss_callback runs 
but the transaction is not committed to the database.  The exact value depends 
on the system the reproduction is running on.
   
   ### Operating System
   
   macOS Big Sur 11.3.1
   
   ### Versions of Apache Airflow Providers
   
   n/a
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

