SCrocky opened a new issue, #56750:
URL: https://github.com/apache/airflow/issues/56750

   ### Apache Airflow version
   
   3.1.0
   
   ### If "Other Airflow 2/3 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   ### Asset scheduling behaviors
   
   Asset-Event-triggered DAGs behave in one of 3 different ways:

   1. A single Asset Event triggers a single DAG Run
   2. Multiple Asset Events trigger a single DAG Run
   3. Asset Events that haven't triggered a DAG Run but are older than the last run are silently ignored
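
   The distinction can be sketched as a toy model (purely illustrative, not Airflow's actual scheduler code): a new run is only created when a run slot is free, and it consumes *every* pending event at once, so the batch size depends on how fast slots free up relative to event arrival.

```python
from collections import deque

def scheduler_cycle(pending_events, active_runs, max_active_runs):
    """Toy model of one scheduler pass (NOT Airflow internals).

    If a run slot is free and events are pending, one new DAG Run is
    created and it consumes all pending events at once.
    """
    if pending_events and active_runs < max_active_runs:
        consumed = list(pending_events)
        pending_events.clear()
        return consumed  # the batch this single DAG Run consumes
    return None  # no slot free: events keep piling up

# Fast scheduler with free slots: each event is picked up alone -> behavior 1
pending = deque()
runs = []
for event in ["e1", "e2", "e3"]:
    pending.append(event)
    batch = scheduler_cycle(pending, active_runs=0, max_active_runs=10)
    if batch:
        runs.append(batch)
print(runs)   # [['e1'], ['e2'], ['e3']]

# max_active_runs=1 (or a slow scheduler): events pile up while the slot
# is occupied, then a single run consumes the whole batch -> behavior 2
pending = deque(["e1", "e2", "e3"])
batch = scheduler_cycle(pending, active_runs=0, max_active_runs=1)
print(batch)  # ['e1', 'e2', 'e3']
```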
   
   ### How to make Assets behave differently

   To force behaviors 2 & 3, one can set `max_active_runs=1`; every time the DAG runs, it will "consume" (via either behavior 2 or 3) all available Asset Events.
   
   To force behavior 1, one must set `max_active_runs` to a high value and hope that Asset Events are not generated faster than the scheduler runs (or else we fall back into behavior 2).
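
   As a concrete sketch (Airflow 3.x `airflow.sdk` imports; the DAG id and asset URI are made up for illustration), the only knob being changed between the two setups is `max_active_runs`:

```python
from airflow.sdk import DAG, Asset, task

my_asset = Asset("s3://bucket/table")  # example asset URI

# max_active_runs=1 -> each run consumes every queued Asset Event
# (behaviors 2/3); a high value -> one run per event (behavior 1),
# but only if the scheduler keeps up with event arrival.
with DAG(
    dag_id="asset_consumer",
    schedule=[my_asset],
    max_active_runs=1,  # flip to e.g. 10 to observe behavior 1
):
    @task
    def process():
        ...

    process()
```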
   
   It is important to note that the `catchup` argument does not seem to affect this mechanism in any way.
   
   ### The main Issue
   
   The main issue here is:
   
   ### Asset Event scheduling behaves very differently depending on DAG parallelism & Airflow Scheduler performance

   These things should be unrelated, and as far as I can tell, this behavior is undocumented.
   
   
   ### Linked Issues
   
   Other issues that would likely be solved by addressing this issue:
   
   https://github.com/apache/airflow/issues/56749 (UI changes)
   https://github.com/apache/airflow/issues/53896 (distinct DAG Run per Asset Event)
   https://github.com/apache/airflow/issues/50890 (want catchup on Assets)
   https://github.com/apache/airflow/issues/56691 (distinct DAG Run per Asset Event)
   https://github.com/apache/airflow/issues/56050 (`max_active_runs=1` changes behavior)
   https://github.com/apache/airflow/issues/55956 (force separate Events)
   
   Unclear issues that may be related:
   
   https://github.com/apache/airflow/issues/56541 ? (unclear)
   https://github.com/apache/airflow/issues/42015 ? (unclear)
   
   ### What you think should happen instead?
   
   In my professional setting, we use both behavior 1 (for event-based scheduling) and behaviors 2 & 3 (for table refreshes). Check out my [Talk from Airflow Summit 2025](https://airflowsummit.org/sessions/2025/multi-instance-asset-synchronization-push-or-pull/) for more details.
   
   So I suggest we make the Asset Event DAG-triggering behavior configurable at the DAG level.
   
   For example, by adding an `asset_grouping` argument:

   - if `asset_grouping=True` then we have behavior 2
   - if `asset_grouping=False` then we have behavior 1
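
   A purely hypothetical sketch of what this could look like (`asset_grouping` does not exist today; it is the argument proposed above):

```python
from airflow.sdk import DAG, Asset

logs = Asset("s3://bucket/logs")  # made-up asset for illustration

# asset_grouping is the *proposed* argument, not an existing one
with DAG(dag_id="per_event_consumer", schedule=[logs], asset_grouping=False):
    ...  # behavior 1: one DAG Run per Asset Event

with DAG(dag_id="batch_consumer", schedule=[logs], asset_grouping=True):
    ...  # behavior 2: one DAG Run consumes all pending Asset Events
```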
   
   Behavior 3 is a bug in my opinion and should never happen.
   I've put more info on Asset Event attribution in [this issue](https://github.com/apache/airflow/issues/56749).
   
   I also suggest we rename `catchup` to `time_interval_catchup` or something similar, so that it is clear it does not apply to Asset-Event-based scheduling.
   
   And we should document all this stuff.
   
   ### How to reproduce
   
   To reproduce, simply upload the following DAGs to a brand-new Airflow instance:
   
   
[check_dataset_sync.py](https://github.com/user-attachments/files/22961935/check_dataset_sync.py)
   
   Make sure to use a DB other than SQLite so you can compare the difference between `max_active_runs=1` and `max_active_runs=10`.
   
   Then use the `airflow standalone` command.
   Turn all the DAGs on. 
   
   
   You should obtain the following DAGs:
   
   
![Image](https://github.com/user-attachments/assets/6e5af6fb-e3eb-4c42-981e-9ed7e6256042)
   
   And manually trigger the asset generator DAG once.
   
   
![Image](https://github.com/user-attachments/assets/6519d3f8-cbbd-4b7b-b961-16e34160d549)
   
   You will then see that the non-parallel DAGs only trigger twice, and the 
parallel DAG triggers 4-5 times, depending on scheduler frequency.
   
   You can check the logs to see how many Asset Events each DAG is consuming:
   
   
![Image](https://github.com/user-attachments/assets/fb0cf001-5896-4a9a-bf40-c4f540127d19)
   
   You can also do similar tests for Event Driven Asset Events:
   
   
[event_scheduling_test.py](https://github.com/user-attachments/files/22961969/event_scheduling_test.py)
   
   But be sure to add your DAGs folder to the `PYTHONPATH`: `export PYTHONPATH=$AIRFLOW_HOME/dags`
   
   ### Operating System
   
   Ubuntu 24
   
   ### Versions of Apache Airflow Providers
   
   ```
   apache-airflow-providers-common-compat   1.7.3
   apache-airflow-providers-common-io       1.6.2
   apache-airflow-providers-common-sql      1.27.5
   apache-airflow-providers-postgres        6.2.3
   apache-airflow-providers-smtp            2.2.0
   apache-airflow-providers-standard        1.6.0
   ```
   
   ### Deployment
   
   Virtualenv installation
   
   ### Deployment details
   
   Using postgres for the Airflow DB
   
   ### Anything else?
   
   @cmarteepants I've finally gotten around to making this issue as previously 
discussed.
   
   Let me know if everything is clear and understandable.
   
   @uranusjr enjoy ;)
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

