Gollum999 opened a new issue #22133:
URL: https://github.com/apache/airflow/issues/22133


   ### Apache Airflow version
   
   2.2.3
   
   ### What happened
   
   When triggering a DAG manually (via the web UI or via `airflow dags trigger`), some template params like `ds`, `ts`, and others derived from `dag_run.logical_date` will be set to the specified execution timestamp.  This is inconsistent with automated runs, where those fields are set to `data_interval_start`.  This behavior contradicts the documentation in a few places, and can cause tasks that depend on those template params to behave unintuitively.
   
   ### What you expected to happen
   
   I expected `ds` to always equal `data_interval_start`.  Quoting the docs in 
a few different places (emphasis mine):
   
   [DAG Runs: Data 
Interval](https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html#data-interval)
   > The “logical date” (also called `execution_date` in Airflow versions prior 
to 2.2) of a DAG run, for example, **denotes the start of the data interval**, 
not when the DAG is actually executed.
   
   [FAQ: What does `execution_date` 
mean?](https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-does-execution-date-mean)
   > Note that `ds` (**the YYYY-MM-DD form of `data_interval_start`**) refers 
to date *string*, not date *start* as may be confusing to some.
   
   However, it's worth noting that [DAGs: Running 
DAGs](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#running-dags)
 *does* seem to explain this edge case:
   > For example, if a DAG run is manually triggered by the user, its logical 
date would be the date and time of which the DAG run was triggered, and the 
value should be equal to DAG run’s start date. However, when the DAG is being 
automatically scheduled, with certain schedule interval put in place, the 
logical date is going to indicate the time at which it marks the start of the 
data interval, where the DAG run’s start date would then be the logical date + 
scheduled interval.
   
   ### How to reproduce
   
   Example DAG:
   ```python
   #!/usr/bin/env python3
   from datetime import datetime
   
   from airflow import DAG
   from airflow.operators.bash import BashOperator
   
   
   default_args = {
       'retries': 0,
   }
   with DAG(
           'test_dag',
           default_args=default_args,
           schedule_interval='@weekly',
           start_date=datetime(2022, 1, 1),
           catchup=False,
   ) as dag:
       BashOperator(task_id='task', bash_command="""echo "
           ds: {{ ds }}
           prev_ds: {{ prev_ds }}
           next_ds: {{ next_ds }}
           ts: {{ ts }}
           execution_date: {{ execution_date }}
           data_interval_start: {{ data_interval_start }}
           data_interval_end: {{ data_interval_end }}
           dag_run.logical_date {{ dag_run.logical_date }}
       "
       """)
   ```
   Trigger this DAG via the web UI or via `airflow dags trigger test_dag -e <some timestamp>`, then look at the output in the logs.
   
   Example output for an automated run:
   ```
   [2022-03-08, 10:31:21 CST] {subprocess.py:89} INFO -         ds: 2022-02-27
   [2022-03-08, 10:31:21 CST] {subprocess.py:89} INFO -         prev_ds: 2022-02-20
   [2022-03-08, 10:31:21 CST] {subprocess.py:89} INFO -         next_ds: 2022-03-06
   [2022-03-08, 10:31:21 CST] {subprocess.py:89} INFO -         ts: 2022-02-27T00:00:00+00:00
   [2022-03-08, 10:31:21 CST] {subprocess.py:89} INFO -         execution_date: 2022-02-27T00:00:00+00:00
   [2022-03-08, 10:31:21 CST] {subprocess.py:89} INFO -         data_interval_start: 2022-02-27T00:00:00+00:00
   [2022-03-08, 10:31:21 CST] {subprocess.py:89} INFO -         data_interval_end: 2022-03-06T00:00:00+00:00
   [2022-03-08, 10:31:21 CST] {subprocess.py:89} INFO -         dag_run.logical_date 2022-02-27 00:00:00+00:00
   ```
   Example output for a manually-triggered run:
   ```
   [2022-03-08, 10:31:27 CST] {subprocess.py:89} INFO -         ds: 2022-03-08
   [2022-03-08, 10:31:27 CST] {subprocess.py:89} INFO -         prev_ds: 2022-03-08
   [2022-03-08, 10:31:27 CST] {subprocess.py:89} INFO -         next_ds: 2022-03-08
   [2022-03-08, 10:31:27 CST] {subprocess.py:89} INFO -         ts: 2022-03-08T22:23:58+00:00
   [2022-03-08, 10:31:27 CST] {subprocess.py:89} INFO -         execution_date: 2022-03-08T22:23:58+00:00
   [2022-03-08, 10:31:27 CST] {subprocess.py:89} INFO -         data_interval_start: 2022-02-27T00:00:00+00:00
   [2022-03-08, 10:31:27 CST] {subprocess.py:89} INFO -         data_interval_end: 2022-03-06T00:00:00+00:00
   [2022-03-08, 10:31:27 CST] {subprocess.py:89} INFO -         dag_run.logical_date 2022-03-08 22:23:58+00:00
   ```
   
   ### Operating System
   
   CentOS 7.4
   
   ### Versions of Apache Airflow Providers
   
   Only the defaults.
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   Just running processes locally.
   
   ### Anything else
   
   I'm not convinced that this is just a documentation issue; the fact that 
`logical_date` and all derived fields can have contextually different meanings 
seems fundamentally broken to me.  To keep my users from running into issues, I 
feel like I am forced to teach them either "never use `ds`/`ts`/etc." or "never 
trigger DAGs manually", neither of which feels great.
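   One partial workaround I've considered (my own sketch, not something the docs recommend) is to tell users to template on `data_interval_start`/`data_interval_end` directly rather than the `ds`/`ts` shorthands, since those two fields keep their interval meaning even for manual runs, as the reproduction output above shows:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    'test_dag_interval_only',  # hypothetical DAG id
    schedule_interval='@weekly',
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    # data_interval_start stays interval-based even for manual runs,
    # unlike ds/ts, which follow dag_run.logical_date
    BashOperator(
        task_id='task',
        bash_command='echo "interval ds: {{ data_interval_start.strftime(\'%Y-%m-%d\') }}"',
    )
```

   This doesn't help with `prev_ds`/`next_ds`, though, so it feels more like a band-aid than a real answer.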
   
   As far as I can tell, there is no way to manually trigger a DAG and have it behave exactly like a "normal" automated run, since `ds` will always fall outside the data interval.  This raises the question: what does it even mean to manually trigger a DAG run when data intervals are involved?  It shouldn't be able to affect the existing schedule, so the current behavior of "snapping" to the latest complete data interval makes sense to me.  But for consistency, I think all `dag_run` fields (except for things like `run_id`) should follow that same behavior.
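   For what it's worth, the "snapping" I have in mind can be modeled in plain Python (a simplified sketch, not Airflow's actual timetable code, and assuming `@weekly` means weeks starting at midnight UTC on Sunday):

```python
from datetime import datetime, timedelta, timezone

def latest_complete_weekly_interval(trigger_time: datetime) -> tuple[datetime, datetime]:
    """Snap a manual trigger time to the latest *complete* @weekly data
    interval, assuming weeks start at midnight UTC on Sunday."""
    midnight = trigger_time.replace(hour=0, minute=0, second=0, microsecond=0)
    # weekday(): Monday=0 ... Sunday=6, so this is days since the last Sunday
    days_since_sunday = (midnight.weekday() + 1) % 7
    interval_end = midnight - timedelta(days=days_since_sunday)
    return interval_end - timedelta(weeks=1), interval_end

# The manual trigger from the logs above
logical_date = datetime(2022, 3, 8, 22, 23, 58, tzinfo=timezone.utc)
start, end = latest_complete_weekly_interval(logical_date)
print("ds (from logical_date):", logical_date.date())  # 2022-03-08
print("data_interval_start:", start.date())            # 2022-02-27
print("data_interval_end:", end.date())                # 2022-03-06
```

   Under that model, `ds` derived from `dag_run.logical_date` can never equal `data_interval_start` for a manual run, which is exactly the inconsistency reported above.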
   
   Alternatively, maybe there are two classes of DAGs: Ones that operate on 
data intervals, and ones that operate on a single instant in time (e.g. 
`schedule_interval=None`).  And perhaps the former should never be manually 
triggered and should only ever use something like `airflow dags backfill` to 
run specific intervals.  And ideally the web and CLI would reflect this to 
prevent running a DAG "the wrong way".
   
   Admittedly I am new to Airflow, so maybe my intuitions are not correct.  And 
I recognize that there are almost certainly some users that depend on the 
current behavior, so it would definitely be a pain to change.  But I'm curious 
to hear if other people have thoughts about this or specific examples of why 
the current behavior is desirable.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

