This is an automated email from the ASF dual-hosted git repository.

uranusjr pushed a commit to branch aip-39-docs
in repository https://gitbox.apache.org/repos/asf/airflow.git

commit bce29bc40eaece4cb3d2a569a4f1984898867f5d
Author: Tzu-ping Chung <[email protected]>
AuthorDate: Wed Aug 11 22:29:08 2021 +0800

    WIP
---
 docs/apache-airflow/concepts/dags.rst   | 14 ++++++--
 docs/apache-airflow/dag-run.rst         | 32 ++++++++++++-----
 docs/apache-airflow/howto/index.rst     |  1 +
 docs/apache-airflow/howto/timetable.rst | 63 +++++++++++++++++++++++++++++++++
 4 files changed, 99 insertions(+), 11 deletions(-)

diff --git a/docs/apache-airflow/concepts/dags.rst b/docs/apache-airflow/concepts/dags.rst
index c564ef8..acbf24c 100644
--- a/docs/apache-airflow/concepts/dags.rst
+++ b/docs/apache-airflow/concepts/dags.rst
@@ -148,14 +148,24 @@ The ``schedule_interval`` argument takes any value that is a valid `Crontab <htt
     with DAG("my_daily_dag", schedule_interval="0 0 * * *"):
         ...
 
-Every time you run a DAG, you are creating a new instance of that DAG which Airflow calls a :doc:`DAG Run </dag-run>`. DAG Runs can run in parallel for the same DAG, and each has a defined ``execution_date``, which identifies the *logical* date and time it is running for - not the *actual* time when it was started.
+.. tip::
+
+    For more information on ``schedule_interval`` values, see :doc:`DAG Run </dag-run>`.
+
+    If ``schedule_interval`` is not enough to express the DAG's schedule, see :doc:`Timetables </howto/timetable>`.
+
+Every time you run a DAG, you are creating a new instance of that DAG which Airflow calls a :doc:`DAG Run </dag-run>`. DAG Runs can run in parallel for the same DAG, and each has a defined data interval, which identifies the *logical* date and time range it is running for - not the *actual* time when it was started.
 
 As an example of why this is useful, consider writing a DAG that processes a daily set of experimental data. It's been rewritten, and you want to run it on the previous 3 months of data - no problem, since Airflow can *backfill* the DAG and run copies of it for every day in those previous 3 months, all at once.
 
-Those DAG Runs will all have been started on the same actual day, but their ``execution_date`` values will cover those last 3 months, and that's what all the tasks, operators and sensors inside the DAG look at when they run.
+Those DAG Runs will all have been started on the same actual day, but their data intervals will cover those last 3 months, and that's what all the tasks, operators and sensors inside the DAG look at when they run.
 
 In much the same way a DAG instantiates into a DAG Run every time it's run, Tasks specified inside a DAG also instantiate into :ref:`Task Instances <concepts:task-instances>` along with it.
 
+.. seealso::
+
+    :doc:`Data Intervals <./data-interval>`
+
 
 DAG Assignment
 --------------
diff --git a/docs/apache-airflow/dag-run.rst b/docs/apache-airflow/dag-run.rst
index 5d47a0b..6bbe5e0 100644
--- a/docs/apache-airflow/dag-run.rst
+++ b/docs/apache-airflow/dag-run.rst
@@ -54,17 +54,31 @@ Cron Presets
 Your DAG will be instantiated for each schedule along with a corresponding
 DAG Run entry in the database backend.
 
-.. note::
+Data Interval
+-------------
+
+Each DAG run in Airflow has an assigned "data interval" that represents the time
+range it operates in. For a DAG scheduled with ``@daily``, for example, each of
+its data intervals starts at midnight of a given day and ends at midnight of
+the next day.
+
+A DAG run happens *after* its associated data interval has ended, to ensure the
+run is able to collect all the actual data within the time period. Therefore, a
+run covering the data period of 2020-01-01 will not start running until
+2020-01-01 has ended, i.e. from 2020-01-02 onwards.
+
+All dates in Airflow are tied to the data interval concept in some way. The
+"logical date" (called ``execution_date`` in previous Airflow versions) of a
+DAG run, for example, usually denotes the start of the data interval, not when
+the DAG is actually executed. Similarly, since the ``start_date`` argument for
+the DAG and its tasks points to the same logical date, a run is only created
+after that data interval ends. So a DAG with an ``@daily`` schedule and a
+``start_date`` of 2020-01-01, for example, will not have its first run created
+until 2020-01-02.
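+
+As a minimal sketch of that timing (the DAG id and task below are hypothetical):
+
+.. code-block:: python
+
+    from datetime import datetime
+
+    from airflow import DAG
+    from airflow.operators.dummy import DummyOperator
+
+    with DAG(
+        "data_interval_example",  # hypothetical DAG id
+        schedule_interval="@daily",
+        start_date=datetime(2020, 1, 1),
+    ) as dag:
+        DummyOperator(task_id="noop")
+
+    # The first run has logical date 2020-01-01 and covers the data interval
+    # 2020-01-01T00:00 to 2020-01-02T00:00, but the scheduler only creates
+    # and starts it once 2020-01-02 has begun.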
 
-    If you run a DAG on a schedule_interval of one day, the run stamped 2020-01-01
-    will be triggered soon after 2020-01-01T23:59. In other words, the job instance is
-    started once the period it covers has ended.  The ``execution_date`` available in the context
-    will also be 2020-01-01.
+.. tip::
 
-    The first DAG Run is created based on the minimum ``start_date`` for the tasks in your DAG.
-    Subsequent DAG Runs are created by the scheduler process, based on your DAG’s ``schedule_interval``,
-    sequentially. If your start_date is 2020-01-01 and schedule_interval is @daily, the first run
-    will be created on 2020-01-02 i.e., after your start date has passed.
+    If ``schedule_interval`` is not enough to express your DAG's schedule,
+    logical date, or data interval, see :doc:`Customizing Timetables </howto/timetable>`.
 
 Re-run DAG
 ''''''''''
diff --git a/docs/apache-airflow/howto/index.rst b/docs/apache-airflow/howto/index.rst
index efd5c48..9fb80fb 100644
--- a/docs/apache-airflow/howto/index.rst
+++ b/docs/apache-airflow/howto/index.rst
@@ -33,6 +33,7 @@ configuring an Airflow environment.
     set-config
     set-up-database
     operator/index
+    timetable
     customize-state-colors-ui
     customize-dag-ui-page-instance-name
     custom-operator
diff --git a/docs/apache-airflow/howto/timetable.rst b/docs/apache-airflow/howto/timetable.rst
new file mode 100644
index 0000000..2c9ebb3
--- /dev/null
+++ b/docs/apache-airflow/howto/timetable.rst
@@ -0,0 +1,63 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+
+Customizing DAG Scheduling with Timetables
+==========================================
+
+A DAG's scheduling strategy is determined by its internal "timetable". This
+timetable can be created by specifying the DAG's ``schedule_interval`` argument,
+as described in :doc:`DAG Run </dag-run>`. The timetable also dictates the data
+interval and the logical time of each run created for the DAG.
+
+However, there are situations when a cron expression or a simple ``timedelta``
+period cannot properly express the schedule. Some examples are:
+
+* Data intervals with "holes" between them. (Instead of continuous, as both the
+  cron expression and ``timedelta`` schedules represent.)
+* Run tasks at different times each day. For example, an astronomer may find it
+  useful to run a task at each sunset, to process data collected during the
+  previous daylight period.
+* Schedules not following the Gregorian calendar. For example, create a run for
+  each month in the `Traditional Chinese Calendar`_. This is conceptually
+  similar to the sunset case above, but for a different time scale.
+* Rolling windows, or overlapping data intervals. For example, one may want to
+  have a run each day, but make each run cover the period of the previous seven
+  days. It is possible to "hack" this with a cron expression, but a custom data
+  interval would make the task specification more natural.
+
+.. _`Traditional Chinese Calendar`: https://en.wikipedia.org/wiki/Chinese_calendar
+
+
+For our example, let's say a company wants to run a job after each weekday, to
+process data collected during the work day. The first intuitive answer
+to this would be ``schedule_interval="0 0 * * 1-5"`` (midnight on Monday to
+Friday), but this means data collected on Friday will *not* be processed right
+after Friday, but on the next Monday, and that run's interval would be from
+midnight Friday to midnight *Monday*.
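+
+To make the gap concrete, here is a small illustration using ``croniter`` (the
+library Airflow relies on to evaluate cron expressions); the dates are
+arbitrary:
+
+.. code-block:: python
+
+    from datetime import datetime
+
+    from croniter import croniter
+
+    # Occurrences of "0 0 * * 1-5" around a weekend: after the one at
+    # midnight Friday, the next does not come until midnight Monday, so
+    # the run covering Friday's data spans the whole weekend.
+    it = croniter("0 0 * * 1-5", datetime(2021, 8, 13))  # Friday 2021-08-13
+    print(it.get_next(datetime))  # 2021-08-16 00:00:00, the following Monday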
+
+This is, therefore, a case of the "holes" category; the intended schedule should
+skip the two weekend days. What we want is:
+
+* Schedule a run for each Monday, Tuesday, Wednesday, Thursday, and Friday. The
+  run's data interval would cover from the midnight of each day to the midnight
+  of the next day.
+* Each run would be created right after the data interval ends. The run covering
+  Monday happens at midnight on Tuesday, and so on. The run covering Friday
+  happens at midnight on Saturday. No runs happen at midnight on Sunday or
+  Monday.
+
+TODO...
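+
+Below is a minimal, non-authoritative sketch of how such a timetable could
+look, written against the in-development ``Timetable`` interface in
+``airflow.timetables.base``. The hook names (``next_dagrun_info`` and
+``infer_manual_data_interval``), the ``DagRunInfo``, ``DataInterval``, and
+``TimeRestriction`` helpers, and the plugin registration shown here are all
+assumptions that may change before this feature is released:
+
+.. code-block:: python
+
+    from datetime import timedelta
+    from typing import Optional
+
+    from pendulum import Date, DateTime, Time, timezone
+
+    from airflow.plugins_manager import AirflowPlugin
+    from airflow.timetables.base import (
+        DagRunInfo,
+        DataInterval,
+        TimeRestriction,
+        Timetable,
+    )
+
+    UTC = timezone("UTC")
+
+
+    class AfterWorkdayTimetable(Timetable):
+        """Schedule a run after each workday, skipping weekends."""
+
+        def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
+            # A manually triggered run covers the most recent workday.
+            start = DateTime.combine(
+                run_after.date() - timedelta(days=1), Time.min
+            ).replace(tzinfo=UTC)
+            while start.weekday() > 4:  # 5 is Saturday, 6 is Sunday.
+                start -= timedelta(days=1)
+            return DataInterval(start=start, end=start + timedelta(days=1))
+
+        def next_dagrun_info(
+            self,
+            *,
+            last_automated_data_interval: Optional[DataInterval],
+            restriction: TimeRestriction,
+        ) -> Optional[DagRunInfo]:
+            if last_automated_data_interval is not None:
+                # A previous run exists; consider the day after its start.
+                next_start = last_automated_data_interval.start + timedelta(days=1)
+            else:
+                # First ever run; anchor on the DAG's start_date.
+                if restriction.earliest is None:
+                    return None  # No start_date, so never schedule.
+                next_start = DateTime.combine(
+                    restriction.earliest.date(), Time.min
+                ).replace(tzinfo=UTC)
+                if not restriction.catchup:
+                    # Without catchup, never schedule earlier than today.
+                    today = DateTime.combine(Date.today(), Time.min).replace(tzinfo=UTC)
+                    next_start = max(next_start, today)
+            while next_start.weekday() > 4:  # Skip weekend days entirely.
+                next_start += timedelta(days=1)
+            if restriction.latest is not None and next_start > restriction.latest:
+                return None  # Past the DAG's end_date, so stop scheduling.
+            # The run covering this day is created once the interval ends.
+            return DagRunInfo.interval(start=next_start, end=next_start + timedelta(days=1))
+
+
+    class WorkdayTimetablePlugin(AirflowPlugin):
+        # Timetables are assumed to require registration through a plugin.
+        name = "workday_timetable_plugin"
+        timetables = [AfterWorkdayTimetable]
+
+Assuming the registration above works as sketched, a DAG would then pass a
+timetable instance through a ``timetable`` argument instead of
+``schedule_interval``:
+
+.. code-block:: python
+
+    from datetime import datetime
+
+    from airflow import DAG
+
+    with DAG(
+        "process_workday_data",  # hypothetical DAG id
+        start_date=datetime(2021, 8, 1),
+        timetable=AfterWorkdayTimetable(),
+    ) as dag:
+        ...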
