This is an automated email from the ASF dual-hosted git repository.
potiuk pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow.git
The following commit(s) were added to refs/heads/main by this push:
new 02ce3a0238 Edited timetable docs (#38505)
02ce3a0238 is described below
commit 02ce3a0238ef0c170b2121a3ce80b77f54fa1071
Author: Laura Zdanski <[email protected]>
AuthorDate: Thu Apr 4 07:00:09 2024 -0400
Edited timetable docs (#38505)
---------
Co-authored-by: Collin McNulty <[email protected]>
Co-authored-by: Tzu-ping Chung <[email protected]>
---
.../authoring-and-scheduling/timetable.rst | 135 +++++++++++----------
1 file changed, 72 insertions(+), 63 deletions(-)
diff --git a/docs/apache-airflow/authoring-and-scheduling/timetable.rst
b/docs/apache-airflow/authoring-and-scheduling/timetable.rst
index 78234910a6..9ca7d436db 100644
--- a/docs/apache-airflow/authoring-and-scheduling/timetable.rst
+++ b/docs/apache-airflow/authoring-and-scheduling/timetable.rst
@@ -19,9 +19,9 @@
Timetables
==========
-For DAGs with time-based schedules (as opposed to event-driven), the scheduling
-decisions are driven by its internal "timetable". The timetable also
-determines the data interval and the logical date of each run created for the
DAG.
+For a DAG with a time-based schedule (as opposed to event-driven), the DAG's
internal "timetable"
+drives scheduling. The timetable also determines the data interval and the
logical date of
+each run created for the DAG.
DAGs scheduled with a cron expression or ``timedelta`` object are
internally converted to always use a timetable.
@@ -29,39 +29,39 @@ internally converted to always use a timetable.
If a cron expression or ``timedelta`` is sufficient for your use case, you
don't need
to worry about writing a custom timetable because Airflow has default
timetables that handle those cases.
But for more complicated scheduling requirements,
-you may create your own timetable class and pass that to the DAG's
``schedule`` argument.
+you can create your own timetable class and pass that to the DAG's
``schedule`` argument.
-Here are some examples of when custom timetable implementations are useful:
+Some examples of when custom timetable implementations are useful:
-* Data intervals with "holes" between. (Instead of continuous, as both the cron
- expression and ``timedelta`` schedules represent.)
-* Run tasks at different times each day. For example, an astronomer may find it
+* Task runs that occur at different times each day. For example, an astronomer
might find it
useful to run a task at dawn to process data collected from the previous
night-time period.
-* Schedules not following the Gregorian calendar. For example, create a run for
+* Schedules that don't follow the Gregorian calendar. For example, create a
run for
each month in the `Traditional Chinese Calendar`_. This is conceptually
- similar to the sunset case above, but for a different time scale.
-* Rolling windows, or overlapping data intervals. For example, one may want to
+ similar to the sunrise case, but for a different time scale.
+* Rolling windows, or overlapping data intervals. For example, you might want
to
have a run each day, but make each run cover the period of the previous seven
- days. It is possible to "hack" this with a cron expression, but a custom data
- interval would be a more natural representation.
+ days. It is possible to hack this with a cron expression, but a custom data
+ interval provides a more natural representation.
+* Data intervals with "holes" between intervals instead of a continuous
interval, as both the cron
+ expression and ``timedelta`` schedules represent continuous intervals. See
:ref:`data-interval`.
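
The rolling-window case in the list above can be pictured with plain ``datetime`` arithmetic. This is only a sketch of the interval math, not a timetable implementation: the helper name ``rolling_interval``, the seven-day default, and the sample dates are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def rolling_interval(run_time, window=timedelta(days=7)):
    """Return the (start, end) data interval for a run at ``run_time``.

    Hypothetical helper: each run covers the ``window`` that ends at the
    moment the run is scheduled.
    """
    return run_time - window, run_time


# Three consecutive daily runs (sample dates are illustrative).
first_run = datetime(2024, 2, 8, tzinfo=timezone.utc)
intervals = [rolling_interval(first_run + timedelta(days=n)) for n in range(3)]

for start, end in intervals:
    print(start.date(), "->", end.date())
```

Each interval starts before the previous one ends, which is the overlap that a cron expression or ``timedelta`` schedule does not express directly.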
.. _`Traditional Chinese Calendar`:
https://en.wikipedia.org/wiki/Chinese_calendar
-As such, Airflow allows for custom timetables to be written in plugins and
used by
-DAGs. An example demonstrating a custom timetable can be found in the
+Airflow allows you to write custom timetables in plugins and use them in
+DAGs. You can find an example demonstrating a custom timetable in the
:doc:`/howto/timetable` how-to guide.
.. note::
- As a general rule, always access Variables, Connections etc or anything
that would access
+ As a general rule, always access Variables, Connections, or anything else
that needs access to
the database as late as possible in your code. See
:ref:`best_practices/timetables`
for more best practices to follow.
Built-in Timetables
-------------------
-Airflow comes with several common timetables built in to cover the most common
use cases. Additional timetables
+Airflow comes with several common timetables built in to cover the most common
use cases. Additional timetables
may be available in plugins.
.. _CronTriggerTimetable:
@@ -82,9 +82,8 @@ A timetable that accepts a cron expression, and triggers DAG
runs according to i
def example_dag():
pass
-It is also possible to provide a static data interval to the timetable. The
optional ``interval`` argument
-must be a :class:`datetime.timedelta` or
``dateutil.relativedelta.relativedelta``. If given, a triggered DAG
-run's data interval would span the specified duration, and *ends* with the
trigger time.
+You can also provide a static data interval to the timetable. The optional
``interval`` argument
+must be a :class:`datetime.timedelta` or
``dateutil.relativedelta.relativedelta``. When you provide this argument, a
triggered DAG run's data interval spans the specified duration and *ends* with
the trigger time.
.. code-block:: python
@@ -111,11 +110,11 @@ run's data interval would span the specified duration,
and *ends* with the trigg
DeltaDataIntervalTimetable
^^^^^^^^^^^^^^^^^^^^^^^^^^
-Schedules data intervals with a time delta. Can be selected by providing a
+A timetable that schedules data intervals with a time delta. You can select it
by providing a
:class:`datetime.timedelta` or ``dateutil.relativedelta.relativedelta`` to the
``schedule`` parameter of a DAG.
-This timetable is more focused on the data interval value and does not
necessarily align execution dates with
-arbitrary bounds such as start of day or of hour.
+This timetable focuses on the data interval value and does not necessarily
align execution dates with
+arbitrary bounds, such as the start of day or of hour.
.. seealso:: `Differences between the cron and delta data interval timetables`_
@@ -136,8 +135,8 @@ trigger points, and triggers a DAG run at the end of each
data interval.
.. seealso:: `Differences between the two cron timetables`_
.. seealso:: `Differences between the cron and delta data interval timetables`_
-This can be selected by providing a string that is a valid cron expression to
the ``schedule``
-parameter of a DAG as described in the :doc:`../core-concepts/dags`
documentation.
+Select this timetable by providing a valid cron expression as a string to the
``schedule``
+parameter of a DAG, as described in the :doc:`../core-concepts/dags`
documentation.
.. code-block:: python
@@ -148,13 +147,13 @@ parameter of a DAG as described in the
:doc:`../core-concepts/dags` documentatio
EventsTimetable
^^^^^^^^^^^^^^^
-Simply pass a list of ``datetime``\s for the DAG to run after. Useful for
timing based on sporting
-events, planned communication campaigns, and other schedules that are
arbitrary and irregular but predictable.
+Pass a list of ``datetime``\s for the DAG to run after. This can be useful for
timing based on sporting
+events, planned communication campaigns, and other schedules that are
arbitrary and irregular, but predictable.
-The list of events must be finite and of reasonable size as it must be loaded
every time the DAG is parsed. Optionally,
-the ``restrict_to_events`` flag can be used to force manual runs of the DAG to
use the time of the most recent (or very
-first) event for the data interval, otherwise manual runs will run with a
``data_interval_start`` and
-``data_interval_end`` equal to the time at which the manual run was begun. You
can also name the set of events using the
+The list of events must be finite and of reasonable size as it must be loaded
every time the DAG is parsed. Optionally, use
+the ``restrict_to_events`` flag to force manual runs of the DAG to use the
time of the most recent, or very
+first, event for the data interval. Otherwise, manual runs begin with a
``data_interval_start`` and
+``data_interval_end`` equal to the time at which the manual run started. You
can also name the set of events using the
``description`` parameter, which will be displayed in the Airflow UI.
.. code-block:: python
@@ -181,9 +180,9 @@ first) event for the data interval, otherwise manual runs
will run with a ``data
Dataset event based scheduling with time based scheduling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Combining conditional dataset expressions with time-based schedules enhances
scheduling flexibility:
+Combining conditional dataset expressions with time-based schedules enhances
scheduling flexibility.
-The ``DatasetOrTimeSchedule`` is a specialized timetable allowing for the
scheduling of DAGs based on both time-based schedules and dataset events. It
facilitates the creation of scheduled runs (as per traditional timetables) and
dataset-triggered runs, which operate independently.
+The ``DatasetOrTimeSchedule`` is a specialized timetable that allows for the
scheduling of DAGs based on both time-based schedules and dataset events. It
also facilitates the creation of both scheduled runs, as per traditional
timetables, and dataset-triggered runs, which operate independently.
This feature is particularly useful in scenarios where a DAG needs to run on
dataset updates and also at periodic intervals. It ensures that the workflow
remains responsive to data changes and consistently runs regular checks or
updates.
@@ -210,52 +209,62 @@ Here's an example of a DAG using
``DatasetOrTimeSchedule``:
Timetables comparisons
----------------------
-
.. _Differences between the two cron timetables:
Differences between the two cron timetables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-There are two timetables `CronTriggerTimetable`_ and
`CronDataIntervalTimetable`_ that accepts a cron expression.
-There are some differences between the two:
-- `CronTriggerTimetable`_ does not take care of *Data Interval*, while
`CronDataIntervalTimetable`_ does.
-- The time when a DAG run is triggered by `CronTriggerTimetable`_ is more
intuitive and more similar to what people
-expect cron to behave than that of `CronDataIntervalTimetable`_ (when
``catchup`` is ``False``).
+Airflow has two timetables, `CronTriggerTimetable`_ and
`CronDataIntervalTimetable`_, that accept a cron expression.
+
+However, there are differences between the two:
+- `CronTriggerTimetable`_ does not address *Data Interval*, while
`CronDataIntervalTimetable`_ does.
+- The timestamp in the ``run_id``, which is the ``logical_date``, is defined
differently for `CronTriggerTimetable`_ and `CronDataIntervalTimetable`_ based on
how they handle the data interval, as described in
:ref:`timetables_run_id_logical_date`.
Whether taking care of *Data Interval*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-`CronTriggerTimetable`_ *does not* care the idea of *data interval*. It means
the value of ``data_interval_start``,
-``data_interval_end`` and legacy ``execution_date`` are the same - the time
when a DAG run is triggered.
+`CronTriggerTimetable`_ *does not* include *data interval*. This means that
the values of ``data_interval_start`` and
+``data_interval_end`` (and the legacy ``execution_date``) are the same: the
time when a DAG run is triggered.
+
+However, `CronDataIntervalTimetable`_ *does* include *data interval*. This
means the values of
+``data_interval_start`` and ``data_interval_end`` (and legacy
``execution_date``) are different. ``data_interval_start`` is the start of the
+interval, and ``data_interval_end`` is the end of the interval, which is when
the DAG run is triggered.
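
As a rough illustration of the two behaviors, the sketch below computes the interval fields for a single trigger time. It assumes a fixed 24-hour ``@daily`` period (ignoring time zone and daylight-saving details), and the helper names are hypothetical, not Airflow APIs. It relies on the statement earlier in this file that the data interval timetable triggers a DAG run at the end of each data interval.

```python
from datetime import datetime, timedelta, timezone

DAILY = timedelta(days=1)  # assumed fixed period for an ``@daily`` schedule


def trigger_style_interval(trigger_time):
    """CronTriggerTimetable-style: start and end are both the trigger time."""
    return trigger_time, trigger_time


def data_interval_style_interval(trigger_time, period=DAILY):
    """CronDataIntervalTimetable-style: the run fires when its interval ends."""
    return trigger_time - period, trigger_time


trigger = datetime(2024, 2, 1, tzinfo=timezone.utc)  # illustrative trigger time
print(trigger_style_interval(trigger))
print(data_interval_style_interval(trigger))
```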
+
+*Catchup* behavior
+^^^^^^^^^^^^^^^^^^
+
+Whether you're using `CronTriggerTimetable`_ or `CronDataIntervalTimetable`_,
there is no difference when ``catchup`` is ``True``.
-On the other hand, `CronDataIntervalTimetable`_ *does* care the idea of *data
interval*. It means the value of
-``data_interval_start`` and ``data_interval_end`` (and legacy
``execution_date``) are different. They are the start
-and end of the interval respectively.
+You might want to set ``catchup`` to ``False`` in certain scenarios to
prevent unnecessary DAG runs:
+- If you create a new DAG with a start date in the past and don't want to
create DAG runs for the past. If ``catchup`` is ``True``, Airflow creates all
the DAG runs that would have occurred in that time interval.
+- If you pause an existing DAG, and then restart it at a later date, and don't
want to If ``catchup`` is ``True``,
+
+In these scenarios, the ``logical_date`` in the ``run_id`` is based on how
`CronTriggerTimetable`_ or `CronDataIntervalTimetable`_ handles the data
interval.
+
+See :ref:`dag-catchup` for more information about how DAG runs are triggered
when using ``catchup``.
+
+.. _timetables_run_id_logical_date:
The time when a DAG run is triggered
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-There is no difference between the two when ``catchup`` is ``True``.
:ref:`dag-catchup` tells you how DAG runs are
-triggered when ``catchup`` is ``True``.
+`CronTriggerTimetable`_ and `CronDataIntervalTimetable`_ trigger DAG runs at
the same time. However, the timestamp for the ``run_id`` is different for each.
+
+- `CronTriggerTimetable`_ has a ``run_id`` timestamp, the ``logical_date``,
showing when the DAG run is able to start.
+- `CronTriggerTimetable`_ and `CronDataIntervalTimetable`_ trigger DAG runs at
the same time. However, the timestamp for the ``run_id`` (``logical_date``) is
different for each.
-When ``catchup`` is ``False``, there is difference in how a new DAG run is
triggered. `CronTriggerTimetable`_ triggers
-a new DAG run *after* the current time, while `CronDataIntervalTimetable`_
does *before* the current time (assuming
-the value of ``start_date`` is past time).
+For example, suppose there is a cron expression ``@daily`` or ``0 0 * * *``,
which is scheduled to run at 12AM every day. If you enable DAGs using the two
timetables at 3PM on January
+31st,
+- `CronTriggerTimetable`_ triggers a new DAG run at 12AM on February 1st. The
``run_id`` timestamp is midnight, on February 1st.
+- `CronDataIntervalTimetable`_ immediately triggers a new DAG run, because the
DAG run that was due at 12AM on January 31st, for the daily data interval
beginning at 12AM on January 30th, did not occur yet. The ``run_id`` timestamp is
midnight, on January 30th, since that is the beginning of the data interval.
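
The enable-at-3PM scenario can be traced with a few lines of ``datetime`` arithmetic. This is a simplified model, not Airflow's scheduler logic: it assumes ``catchup`` is ``False``, a start date in the past, naive datetimes, and the year 2024, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta


def previous_midnight(moment):
    """Most recent ``0 0 * * *`` trigger point at or before ``moment``."""
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)


def next_midnight(moment):
    """First ``0 0 * * *`` trigger point strictly after ``moment``."""
    return previous_midnight(moment) + timedelta(days=1)


enabled_at = datetime(2024, 1, 31, 15, 0)  # 3PM on January 31st

# CronTriggerTimetable-style: wait for the next trigger point.
trigger_style_run = next_midnight(enabled_at)

# CronDataIntervalTimetable-style: the run that was due at the most recent
# trigger point has not happened yet, so it starts as soon as the DAG is enabled.
missed_trigger_point = previous_midnight(enabled_at)
data_interval_style_run = enabled_at

print(trigger_style_run, missed_trigger_point, data_interval_style_run)
```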
-Here is an example showing how the first DAG run is triggered. Supposes there
is a cron expression ``@daily`` or
-``0 0 * * *``, which is aimed to run at 12AM every day. If you enable DAGs
using the two timetables at 3PM on January
-31st, `CronTriggerTimetable`_ will trigger a new DAG run at 12AM on February
1st. `CronDataIntervalTimetable`_, on the other
-hand, will immediately trigger a new DAG run which is supposed to trigger at
12AM on January 31st if the DAG had been
-enabled beforehand.
+This is another example showing the difference in the case of skipping DAG
runs.
-This is another example showing the difference in the case of skipping DAG
runs. Suppose there are two running DAGs
-using the two timetables with a cron expression ``@daily`` or ``0 0 * * *``.
If you pause the DAGs at 3PM on January
-31st and re-enable them at 3PM on February 2nd, `CronTriggerTimetable`_ skips
the DAG runs which are supposed to
-trigger on February 1st and 2nd. The next DAG run will be triggered at 12AM on
February 3rd. `CronDataIntervalTimetable`_,
-on the other hand, skips the DAG runs which are supposed to trigger on
February 1st only. A DAG run for February 2nd
-is immediately triggered after you re-enable the DAG.
+Suppose there are two running DAGs with a cron expression ``@daily`` or ``0 0
* * *`` that use the two different timetables. If you pause the DAGs at 3PM on
January 31st and re-enable them at 3PM on February 2nd,
+- `CronTriggerTimetable`_ skips the DAG runs that were supposed to trigger on
February 1st and 2nd. The next DAG run will be triggered at 12AM on February
3rd.
+- `CronDataIntervalTimetable`_ skips the DAG runs that were supposed to
trigger on February 1st only. A DAG run for February 2nd is immediately
triggered after you re-enable the DAG.
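
The pause-and-resume scenario can be sketched the same way by listing the trigger points that fall inside the pause. Again, this is a simplified model under the assumptions of ``catchup`` set to ``False``, naive datetimes, and the year 2024, and ``midnights_between`` is a hypothetical helper rather than an Airflow function.

```python
from datetime import datetime, timedelta


def midnights_between(start, end):
    """List the ``0 0 * * *`` trigger points in the half-open range (start, end]."""
    point = start.replace(hour=0, minute=0, second=0, microsecond=0) + timedelta(days=1)
    points = []
    while point <= end:
        points.append(point)
        point += timedelta(days=1)
    return points


paused_at = datetime(2024, 1, 31, 15, 0)  # 3PM on January 31st
resumed_at = datetime(2024, 2, 2, 15, 0)  # 3PM on February 2nd

missed = midnights_between(paused_at, resumed_at)

# CronTriggerTimetable-style: every missed trigger point is skipped.
trigger_style_skipped = missed
trigger_style_next = missed[-1] + timedelta(days=1)

# CronDataIntervalTimetable-style: only the most recent missed trigger point
# runs (immediately on resume); earlier ones are skipped.
data_interval_style_skipped = missed[:-1]
data_interval_style_immediate = missed[-1]

print(trigger_style_next, data_interval_style_immediate)
```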
-By these examples, you see how `CronTriggerTimetable`_ triggers DAG runs is
more intuitive and more similar to what
+In these examples, you see how `CronTriggerTimetable`_ triggers DAG runs is
more intuitive and more similar to what
people expect cron to behave than how `CronDataIntervalTimetable`_ does.
@@ -265,8 +274,8 @@ Differences between the cron and delta data interval
timetables:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Choosing between `DeltaDataIntervalTimetable`_ and
`CronDataIntervalTimetable`_ depends on your use case.
-If you enable a DAG at 01:05 on February 1st, the following table summarizes
the DAG runs created (and the
-data interval that they cover), depending on 3 arguments: ``schedule``,
``start_date`` and ``catchup``.
+If you enable a DAG at 01:05 on February 1st, the following table summarizes
the DAG runs created and the
+data interval that they cover, depending on 3 arguments: ``schedule``,
``start_date`` and ``catchup``.
.. list-table::
:header-rows: 1