This is an automated email from the ASF dual-hosted git repository.

uranusjr pushed a commit to branch aip-39-docs
in repository https://gitbox.apache.org/repos/asf/airflow.git
commit bce29bc40eaece4cb3d2a569a4f1984898867f5d
Author: Tzu-ping Chung <[email protected]>
AuthorDate: Wed Aug 11 22:29:08 2021 +0800

    WIP
---
 docs/apache-airflow/concepts/dags.rst   | 14 ++++++--
 docs/apache-airflow/dag-run.rst         | 32 ++++++++++++-----
 docs/apache-airflow/howto/index.rst     |  1 +
 docs/apache-airflow/howto/timetable.rst | 63 +++++++++++++++++++++++++++++++++
 4 files changed, 99 insertions(+), 11 deletions(-)

diff --git a/docs/apache-airflow/concepts/dags.rst b/docs/apache-airflow/concepts/dags.rst
index c564ef8..acbf24c 100644
--- a/docs/apache-airflow/concepts/dags.rst
+++ b/docs/apache-airflow/concepts/dags.rst
@@ -148,14 +148,24 @@ The ``schedule_interval`` argument takes any value that is a valid `Crontab <htt
 
     with DAG("my_daily_dag", schedule_interval="0 0 * * *"):
         ...
 
-Every time you run a DAG, you are creating a new instance of that DAG which Airflow calls a :doc:`DAG Run </dag-run>`. DAG Runs can run in parallel for the same DAG, and each has a defined ``execution_date``, which identifies the *logical* date and time it is running for - not the *actual* time when it was started.
+.. tip::
+
+    For more information on ``schedule_interval`` values, see :doc:`DAG Run </dag-run>`.
+
+    If ``schedule_interval`` is not enough to express the DAG's schedule, see :doc:`Timetables </howto/timetable>`.
+
+Every time you run a DAG, you are creating a new instance of that DAG, which Airflow calls a :doc:`DAG Run </dag-run>`. DAG Runs can run in parallel for the same DAG, and each has a defined data interval, which identifies the *logical* date and time range it is running for - not the *actual* time when it was started.
 
 As an example of why this is useful, consider writing a DAG that processes a daily set of experimental data. It's been rewritten, and you want to run it on the previous 3 months of data - no problem, since Airflow can *backfill* the DAG and run copies of it for every day in those previous 3 months, all at once.
 
-Those DAG Runs will all have been started on the same actual day, but their ``execution_date`` values will cover those last 3 months, and that's what all the tasks, operators and sensors inside the DAG look at when they run.
+Those DAG Runs will all have been started on the same actual day, but their data intervals will cover those last 3 months, and that's what all the tasks, operators and sensors inside the DAG look at when they run.
 
 In much the same way a DAG instantiates into a DAG Run every time it's run, Tasks specified inside a DAG also instantiate into :ref:`Task Instances <concepts:task-instances>` along with it.
 
+.. seealso::
+
+    :doc:`Data Intervals <./data-interval>`
+
 DAG Assignment
 --------------
diff --git a/docs/apache-airflow/dag-run.rst b/docs/apache-airflow/dag-run.rst
index 5d47a0b..6bbe5e0 100644
--- a/docs/apache-airflow/dag-run.rst
+++ b/docs/apache-airflow/dag-run.rst
@@ -54,17 +54,31 @@ Cron Presets
 
 Your DAG will be instantiated for each schedule along with a corresponding DAG Run entry in the database backend.
 
-.. note::
+Data Interval
+-------------
+
+Each DAG run in Airflow has an assigned "data interval" that represents the
+time range it operates in. For a DAG scheduled with ``@daily``, for example,
+each of its data intervals would start at midnight of each day and end at
+midnight of the next day.
+
+A DAG run happens *after* its associated data interval has ended, to ensure
+the run is able to collect all the actual data within the time period.
+Therefore, a run covering the data period of 2020-01-01 will not start
+until 2020-01-01 has ended, i.e. from 2020-01-02 onwards.
+
+All dates in Airflow are tied to the data interval concept in some way. The
+"logical date" (called ``execution_date`` in previous Airflow versions) of a
+DAG run, for example, usually denotes the start of the data interval, not
+when the DAG is actually executed. Similarly, since the ``start_date``
+argument for the DAG and its tasks points to the same logical date, a run
+will only be created after its data interval ends. A DAG with an ``@daily``
+schedule and a ``start_date`` of 2020-01-01, for example, will thus not have
+its first run created until 2020-01-02.
 
-    If you run a DAG on a schedule_interval of one day, the run stamped 2020-01-01
-    will be triggered soon after 2020-01-01T23:59. In other words, the job instance is
-    started once the period it covers has ended. The ``execution_date`` available in the context
-    will also be 2020-01-01.
+.. tip::
 
-    The first DAG Run is created based on the minimum ``start_date`` for the tasks in your DAG.
-    Subsequent DAG Runs are created by the scheduler process, based on your DAG’s ``schedule_interval``,
-    sequentially. If your start_date is 2020-01-01 and schedule_interval is @daily, the first run
-    will be created on 2020-01-02 i.e., after your start date has passed.
+    If ``schedule_interval`` is not enough to express your DAG's schedule,
+    logical date, or data interval, see
+    :doc:`Customizing Timetables </howto/timetable>`.
 
 Re-run DAG
 ''''''''''
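To make the data interval and logical date behaviour described above concrete, consider a minimal DAG sketch like the following. It assumes the ``data_interval_start`` and ``data_interval_end`` template context variables being introduced alongside this AIP-39 work; the DAG id and dates are purely illustrative:

.. code-block:: python

    import pendulum

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def process_day(data_interval_start, data_interval_end, **kwargs):
        # For the run covering 2020-01-01, this prints the interval from
        # 2020-01-01T00:00:00+00:00 to 2020-01-02T00:00:00+00:00, even though
        # the run itself only starts once that interval ends on 2020-01-02.
        print(f"processing {data_interval_start} to {data_interval_end}")


    with DAG(
        dag_id="data_interval_example",  # hypothetical DAG, for illustration only
        schedule_interval="@daily",
        start_date=pendulum.datetime(2020, 1, 1, tz="UTC"),
        catchup=True,  # create one run for every day since start_date
    ):
        PythonOperator(task_id="process_day", python_callable=process_day)

Each run's logical date equals its ``data_interval_start`` here, matching the prose above.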
diff --git a/docs/apache-airflow/howto/index.rst b/docs/apache-airflow/howto/index.rst
index efd5c48..9fb80fb 100644
--- a/docs/apache-airflow/howto/index.rst
+++ b/docs/apache-airflow/howto/index.rst
@@ -33,6 +33,7 @@ configuring an Airflow environment.
     set-config
     set-up-database
     operator/index
+    timetable
     customize-state-colors-ui
     customize-dag-ui-page-instance-name
     custom-operator
diff --git a/docs/apache-airflow/howto/timetable.rst b/docs/apache-airflow/howto/timetable.rst
new file mode 100644
index 0000000..2c9ebb3
--- /dev/null
+++ b/docs/apache-airflow/howto/timetable.rst
@@ -0,0 +1,63 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+ .. http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied. See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+
+Customizing DAG Scheduling with Timetables
+==========================================
+
+A DAG's scheduling strategy is determined by its internal "timetable". This
+timetable can be created by specifying the DAG's ``schedule_interval``
+argument, as described in :doc:`DAG Run </dag-run>`. The timetable also
+dictates the data interval and the logical date of each run created for the
+DAG.
+
+However, there are situations when a cron expression or a simple
+``timedelta`` period cannot properly express the schedule. Some examples are:
+
+* Data intervals with "holes" between them, instead of the continuous
+  intervals that both cron expressions and ``timedelta`` schedules represent.
+* Running tasks at different times each day. For example, an astronomer may
+  find it useful to run a task at each sunset, to process data collected
+  during the preceding daylight period.
+* Schedules not following the Gregorian calendar. For example, creating a run
+  for each month in the `Traditional Chinese Calendar`_. This is conceptually
+  similar to the sunset case above, but on a different time scale.
+* Rolling windows, or overlapping data intervals. For example, one may want
+  a run each day, but have each run cover the period of the previous seven
+  days. It is possible to "hack" this with a cron expression, but a custom
+  data interval would make the task specification more natural.
+
+.. _`Traditional Chinese Calendar`: https://en.wikipedia.org/wiki/Chinese_calendar
+
+
+For our example, let's say a company wants to run a job after each weekday
+to process data collected during the work day. The first intuitive answer
+to this would be ``schedule_interval="0 0 * * 1-5"`` (midnight on Monday to
+Friday), but this means data collected on Friday will *not* be processed
+right after Friday ends, but on the next Monday, and that run's interval
+would span from midnight Friday to midnight *Monday*.
+
+This is, therefore, a case of the "holes" category above; the intended
+schedule should leave out the two weekend days. What we want is:
+
+* Schedule a run for each Monday, Tuesday, Wednesday, Thursday, and Friday.
+  The run's data interval would cover midnight of that day to midnight of
+  the next day.
+* Each run would be created right after its data interval ends: the run
+  covering Monday happens at midnight Tuesday, and so on, and the run
+  covering Friday happens at midnight Saturday. No runs happen at midnight
+  Sunday or Monday.
+
+TODO...
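The TODO above presumably continues into the actual implementation walkthrough. As a rough sketch of where it could go: under the AIP-39 interface as drafted, a timetable is a ``Timetable`` subclass (from ``airflow.timetables.base``) registered through a plugin. The names and signatures below follow that draft, and the catchup handling is deliberately omitted; treat this as an assumption-laden sketch, not the finished how-to:

.. code-block:: python

    from datetime import timedelta
    from typing import Optional

    from pendulum import DateTime, Time, timezone

    from airflow.plugins_manager import AirflowPlugin
    from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable

    UTC = timezone("UTC")


    class AfterWorkdayTimetable(Timetable):
        """Schedule a run after each workday, covering that day's midnight-to-midnight interval."""

        def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
            # A manually triggered run covers the most recent workday.
            weekday = run_after.weekday()
            if weekday in (0, 6):  # Monday or Sunday: the last workday was Friday.
                delta = timedelta(days=(weekday - 4) % 7)
            else:  # Any other day: the last workday was yesterday.
                delta = timedelta(days=1)
            start = DateTime.combine((run_after - delta).date(), Time.min).replace(tzinfo=UTC)
            return DataInterval(start=start, end=start + timedelta(days=1))

        def next_dagrun_info(
            self,
            *,
            last_automated_data_interval: Optional[DataInterval],
            restriction: TimeRestriction,
        ) -> Optional[DagRunInfo]:
            if last_automated_data_interval is not None:
                # A previous scheduled run exists; find the next workday.
                last_start = last_automated_data_interval.start
                if last_start.weekday() < 4:  # Monday through Thursday: run tomorrow.
                    delta = timedelta(days=1)
                else:  # Friday: skip the weekend to Monday.
                    delta = timedelta(days=7 - last_start.weekday())
                next_start = last_start + delta
            else:
                # First run ever; start from the DAG's start_date (available as
                # restriction.earliest). Catchup handling omitted for brevity.
                if restriction.earliest is None:
                    return None  # No start_date; don't schedule.
                next_start = DateTime.combine(restriction.earliest.date(), Time.min).replace(tzinfo=UTC)
                if next_start < restriction.earliest:
                    # start_date falls mid-day; begin with the next full day.
                    next_start += timedelta(days=1)
                if next_start.weekday() in (5, 6):  # Weekend: move to next Monday.
                    next_start += timedelta(days=7 - next_start.weekday())
            if restriction.latest is not None and next_start > restriction.latest:
                return None  # Past the DAG's end_date; don't schedule.
            return DagRunInfo.interval(start=next_start, end=next_start + timedelta(days=1))


    class WorkdayTimetablePlugin(AirflowPlugin):
        # Timetables are registered through a plugin so the scheduler and
        # webserver can look them up when deserializing DAGs.
        name = "workday_timetable_plugin"
        timetables = [AfterWorkdayTimetable]

A DAG would then pass ``timetable=AfterWorkdayTimetable()`` instead of ``schedule_interval``, again per the draft interface.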
