This is an automated email from the ASF dual-hosted git repository. kaxilnaik pushed a commit to branch v1-10-test in repository https://gitbox.apache.org/repos/asf/airflow.git
commit 46e08d94f4527a6daf0b5e751b5ce785d8c82721 Author: Kartik Khare <[email protected]> AuthorDate: Wed Nov 27 16:43:10 2019 +0530 [AIRFLOW-XXX] GSoD: Adding Task re-run documentation (#6295) (cherry picked from commit ac2d0bedf2460beb03e5853ce6ad214c0bda9d58) --- docs/dag-run.rst | 196 +++++++++++++++++++++++++++++++++++++++++++++++++++++ docs/index.rst | 1 + docs/scheduler.rst | 184 +++++++------------------------------------------ 3 files changed, 222 insertions(+), 159 deletions(-) diff --git a/docs/dag-run.rst b/docs/dag-run.rst new file mode 100644 index 0000000..313ac8e --- /dev/null +++ b/docs/dag-run.rst @@ -0,0 +1,196 @@ + .. Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + .. http://www.apache.org/licenses/LICENSE-2.0 + + .. Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +DAG Runs +========= +A DAG Run is an object representing an instantiation of the DAG in time. + +Each DAG may or may not have a schedule, which informs how DAG Runs are +created. ``schedule_interval`` is defined as a DAG argument, and receives +preferably a +`cron expression <https://en.wikipedia.org/wiki/Cron#CRON_expression>`_ as +a ``str``, or a ``datetime.timedelta`` object. + +.. 
tip:: + You can use an online editor for CRON expressions such as `Crontab guru <https://crontab.guru/>`_ + +Alternatively, you can also use one of these cron "presets": + ++----------------+----------------------------------------------------------------+-----------------+ +| preset | meaning | cron | ++================+================================================================+=================+ +| ``None`` | Don't schedule, use for exclusively "externally triggered" | | +| | DAGs | | ++----------------+----------------------------------------------------------------+-----------------+ +| ``@once`` | Schedule once and only once | | ++----------------+----------------------------------------------------------------+-----------------+ +| ``@hourly`` | Run once an hour at the beginning of the hour | ``0 * * * *`` | ++----------------+----------------------------------------------------------------+-----------------+ +| ``@daily`` | Run once a day at midnight | ``0 0 * * *`` | ++----------------+----------------------------------------------------------------+-----------------+ +| ``@weekly`` | Run once a week at midnight on Sunday morning | ``0 0 * * 0`` | ++----------------+----------------------------------------------------------------+-----------------+ +| ``@monthly`` | Run once a month at midnight of the first day of the month | ``0 0 1 * *`` | ++----------------+----------------------------------------------------------------+-----------------+ +| ``@quarterly`` | Run once a quarter at midnight on the first day | ``0 0 1 */3 *`` | ++----------------+----------------------------------------------------------------+-----------------+ +| ``@yearly`` | Run once a year at midnight of January 1 | ``0 0 1 1 *`` | ++----------------+----------------------------------------------------------------+-----------------+ + +Your DAG will be instantiated for each schedule along with a corresponding +DAG Run entry in the database backend. + +.. 
note:: + + If you run a DAG on a ``schedule_interval`` of one day, the run stamped 2020-01-01 + will be triggered soon after 2020-01-01T23:59. In other words, the job instance is + started once the period it covers has ended. The ``execution_date`` available in the context + will also be 2020-01-01. + + The first DAG Run is created based on the minimum ``start_date`` for the tasks in your DAG. + Subsequent DAG Runs are created by the scheduler process, based on your DAG’s ``schedule_interval``, + sequentially. If your ``start_date`` is 2020-01-01 and your ``schedule_interval`` is ``@daily``, the first run + will be created on 2020-01-02, i.e., after your start date has passed. + +Re-run DAG +'''''''''' +There can be cases where you will want to execute your DAG again. One such case is when the scheduled +DAG run fails. + +.. _dag-catchup: + +Catchup +------- + +An Airflow DAG with a ``start_date``, possibly an ``end_date``, and a ``schedule_interval`` defines a +series of intervals which the scheduler turns into individual DAG Runs and executes. The scheduler, by default, will +kick off a DAG Run for any interval that has not been run since the last execution date (or has been cleared). This concept is called Catchup. + +If your DAG is written to handle its own catchup (i.e., it is not limited to a single interval but, for instance, always processes data up to ``Now``), +then you will want to turn catchup off. This can be done by setting ``catchup=False`` on the DAG or ``catchup_by_default = False`` +in the configuration file. When turned off, the scheduler creates a DAG Run only for the latest interval. + +.. 
code:: python + + """ + Code that goes along with the Airflow tutorial located at: + https://github.com/apache/airflow/blob/master/airflow/example_dags/tutorial.py + """ + from airflow import DAG + from airflow.operators.bash_operator import BashOperator + from datetime import datetime, timedelta + + + default_args = { + 'owner': 'Airflow', + 'depends_on_past': False, + 'email': ['[email protected]'], + 'email_on_failure': False, + 'email_on_retry': False, + 'retries': 1, + 'retry_delay': timedelta(minutes=5) + } + + dag = DAG( + 'tutorial', + default_args=default_args, + start_date=datetime(2015, 12, 1), + description='A simple tutorial DAG', + schedule_interval='@daily', + catchup=False) + +In the example above, if the DAG is picked up by the scheduler daemon on 2016-01-02 at 6 AM +(or from the command line), a single DAG Run will be created with an ``execution_date`` of 2016-01-01, +and the next one will be created just after midnight on the morning of 2016-01-03 with an execution date of 2016-01-02. + +If the ``dag.catchup`` value had been ``True`` instead, the scheduler would have created a DAG Run +for each completed interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02, +as that interval hasn’t completed) and would have executed them sequentially. + +Catchup is also triggered when you turn off a DAG for a specified period and then re-enable it. + +This behavior is great for atomic datasets that can easily be split into periods. Turning catchup off is great +if your DAG performs catchup internally. + + +Backfill +--------- +There may be cases when you want to run the DAG for a specified historical period, e.g., +a data-filling DAG is created with ``start_date`` **2019-11-21**, but another user requires the output data from a month ago, i.e., **2019-10-21**. +This process is known as Backfill. + +You may want to backfill the data even in cases when catchup is disabled. This can be done through the CLI. 
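Conceptually, both catchup and backfill simply materialize one DAG Run per schedule interval between two dates. The snippet below is a minimal sketch of that interval arithmetic in plain Python; it is an illustration only, not Airflow's actual scheduler code, and ``execution_dates`` is a hypothetical helper. The dates match the backfill example above.

```python
from datetime import datetime, timedelta


def execution_dates(start, end, interval=timedelta(days=1)):
    """Yield one execution_date per schedule interval in [start, end]."""
    current = start
    while current <= end:
        yield current
        current += interval


# A daily backfill covering the missing month, 2019-10-21 through 2019-11-20,
# creates one DAG Run per day:
dates = list(execution_dates(datetime(2019, 10, 21), datetime(2019, 11, 20)))
print(len(dates), dates[0].date(), dates[-1].date())  # prints: 31 2019-10-21 2019-11-20
```

Each yielded date plays the role of an ``execution_date``: it stamps the start of the interval, while the corresponding run only happens once that interval has ended.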
+
Run the command below + +.. code:: bash + + airflow backfill -s START_DATE -e END_DATE dag_id + +The `backfill command <cli-ref.html#backfill>`_ will re-run all the instances of the dag_id for all the intervals within the start date and end date. + +Re-run Tasks +------------ +Some of the tasks can fail during the scheduled run. Once you have fixed +the errors after going through the logs, you can re-run the tasks by clearing them for the +scheduled date. Clearing a task instance doesn't delete the task instance record. +Instead, it updates ``max_tries`` to ``0`` and sets the current task instance state to ``None``, which forces the task to re-run. + +Click on the failed task in the Tree or Graph views and then click on **Clear**. +The executor will re-run it. + +There are multiple options you can select to re-run - + +* **Past** - All the instances of the task in the runs before the current DAG's execution date +* **Future** - All the instances of the task in the runs after the current DAG's execution date +* **Upstream** - The upstream tasks in the current DAG +* **Downstream** - The downstream tasks in the current DAG +* **Recursive** - All the tasks in the child DAGs and parent DAGs +* **Failed** - Only the failed tasks in the current DAG + +You can also clear the tasks through the CLI using the command: + +.. code:: bash + + airflow clear dag_id -t task_regex -s START_DATE -e END_DATE + +For the specified ``dag_id`` and time interval, the command clears all instances of the tasks matching the regex. +For more options, you can check the help of the `clear command <cli-ref.html#clear>`_: + +.. code:: bash + + airflow clear -h + +External Triggers +''''''''''''''''' + +Note that DAG Runs can also be created manually through the CLI. Just run the command - + +.. 
code:: bash + + airflow trigger_dag -e EXECUTION_DATE dag_id + +The DAG Runs created externally to the scheduler get associated with the trigger’s timestamp and are displayed +in the UI alongside scheduled DAG runs. The execution date passed inside the DAG can be specified using the ``-e`` argument. +The default is the current date in the UTC timezone. + +In addition, you can manually trigger a DAG Run using the web UI (tab **DAGs** -> column **Links** -> button **Trigger Dag**). + +To Keep in Mind +'''''''''''''''' +* Marking task instances as failed can be done through the UI. This can be used to stop running task instances. +* Marking task instances as successful can be done through the UI. This is mostly to fix false negatives, or + for instance, when the fix has been applied outside of Airflow. diff --git a/docs/index.rst b/docs/index.rst index 44717ac..3a0ea91 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -85,6 +85,7 @@ Content profiling scheduler executor/index + dag-run plugins security timezone diff --git a/docs/scheduler.rst b/docs/scheduler.rst index d4eddf3..974a934 100644 --- a/docs/scheduler.rst +++ b/docs/scheduler.rst @@ -15,13 +15,11 @@ specific language governing permissions and limitations under the License. +Scheduler +========== - -Scheduling & Triggers -===================== - -The Airflow scheduler monitors all tasks and all DAGs, and triggers the -task instances whose dependencies have been met. Behind the scenes, +The Airflow scheduler monitors all tasks and all DAGs and triggers the +task instances whose dependencies have been met. Behind the scenes, it spins up a subprocess, which monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) collects DAG parsing results and inspects active tasks to see whether they can be triggered. The Airflow scheduler is designed to run as a persistent service in an Airflow production environment. 
To kick it off, all you need to do is -execute ``airflow scheduler``. It will use the configuration specified in +execute the ``airflow scheduler`` command. It uses the configuration specified in ``airflow.cfg``. -Note that if you run a DAG on a ``schedule_interval`` of one day, -the run stamped ``2016-01-01`` will be trigger soon after ``2016-01-01T23:59``. -In other words, the job instance is started once the period it covers -has ended. - -**Let's Repeat That** The scheduler runs your job one ``schedule_interval`` AFTER the -start date, at the END of the period. - -The scheduler starts an instance of the executor specified in the your -``airflow.cfg``. If it happens to be the :class:`airflow.contrib.executors.local_executor.LocalExecutor`, tasks will be -executed as subprocesses; in the case of :class:`airflow.executors.celery_executor.CeleryExecutor`, :class:`airflow.executors.dask_executor.DaskExecutor``, and -:class:`airflow.contrib.executors.mesos_executor.MesosExecutor`, tasks are executed remotely. +The scheduler uses the configured :doc:`Executor </executor/index>` to run tasks that are ready. To start a scheduler, simply run the command: @@ -51,144 +38,23 @@ To start a scheduler, simply run the command: airflow scheduler +Your DAGs will start executing once the scheduler is running successfully. + +.. note:: + + The first DAG Run is created based on the minimum ``start_date`` for the tasks in your DAG. + Subsequent DAG Runs are created by the scheduler process, based on your DAG’s ``schedule_interval``, + sequentially. + + +The scheduler won't trigger your tasks until the period it covers has ended, e.g., a job with ``schedule_interval`` set to ``@daily`` runs after the day +has ended. This technique makes sure that whatever data is required for that period is fully available before the DAG is executed. +In the UI, it appears as if Airflow is running your tasks a day **late**. + +.. 
note:: + + If you run a DAG on a ``schedule_interval`` of one day, the run with ``execution_date`` ``2019-11-21`` triggers soon after ``2019-11-21T23:59``. + + **Let’s Repeat That**, the scheduler runs your job one ``schedule_interval`` AFTER the start date, at the END of the period. -DAG Runs -'''''''' - -A DAG Run is an object representing an instantiation of the DAG in time. - -Each DAG may or may not have a schedule, which informs how ``DAG Runs`` are -created. ``schedule_interval`` is defined as a DAG arguments, and receives -preferably a -`cron expression <https://en.wikipedia.org/wiki/Cron#CRON_expression>`_ as -a ``str``, or a ``datetime.timedelta`` object. Alternatively, you can also -use one of these cron "preset": - -+----------------+----------------------------------------------------------------+-----------------+ -| preset | meaning | cron | -+================+================================================================+=================+ -| ``None`` | Don't schedule, use for exclusively "externally triggered" | | -| | DAGs | | -+----------------+----------------------------------------------------------------+-----------------+ -| ``@once`` | Schedule once and only once | | -+----------------+----------------------------------------------------------------+-----------------+ -| ``@hourly`` | Run once an hour at the beginning of the hour | ``0 * * * *`` | -+----------------+----------------------------------------------------------------+-----------------+ -| ``@daily`` | Run once a day at midnight | ``0 0 * * *`` | -+----------------+----------------------------------------------------------------+-----------------+ -| ``@weekly`` | Run once a week at midnight on Sunday morning | ``0 0 * * 0`` | -+----------------+----------------------------------------------------------------+-----------------+ -| ``@monthly`` | Run once a month at midnight of the first day of the month | ``0 0 1 * *`` | 
-+----------------+----------------------------------------------------------------+-----------------+ -| ``@quarterly`` | Run once a quarter at midnight on the first day | ``0 0 1 */3 *`` | -+----------------+----------------------------------------------------------------+-----------------+ -| ``@yearly`` | Run once a year at midnight of January 1 | ``0 0 1 1 *`` | -+----------------+----------------------------------------------------------------+-----------------+ - -**Note**: Use ``schedule_interval=None`` and not ``schedule_interval='None'`` when -you don't want to schedule your DAG. - -Your DAG will be instantiated -for each schedule, while creating a ``DAG Run`` entry for each schedule. - -DAG runs have a state associated to them (running, failed, success) and -informs the scheduler on which set of schedules should be evaluated for -task submissions. Without the metadata at the DAG run level, the Airflow -scheduler would have much more work to do in order to figure out what tasks -should be triggered and come to a crawl. It might also create undesired -processing when changing the shape of your DAG, by say adding in new -tasks. - -Backfill and Catchup -'''''''''''''''''''' - -An Airflow DAG with a ``start_date``, possibly an ``end_date``, and a ``schedule_interval`` defines a -series of intervals which the scheduler turn into individual Dag Runs and execute. A key capability of -Airflow is that these DAG Runs are atomic, idempotent items, and the scheduler, by default, will examine -the lifetime of the DAG (from start to end/now, one interval at a time) and kick off a DAG Run for any -interval that has not been run (or has been cleared). This concept is called Catchup. 
- -If your DAG is written to handle its own catchup (IE not limited to the interval, but instead to "Now" -for instance.), then you will want to turn catchup off (Either on the DAG itself with ``dag.catchup = -False``) or by default at the configuration file level with ``catchup_by_default = False``. What this -will do, is to instruct the scheduler to only create a DAG Run for the most current instance of the DAG -interval series. - -.. code:: python - - """ - Code that goes along with the Airflow tutorial located at: - https://github.com/apache/airflow/blob/master/airflow/example_dags/tutorial.py - """ - from airflow import DAG - from airflow.operators.bash_operator import BashOperator - from datetime import datetime, timedelta - - - default_args = { - 'owner': 'Airflow', - 'depends_on_past': False, - 'start_date': datetime(2015, 12, 1), - 'email': ['[email protected]'], - 'email_on_failure': False, - 'email_on_retry': False, - 'retries': 1, - 'retry_delay': timedelta(minutes=5), - 'schedule_interval': '@daily', - } - - dag = DAG('tutorial', catchup=False, default_args=default_args) - -In the example above, if the DAG is picked up by the scheduler daemon on 2016-01-02 at 6 AM, (or from the -command line), a single DAG Run will be created, with an ``execution_date`` of 2016-01-01, and the next -one will be created just after midnight on the morning of 2016-01-03 with an execution date of 2016-01-02. - -If the ``dag.catchup`` value had been True instead, the scheduler would have created a DAG Run for each -completed interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02, as that interval -hasn't completed) and the scheduler will execute them sequentially. This behavior is great for atomic -datasets that can easily be split into periods. Turning catchup off is great if your DAG Runs perform -backfill internally. 
- -External Triggers -''''''''''''''''' - -Note that ``DAG Runs`` can also be created manually through the CLI while -running an ``airflow trigger_dag`` command, where you can define a -specific ``run_id``. The ``DAG Runs`` created externally to the -scheduler get associated to the trigger's timestamp, and will be displayed -in the UI alongside scheduled ``DAG runs``. - - -To Keep in Mind -''''''''''''''' -* The first ``DAG Run`` is created based on the minimum ``start_date`` for the - tasks in your DAG. -* Subsequent ``DAG Runs`` are created by the scheduler process, based on - your DAG's ``schedule_interval``, sequentially. -* When clearing a set of tasks' state in hope of getting them to re-run, - it is important to keep in mind the ``DAG Run``'s state too as it defines - whether the scheduler should look into triggering tasks for that run. - -Here are some of the ways you can **unblock tasks**: - -* From the UI, you can **clear** (as in delete the status of) individual task instances - from the task instances dialog, while defining whether you want to includes the past/future - and the upstream/downstream dependencies. Note that a confirmation window comes next and - allows you to see the set you are about to clear. You can also clear all task instances - associated with the dag. -* The CLI command ``airflow clear -h`` has lots of options when it comes to clearing task instance - states, including specifying date ranges, targeting task_ids by specifying a regular expression, - flags for including upstream and downstream relatives, and targeting task instances in specific - states (``failed``, or ``success``) -* Clearing a task instance will no longer delete the task instance record. Instead it updates - max_tries and set the current task instance state to be None. -* Marking task instances as failed can be done through the UI. This can be used to stop running task instances. -* Marking task instances as successful can be done through the UI. 
This is mostly to fix false negatives, - or for instance when the fix has been applied outside of Airflow. -* The ``airflow backfill`` CLI subcommand has a flag to ``--mark_success`` and allows selecting - subsections of the DAG as well as specifying date ranges. - -If you want to use 'external trigger' to run future-dated execution dates, set ``allow_trigger_in_future = True`` in ``scheduler`` section in ``airflow.cfg``. -This only has effect if your DAG has no ``schedule_interval``. -If you keep default ``allow_trigger_in_future = False`` and try 'external trigger' to run future-dated execution dates, -the scheduler won't execute it now but the scheduler will execute it in the future once the current date rolls over to the execution date. + You should refer to :doc:`dag-run` for details on scheduling a DAG.
