[GitHub] [airflow] jedcunningham commented on a diff in pull request #26208: Update docs for data-aware scheduling (AKA "datasets"/AIP-48)

GitBox Wed, 07 Sep 2022 12:00:15 -0700


jedcunningham commented on code in PR #26208:
URL: https://github.com/apache/airflow/pull/26208#discussion_r965178315



##########
docs/apache-airflow/concepts/datasets.rst:
##########
@@ -15,33 +15,185 @@
     specific language governing permissions and limitations
     under the License.
 
-Datasets
-========
+Data-aware scheduling
+=====================
 
 .. versionadded:: 2.4
 
-With datasets, instead of running a DAG on a schedule, a DAG can be configured 
to run when a dataset has been updated.
+Quickstart
+----------
 
-To use this feature, define a dataset:
+In addition to scheduling DAGs based upon time, they can also be scheduled 
based upon a task updating a dataset.
 
-.. exampleinclude:: /../../airflow/example_dags/example_datasets.py
-    :language: python
-    :start-after: [START dataset_def]
-    :end-before: [END dataset_def]
+.. code-block:: python
 
-Then reference the dataset as a task outlet:
+    from airflow import Dataset
 
-.. exampleinclude:: /../../airflow/example_dags/example_datasets.py
-    :language: python
-    :dedent: 4
-    :start-after: [START task_outlet]
-    :end-before: [END task_outlet]
+    with DAG(...):
+        MyOperator(
+            # this task updates example.csv
+            outlets=[Dataset("s3://dataset-bucket/example.csv")],
+            ...,
+        )
 
-Finally, define a DAG and reference this dataset in the DAG's ``schedule`` 
argument:
 
-.. exampleinclude:: /../../airflow/example_dags/example_datasets.py
-    :language: python
-    :start-after: [START dag_dep]
-    :end-before: [END dag_dep]
+    with DAG(
+        # this DAG should be run when example.csv is updated (by dag1)
+        schedule=[Dataset("s3://dataset-bucket/example.csv")],
+        ...,
+    ):
+        ...
 
-You can reference multiple datasets in the DAG's ``schedule`` argument.  Once 
there has been an update to all of the upstream datasets, the DAG will be 
triggered.  This means that the DAG will run as frequently as its 
least-frequently-updated dataset.
+What is a "dataset"?
+--------------------
+
+An Airflow dataset is a stand-in for a logical grouping of data. Datasets may 
be updated by upstream "producer" tasks, and dataset updates contribute to 
scheduling downstream "consumer" DAGs.
+
+A dataset is defined by a Uniform Resource Identifier (URI):
+
+.. code-block:: python
+
+    from airflow import Dataset
+
+    example_dataset = Dataset('s3://dataset-bucket/example.csv')
+
+Airflow makes no assumptions about the content or location of the data 
represented by the identifier. It is treated as a string, so any use of regular 
expressions (eg ``input_\d+.csv``) or file glob patterns (eg 
``input_2022*.csv``) as an attempt to create multiple datasets from one 
declaration will not work.

Review Comment:
   ```suggestion
   Airflow makes no assumptions about the content or location of the data 
represented by the URI. It is treated as a string, so any use of regular 
expressions (eg ``input_\d+.csv``) or file glob patterns (eg 
``input_2022*.csv``) as an attempt to create multiple datasets from one 
declaration will not work.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [airflow] jedcunningham commented on a diff in pull request #26208: Update docs for data-aware scheduling (AKA "datasets"/AIP-48)

Reply via email to