uranusjr commented on code in PR #38687:
URL: https://github.com/apache/airflow/pull/38687#discussion_r1548779735
##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########
@@ -51,38 +51,38 @@ In addition to scheduling DAGs based upon time, they can also be scheduled based
What is a "dataset"?
--------------------
-An Airflow dataset is a stand-in for a logical grouping of data. Datasets may be updated by upstream "producer" tasks, and dataset updates contribute to scheduling downstream "consumer" DAGs.
+An Airflow Dataset is a logical grouping of data. Upstream producer tasks can update datasets, and dataset updates contribute to scheduling downstream consumer DAGs.
-A dataset is defined by a Uniform Resource Identifier (URI):
+Uniform Resource Identifier (URI) define datasets:
.. code-block:: python
from airflow.datasets import Dataset
example_dataset = Dataset("s3://dataset-bucket/example.csv")
-Airflow makes no assumptions about the content or location of the data represented by the URI. It is treated as a string, so any use of regular expressions (eg ``input_\d+.csv``) or file glob patterns (eg ``input_2022*.csv``) as an attempt to create multiple datasets from one declaration will not work.
+Airflow makes no assumptions about the content or location of the data represented by the URI, and treats the URI like a string. This means that Airflow treats any regular expressions, like ``input_\d+.csv``, or file glob patterns, such as ``input_2022*.csv``, as an attempt to create multiple datasets from one declaration, and they will not work.
-A dataset should be created with a valid URI. Airflow core and providers define various URI schemes that you can use, such as ``file`` (core), ``postgres`` (by the Postgres provider), and ``s3`` (by the Amazon provider). Third-party providers and plugins may also provide their own schemes. These pre-defined schemes have individual semantics that are expected to be followed.
+You must create datasets with a valid URI. Airflow core and providers define various URI schemes that you can use, such as ``file`` (core), ``postgres`` (by the Postgres provider), and ``s3`` (by the Amazon provider). Third-party providers and plugins might also provide their own schemes. These pre-defined schemes have individual semantics that are expected to be followed.
What is valid URI?
------------------
-Technically, the URI must conform to the valid character set in RFC 3986. If you don't know what this means, that's basically ASCII alphanumeric characters, plus ``%``, ``-``, ``_``, ``.``, and ``~``. To identify a resource that cannot be represented by URI-safe characters, encode the resource name with `percent-encoding <https://en.wikipedia.org/wiki/Percent-encoding>`_.
+Technically, the URI must conform to the valid character set in RFC 3986, which is basically ASCII alphanumeric characters, plus ``%``, ``-``, ``_``, ``.``, and ``~``. To identify a resource that cannot be represented by URI-safe characters, encode the resource name with `percent-encoding <https://en.wikipedia.org/wiki/Percent-encoding>`_.
Review Comment:
We should probably add a link to the Wikipedia entry on URI somewhere too.
https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
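To illustrate the percent-encoding paragraph in the hunk above, here is a minimal sketch using only the Python standard library. The bucket and key names are made up for illustration, and the resulting string is assumed to be what a user would then pass to ``Dataset(...)`` as in the docs example. It is not taken from the PR.

```python
from urllib.parse import quote, unquote

# Hypothetical resource name containing characters outside the URI-safe set
# (a space and non-ASCII letters).
raw_key = "reports/Q1 résumé.csv"

# quote() percent-encodes everything except ASCII alphanumerics and "_.-~".
# By default it also leaves "/" alone (safe="/"), so path separators survive.
encoded_key = quote(raw_key)

# Assumed bucket name, mirroring the style of the example in the docs.
dataset_uri = f"s3://dataset-bucket/{encoded_key}"
print(dataset_uri)

# unquote() reverses the encoding, so the original name is recoverable.
assert unquote(encoded_key) == raw_key
```

The encoded string contains only characters from the set the paragraph lists, which is the property the docs ask for.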