eladkal commented on code in PR #37005:
URL: https://github.com/apache/airflow/pull/37005#discussion_r1498891685


##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########
@@ -63,10 +63,19 @@ A dataset is defined by a Uniform Resource Identifier (URI):
 
 Airflow makes no assumptions about the content or location of the data represented by the URI. It is treated as a string, so any use of regular expressions (eg ``input_\d+.csv``) or file glob patterns (eg ``input_2022*.csv``) as an attempt to create multiple datasets from one declaration will not work.
 
-There are two restrictions on the dataset URI:
+A dataset should be created with a valid URI. Airflow core and providers define various URI schemes that you can use, such as ``file`` (core), ``https`` (by the HTTP provider), and ``s3`` (by the Amazon provider). Third-party providers and plugins may also provide their own schemes. These pre-defined schemes have individual semantics that are expected to be followed.
 
-1. It must be a valid URI, which means it must be composed of only ASCII characters.
-2. The URI scheme cannot be ``airflow`` (this is reserved for future use).
+.. note::
+
+    Technically, the URI must conform to the valid character set in RFC 3986. If you don't know what this means, that's basically ASCII alphanumeric characters, plus ``%``, ``-``, ``_``, ``.``, and ``~``. To identify a resource that cannot be represented by URI-safe characters, encode the resource name with `percent-encoding <https://en.wikipedia.org/wiki/Percent-encoding>`_.
+
+    The URI is also case sensitive, so ``s3://example/dataset`` and ``s3://Example/Dataset`` are considered different. Note that the *host* part of the URI is also case sensitive, which differs from RFC 3986.
+
+    Airflow always prefers using lower cases in schemes, and case sensitivity is needed in the host part to correctly distinguish between resources.
+
+If you wish to define datasets with a scheme without additional semantic constraints, use a scheme with the prefix ``x-``. Airflow will skip any semantic validation on URIs with such schemes.

Review Comment:
   I think an example would be useful here.
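   For instance, something along these lines (the ``x-my-datasource`` scheme name is made up purely for illustration). Since the ``x-`` prefix opts out of semantic validation, the URI only has to parse as a plain RFC 3986 URI, which a quick stdlib check can demonstrate:

```python
from urllib.parse import urlsplit

# Hypothetical scheme name "x-my-datasource": the "x-" prefix means Airflow
# skips semantic validation, so the URI just needs to be syntactically valid.
uri = "x-my-datasource://warehouse/orders"

parts = urlsplit(uri)
print(parts.scheme)  # x-my-datasource
print(parts.netloc)  # warehouse
print(parts.path)    # /orders
```

   In the doc itself the example would presumably just be the one-liner ``Dataset("x-my-datasource://warehouse/orders")``.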



##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########
@@ -63,10 +63,19 @@ A dataset is defined by a Uniform Resource Identifier (URI):
 
 Airflow makes no assumptions about the content or location of the data represented by the URI. It is treated as a string, so any use of regular expressions (eg ``input_\d+.csv``) or file glob patterns (eg ``input_2022*.csv``) as an attempt to create multiple datasets from one declaration will not work.
 
-There are two restrictions on the dataset URI:
+A dataset should be created with a valid URI. Airflow core and providers define various URI schemes that you can use, such as ``file`` (core), ``https`` (by the HTTP provider), and ``s3`` (by the Amazon provider). Third-party providers and plugins may also provide their own schemes. These pre-defined schemes have individual semantics that are expected to be followed.
 
-1. It must be a valid URI, which means it must be composed of only ASCII characters.
-2. The URI scheme cannot be ``airflow`` (this is reserved for future use).
+.. note::

Review Comment:
   Let's change this from a note to a title: "What is a valid URI?"
   I prefer a title because titles get anchor links, so we can share a link directly to the section.
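   Something like this, perhaps (the underline character would need to match whatever heading level this sits at in datasets.rst):

```rst
What is a valid URI?
--------------------
```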



##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########
@@ -63,10 +63,19 @@ A dataset is defined by a Uniform Resource Identifier (URI):
 
 Airflow makes no assumptions about the content or location of the data represented by the URI. It is treated as a string, so any use of regular expressions (eg ``input_\d+.csv``) or file glob patterns (eg ``input_2022*.csv``) as an attempt to create multiple datasets from one declaration will not work.
 
-There are two restrictions on the dataset URI:
+A dataset should be created with a valid URI. Airflow core and providers define various URI schemes that you can use, such as ``file`` (core), ``https`` (by the HTTP provider), and ``s3`` (by the Amazon provider). Third-party providers and plugins may also provide their own schemes. These pre-defined schemes have individual semantics that are expected to be followed.

Review Comment:
   Do we have a way to auto-generate a page listing all available URI schemes?
   We already [auto-generate notifiers from the provider.yaml](https://airflow.apache.org/docs/apache-airflow-providers/core-extensions/notifications.html#amazon), so I imagine the same generation logic could be used to build a list here?
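   If the provider.yaml files carry a field for dataset URI schemes, the generation step could be a simple walk over the parsed metadata. A rough sketch of the idea (the ``dataset-uris``/``schemes`` key names and sample data below are assumptions for illustration, not the actual provider.yaml layout):

```python
# Sketch: build a scheme -> provider table from already-parsed provider.yaml
# data, mirroring how the notifier list is generated. The "dataset-uris" and
# "schemes" keys plus the sample entries are assumed for illustration.
providers = {
    "apache-airflow-providers-amazon": {"dataset-uris": [{"schemes": ["s3"]}]},
    "apache-airflow-providers-http": {"dataset-uris": [{"schemes": ["http", "https"]}]},
}

def collect_schemes(providers: dict) -> dict:
    """Map each URI scheme to the provider package that declares it."""
    table = {}
    for package, meta in providers.items():
        for entry in meta.get("dataset-uris", []):
            for scheme in entry.get("schemes", []):
                table[scheme] = package
    return table

for scheme, package in sorted(collect_schemes(providers).items()):
    print(f"``{scheme}`` -- {package}")
```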



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
