jedcunningham commented on code in PR #37005:
URL: https://github.com/apache/airflow/pull/37005#discussion_r1500259622
##########
tests/datasets/test_dataset.py:
##########
@@ -45,18 +45,31 @@ def clear_datasets():
pytest.param("", id="empty"),
pytest.param("\n\t", id="whitespace"),
pytest.param("a" * 3001, id="too_long"),
- pytest.param("airflow:" * 3001, id="reserved_scheme"),
- pytest.param("😊" * 3001, id="non-ascii"),
+ pytest.param("airflow://xcom/dag/task", id="reserved_scheme"),
+ pytest.param("😊", id="non-ascii"),
+ pytest.param("ftp://user@localhost/foo.txt", id="has-auth"),
Review Comment:
Is this (user info) the only breaking change for core? I know people
shouldn't have done it, but I'm not sure we should blow up if they have.
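To make the concern concrete, here is a minimal sketch of the kind of user-info check the PR appears to introduce. The helper name is hypothetical and this is not Airflow's actual implementation; it only illustrates which URIs would start failing:

```python
from urllib.parse import urlsplit


def validate_dataset_uri(uri: str) -> str:
    """Reject dataset URIs that embed auth info, e.g. ``ftp://user@host/...``.

    Hypothetical helper illustrating the review discussion; not Airflow code.
    """
    parsed = urlsplit(uri)
    # urlsplit exposes any userinfo component via .username / .password
    if parsed.username or parsed.password:
        raise ValueError(f"Dataset URI must not contain auth info: {uri!r}")
    return uri


validate_dataset_uri("s3://bucket/key")  # no userinfo, accepted
```

Under this sketch, any existing dataset declared as ``ftp://user@localhost/foo.txt`` (the new test case above) would raise at parse time, which is the breaking behavior being questioned.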
##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########
@@ -63,28 +63,46 @@ A dataset is defined by a Uniform Resource Identifier (URI):
Airflow makes no assumptions about the content or location of the data
represented by the URI. It is treated as a string, so any use of regular
expressions (e.g. ``input_\d+.csv``) or file glob patterns (e.g.
``input_2022*.csv``) in an attempt to create multiple datasets from one
declaration will not work.
-There are two restrictions on the dataset URI:
+A dataset should be created with a valid URI. Airflow core and providers
+define various URI schemes that you can use, such as ``file`` (core), ``https``
+(by the HTTP provider), and ``s3`` (by the Amazon provider). Third-party
+providers and plugins may also provide their own schemes. These pre-defined
+schemes have individual semantics that are expected to be followed.
Review Comment:
You mention the https provider defines a scheme, but I don't see that in
this PR?
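Separately from the scheme question, the reserved-scheme test case above (``airflow://xcom/dag/task``) suggests a check along these lines. This is a hypothetical sketch, assuming only that the ``airflow`` scheme is reserved for internal use as the docs change implies, not the provider code referenced above:

```python
from urllib.parse import urlsplit

# Assumption: the "airflow" scheme is reserved for internal use.
RESERVED_SCHEME = "airflow"


def check_scheme(uri: str) -> str:
    """Reject URIs using the reserved scheme; hypothetical helper, not Airflow code."""
    scheme = urlsplit(uri).scheme
    if scheme.lower() == RESERVED_SCHEME:
        raise ValueError(f"Scheme {scheme!r} is reserved: {uri!r}")
    return uri


check_scheme("s3://bucket/key")       # provider-defined scheme, accepted
check_scheme("file:///tmp/data.csv")  # core scheme, accepted
```

A ``file`` or ``s3`` URI passes, while ``airflow://xcom/dag/task`` would be rejected, matching the ``reserved_scheme`` test parameter.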
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]