dstandish commented on issue #25186:
URL: https://github.com/apache/airflow/issues/25186#issuecomment-1190803026
It might actually be better to leave this kind of normalization to the
lineage analysis platform, something like Marquez or what have you.
For a project using a lineage tool, you may need to be able to synthesize
your lineage data from multiple sources: not just airflow but also Spark or
dbt or Databricks or what have you. If airflow is "normalizing" the URIs (in
other words, taking the user-supplied string value and changing it), then the
URI airflow reports may no longer match what the other sources report for the
same dataset, and you might get unexpected and undesirable results.
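To illustrate the concern, here is a minimal sketch. The `normalize` helper and the example URI are hypothetical, not anything airflow does today; the point is only that a tool which passes the string through untouched would no longer agree with a normalizing airflow on an exact string match.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(uri):
    """Hypothetical normalizer: lowercase the scheme and host, keep the rest."""
    parts = urlsplit(uri)
    return urlunsplit(
        parts._replace(scheme=parts.scheme.lower(), netloc=parts.netloc.lower())
    )


# The value the user typed in their DAG (made-up example).
user_uri = "postgres://DB-Host/sales/orders"

# What a normalizing airflow would report vs. what another tool, which
# passes the string through unchanged, would report for the same dataset.
airflow_reported = normalize(user_uri)
other_tool_reported = user_uri

print(airflow_reported)                         # postgres://db-host/sales/orders
print(airflow_reported == other_tool_reported)  # False
```

A lineage platform joining on the raw string would treat these as two different datasets unless it applies its own normalization, which is the argument for leaving that step to the platform.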
Within an airflow cluster, if a dag depends on a dataset that airflow has no
knowledge about, and it's due to a false negative, then this is something that
we can actually warn the user about. Normalizing can actually produce a
messier situation, because you allow users to have code that is inconsistent
with the data (i.e. the URI value as registered in airflow) and this is not a
good situation in a data engineering project. E.g. if you are trying to search
_the code_ for references, you might not find them.
What I would be more ok with is throwing an error when Dataset is given a
URI with a hostname or scheme that is not already lowercase.
```
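from urllib.parse import urlsplit


def validate_hostname(val):
    parsed = urlsplit(val)
    # parsed.hostname is already lowercased, so read the raw host from the
    # private _hostinfo tuple (hostname, port) instead.
    hostname = parsed._hostinfo[0]
    if hostname and hostname != hostname.lower():
        raise ValueError("hostname must be lowercase")


def validate_scheme(val):
    # parsed.scheme is already lowercased, so slice the raw string instead.
    i = val.find(':')
    if i > 0:
        scheme = val[:i]
        if scheme != scheme.lower():
            raise ValueError("scheme must be lowercase")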
```
This forces the code to be consistent with data, while not taking on a huge
complexity burden.
Note we can't simply check `parsed.scheme` or `parsed.hostname` because
`urlsplit` has already lowercased them.
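A quick way to see this with the standard library (the URI is just a made-up example):

```python
from urllib.parse import urlsplit

raw = "S3://My-Bucket/key"
parsed = urlsplit(raw)

# Both accessors return lowercased values, so comparing them against
# .lower() would never detect an uppercase character in the input.
print(parsed.scheme)    # s3
print(parsed.hostname)  # my-bucket

# The raw netloc keeps the case the user supplied.
print(parsed.netloc)    # My-Bucket
```

This is why a check has to work from the raw string or the raw netloc rather than from the parsed `scheme` and `hostname` attributes.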
Normalizing and reassembling the URI, particularly if you want to take on
percent encoding and this kind of thing, would get very messy and force us to
take on a lot of complexity that, at least to me, does not sound appealing, and
does not seem like something airflow really needs to handle. With regard to
false negatives, it's not fundamentally different from referencing another task
ID in external task sensor or a dag id in trigger dag operator -- you have to
get the reference right, and that's your responsibility as a user.
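As a small sketch of why percent encoding alone gets messy (the URIs here are invented for illustration): decoding is not a safe normalization, because an encoded reserved character such as `%2F` is not the same thing as the literal `/`, while other spellings such as `%7E`, `%7e` and `~` are all meant to be equivalent.

```python
from urllib.parse import unquote

# One path segment containing an encoded slash vs. two path segments.
a = "s3://bucket/reports/a%2Fb.csv"
b = "s3://bucket/reports/a/b.csv"

# Naively decoding makes two different resources compare equal.
decoded = unquote(a)
print(decoded == b)  # True

# Three spellings that differ as strings but decode to the same character,
# so a normalizer would have to pick one canonical form.
spellings = ["x%7Ey", "x%7ey", "x~y"]
print(len(set(spellings)))                      # 3
print(len({unquote(s) for s in spellings}))     # 1
```

A correct normalizer would need to decode some characters but not others and canonicalize hex case, and that is before dealing with ports, trailing slashes, or query ordering.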