[GitHub] [airflow] dstandish commented on issue #25186: should we normalize scheme and hostname of URI to be lower?

GitBox Wed, 20 Jul 2022 14:58:06 -0700


dstandish commented on issue #25186:
URL: https://github.com/apache/airflow/issues/25186#issuecomment-1190803026


   It might actually be better to leave this normalization to the 
lineage analysis platform, something like Marquez or what have you.  
A project using a lineage tool may need to 
synthesize its lineage data from multiple sources -- not just airflow but also 
spark or DBT or databricks or what have you.  If airflow is "normalizing" the 
URIs (in other words, taking the user-supplied string value and changing it), 
then you might get unexpected and undesirable results.
   
   Within an airflow cluster, if a dag depends on a dataset that airflow has no 
knowledge about, and that is due to a false negative (say, a URI that differs 
only in case), then this is something we can actually warn the user about.  
Normalizing can produce a messier situation, because it allows users to have 
code that is inconsistent with the data (i.e. the URI value as registered in 
airflow), which is not a good situation in a data engineering project.  E.g. if 
you search _the code_ for references, you might not find them.
   
   What I would be more ok with is throwing an error when Dataset is given a 
URI with a hostname or scheme that is not already lowercase.
   
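   ```
   from urllib.parse import urlsplit
   
   def validate_hostname(val):
       parsed = urlsplit(val)
       # _hostinfo is a private attribute of the split result; unlike
       # parsed.hostname, it returns the host exactly as the user wrote it
       hostname = parsed._hostinfo[0]
       if hostname and hostname != hostname.lower():
           raise ValueError("hostname must be lowercase")
   
   def validate_scheme(val):
       # read the scheme from the raw string, before any parsing lowercases it
       i = val.find(':')
       if i > 0:
           scheme = val[:i]
           if scheme != scheme.lower():
               raise ValueError("scheme must be lowercase")
   ```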
   
   This forces the code to be consistent with data, while not taking on a huge 
complexity burden.
   
   Note we can't simply check `parsed.scheme` or `parsed.hostname` because both 
are returned already lowercased.
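
   To illustrate the point about lowercased attributes, here is a small sketch 
using only the standard library.  The URI is a made-up example, not one from 
airflow.

```python
from urllib.parse import urlsplit

# Hypothetical URI with an uppercase scheme and a mixed-case hostname
uri = "S3://My-Bucket.Example.COM/path/to/file"
parsed = urlsplit(uri)

# Both of these come back lowercased, so comparing them against
# their own .lower() can never detect a casing problem
print(parsed.scheme)
print(parsed.hostname)

# netloc keeps the casing the user actually typed
print(parsed.netloc)
```

   This is why the validation has to look at the raw string (or at the 
unmodified netloc) rather than at the parsed attributes.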
   
   Normalizing and reassembling the URI, particularly if you want to take on 
percent encoding and that kind of thing, would get very messy.  It would force 
us to take on a lot of complexity that, at least to me, does not sound 
appealing and does not seem like something airflow really needs to handle.  
With regard to false negatives, it's not fundamentally different from 
referencing another task ID in an external task sensor or a dag id in a trigger 
dag operator: you have to get the reference right, and that's your 
responsibility as a user.
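
   As one hedged illustration of why percent encoding gets messy, here is a 
sketch of a naive decode-then-re-encode step.  `naive_normalize_path` is a 
hypothetical helper written for this example, not anything in airflow.

```python
from urllib.parse import quote, unquote

def naive_normalize_path(path):
    # Hypothetical helper: decode every escape, then re-encode,
    # hoping to land on a single canonical spelling
    return quote(unquote(path))

# An encoded space survives the round trip unchanged
print(naive_normalize_path("/data/my%20file"))

# An encoded slash does not: %2F is decoded to "/", and quote() leaves
# "/" alone by default, so one path segment silently becomes two
print(naive_normalize_path("/data/a%2Fb"))
```

   Handling cases like the second one correctly means deciding, per URI 
component, which escapes are safe to touch, and that is the kind of complexity 
at issue here.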


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

