potiuk commented on issue #25186:
URL: https://github.com/apache/airflow/issues/25186#issuecomment-1190627253

   Actually, the RFC again has an interesting discussion that explains the 
possibilities with pros and cons. We are not the first to raise the question. 
There is a chapter, `6. Normalization and Comparison` - especially "Equivalence" 
and the "Comparison Ladder". Go read it @dstandish (and others who would like to 
comment). It's a very interesting read. We are not bound by the RFC rules, we 
just have to decide what level of false negatives we are going to accept.
   
   There are two things that we have not discussed - percent-encoding 
normalization and path-segment normalization (both described in the RFC).
   
   After reading it, and thinking a bit more, I personally think we should 
normalize all three: case, percent-encoding, and path. I think we should not do 
"scheme" normalization (that would eventually require a pluggable per-scheme 
normalizer), and we should definitely not do "protocol" normalization.
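   A minimal sketch of what those three normalizations could look like, using 
only the standard library (`normalize_uri` is a hypothetical helper for 
illustration, not the actual Airflow implementation):

   ```python
   import posixpath
   from urllib.parse import quote, unquote, urlsplit, urlunsplit

   def normalize_uri(uri: str) -> str:
       """Apply case, percent-encoding, and path-segment normalization
       (RFC 3986, section 6). Scheme semantics are deliberately left alone."""
       parts = urlsplit(uri)
       # Case normalization: scheme and host are case-insensitive.
       scheme = parts.scheme.lower()
       netloc = parts.netloc.lower()
       # Percent-encoding normalization: decode, then re-encode consistently,
       # so e.g. %7E and ~ compare equal.
       path = quote(unquote(parts.path), safe="/")
       # Path-segment normalization: resolve "." and ".." segments.
       if path:
           trailing_slash = path.endswith("/")
           path = posixpath.normpath(path)
           if trailing_slash and not path.endswith("/"):
               path += "/"
       return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
   ```

   With this, `normalize_uri("S3://Bucket/%7Euser/./data")` and 
`normalize_uri("s3://bucket/~user/data")` compare equal as plain strings.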
   
   Why?
   
   The reason I think we should do it is that this is outward-looking data, 
and our URIs will more often than not be typed by users writing DAGs. 
Ultimately, what we are trying to do is find out whether two unrelated DAGs 
operate on the same dataset. Any false negative there is a problem - not 
only for Airflow, but also when the "lineage" data is reported by Airflow 
to other systems.
   
   If we want Airflow to become the source of lineage and dataset data for 
other systems in the future, I think we should take as much responsibility as 
possible for filtering out false negatives. People writing DAGs will make silly 
mistakes, and accidental casing mistakes will happen more often than you think 
(for example, there is a very nasty key combination in vi that I sometimes 
trigger accidentally, which uppercases whatever letter is under the cursor). If 
we do not normalize, DAGs will appear to work mostly properly: the dataset 
will be produced by the source DAG, and it will be consumed by the other DAG (for 
example `s3://a`, `s3://A`). But the link between those two DAGs will be broken, 
and they will not trigger one another. Maybe they will be triggered in some 
other, accidental way (if there is also a time-based trigger) - but the link 
will be lost, and it will not be obvious why.
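   To make the `s3://a` / `s3://A` case concrete - a tiny sketch 
(`host_folded` is a hypothetical helper, not Airflow's implementation) showing 
how plain string comparison yields the false negative, while RFC 3986 case 
normalization of the host repairs the link:

   ```python
   from urllib.parse import urlsplit

   def host_folded(uri: str) -> str:
       # RFC 3986 case normalization: the host (here, the bucket name) is
       # case-insensitive, so fold it to lowercase before comparing.
       parts = urlsplit(uri)
       return parts._replace(netloc=parts.netloc.lower()).geturl()

   # Plain string comparison: a false negative - the link between DAGs is lost.
   assert "s3://a/table" != "s3://A/table"
   # After normalization, the two URIs identify the same dataset.
   assert host_folded("s3://a/table") == host_folded("s3://A/table")
   ```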
   
   Not fully normalizing is fine for a cache, for example (it's OK to have two 
copies). But not for lineage.
   
   IMHO, by not normalizing as much as we can - that is, by not solving the 
problem with automation (we can implement it, no problem, it's just complex) - 
we are delegating the task to the humans monitoring the DAGs. I think we will 
be much better at that task than humans looking at DAGs.
   

