potiuk commented on issue #25186: URL: https://github.com/apache/airflow/issues/25186#issuecomment-1190627253
Actually, again, the RFC has an interesting discussion that explains the possibilities with their pros and cons. We are not the first to raise the question. There is chapter `6. Normalization and Comparison` - especially "Equivalence" and "Comparison Ladder". Go read it @dstandish (and others who would like to comment). It's a very interesting read. We are not bound by the RFC rules, we just have to decide what level of false negatives we are willing to accept. There are two things that we have not discussed yet: percent-encoding normalisation and path-segment normalisation (both described in the RFC).

After reading it and thinking a bit more, I personally think we should normalise all three: case, percent-encoding and path segments. I think we should not do "scheme"-based normalisation (that would eventually require a pluggable, per-scheme normaliser), and for sure we should not do "protocol"-based normalisation.

Why? Because this is outward-looking data, and because our URIs will more often than not be typed by users writing DAGs. Eventually, what we are trying to do is find out whether two unrelated DAGs are operating on the same dataset. Any false negative there is a problem - not only for Airflow, but also when the "lineage" data is reported by Airflow to others. If we want Airflow to become the source of lineage and dataset data for other systems in the future, I think we should take as much responsibility as possible for filtering out false negatives. People writing DAGs will make silly mistakes, and accidental casing mistakes will happen more often than you think (for example, there is a very nasty key combination in vi that I sometimes trigger accidentally, which uppercases whatever letter is under the cursor).

If we do not normalise, the DAGs will kind of work properly: the dataset will be produced by the source DAG and consumed by the other DAG (for example s3://a and s3://A). But the link between those two DAGs will be broken.
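
To make the s3://a versus s3://A case concrete, here is a minimal sketch of the three normalisations discussed above, assuming the RFC in question is RFC 3986 and using only the Python standard library. The name `normalize_uri` and its helpers are hypothetical, not an existing Airflow API. The sketch lowercases only the scheme and the host, as the RFC's case normalisation does, so paths that differ in case still compare as different; whether to go further than that is a separate decision.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986, section 2.3.
_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_percent(text: str) -> str:
    """Decode percent-encoded unreserved characters, uppercase the other escapes."""

    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in _UNRESERVED else "%" + match.group(1).upper()

    return _PCT.sub(fix, text)


def _remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments; assumes an absolute (or empty) path."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == "..":
            # Drop the previous segment, but never the leading "" marking the root.
            if len(output) > 1:
                output.pop()
            if is_last:
                output.append("")  # keep the trailing slash
        elif segment == ".":
            if is_last:
                output.append("")
        else:
            output.append(segment)
    return "/".join(output)


def normalize_uri(uri: str) -> str:
    """Hypothetical helper: case, percent-encoding and path-segment normalisation."""
    parts = urlsplit(uri)
    # Lowercase only the host; any userinfo before "@" stays case-sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    path = _remove_dot_segments(_normalize_percent(parts.path))
    query = _normalize_percent(parts.query)
    return urlunsplit((parts.scheme.lower(), netloc, path, query, parts.fragment))


# Both spellings from the example collapse to the same key.
print(normalize_uri("s3://A") == normalize_uri("s3://a"))
print(normalize_uri("S3://Bucket/data/./raw/../clean/%7euser%2fx"))
```

With something like this applied when a dataset URI is registered and again when it is looked up, the producer and the consumer would end up with the same key even if one of them was typed with a stray uppercase letter.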

Without that link, the two DAGs will not trigger one another. Maybe they will still be triggered some other way by accident (if there is also a time-based trigger), but the link will be lost and it's not going to be obvious why. Not fully normalising is fine for a cache, for example (it is OK to have two copies), but not for lineage.

IMHO, by not normalising as much as we can, instead of solving the problem by automation (we can implement it, no problem, it's just complex), we are delegating the task to the humans who are monitoring the DAGs. I think we will be much better at that task than the humans looking at DAGs.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
