potiuk commented on issue #25186: URL: https://github.com/apache/airflow/issues/25186#issuecomment-1191135361
I think we should at least detect "fat finger" problems. I.e. when someone *inside an Airflow installation* creates two different datasets with equivalent URIs, we should not allow that. We are able to do that very easily and warn the user.

I am perfectly OK with storing the dataset URI without normalisation. But at least we should have a unique index which will prevent the user from creating two different datasets with two equivalent (but different) URIs. This is not difficult. We can, for example, fully normalize the URI, convert it into a base64-encoded string and save it as "unique_id" or something similar in the database. Then whenever we are inserting a dataset with a different URI and the same "unique_id", we simply fail with: "This URI here is the same as that URI there".

It should not be very complex, and I think it prevents users from making silly errors that will be difficult to debug otherwise. I can't think of any real drawback except having to implement "generate_unique_id" using normalisation. Pretty much no performance penalty, much better user experience. Airflow helping the user make fewer mistakes.
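To make the proposal concrete, here is a minimal sketch of the "normalize, base64-encode, reject on collision" idea. It is not Airflow code: `normalize_uri`, `generate_unique_id`, `DatasetRegistry` and `DuplicateDatasetError` are hypothetical names, the in-memory dict only stands in for a database column with a unique index, and the normalisation rules (lowercase scheme and host, drop default ports, collapse dot segments and trailing slashes) are assumptions about what "equivalent" should mean.

```python
import base64
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports; a real implementation would need a fuller table.
_DEFAULT_PORTS = {"http": 80, "https": 443}


class DuplicateDatasetError(ValueError):
    """Raised when a new URI is equivalent to an already registered one."""


def normalize_uri(uri):
    """Return a canonical form of ``uri`` (assumed rules, see lead-in)."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    netloc = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and _DEFAULT_PORTS.get(scheme) != parts.port:
        netloc = f"{netloc}:{parts.port}"
    # Preserve userinfo verbatim, since it is case-sensitive.
    userinfo = parts.netloc.rpartition("@")[0]
    if userinfo:
        netloc = f"{userinfo}@{netloc}"
    # normpath collapses "." and ".." segments and strips a trailing slash.
    path = posixpath.normpath(parts.path) if parts.path else ""
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


def generate_unique_id(uri):
    """Base64-encode the normalized URI, as suggested in the comment."""
    normalized = normalize_uri(uri).encode("utf-8")
    return base64.urlsafe_b64encode(normalized).decode("ascii")


class DatasetRegistry:
    """In-memory stand-in for a table with a unique index on unique_id."""

    def __init__(self):
        self._uri_by_unique_id = {}

    def register(self, uri):
        unique_id = generate_unique_id(uri)
        existing = self._uri_by_unique_id.get(unique_id)
        if existing is not None and existing != uri:
            raise DuplicateDatasetError(
                f"URI {uri!r} is the same as already registered URI {existing!r}"
            )
        # The URI itself is stored as written, without normalisation.
        self._uri_by_unique_id[unique_id] = uri
        return unique_id
```

With this sketch, registering `https://example.com/data/file.csv` and then `HTTPS://Example.com:443/data/../data/file.csv` would raise `DuplicateDatasetError`, while registering the exact same string twice is accepted. In a real database the same effect would come from a unique constraint on the `unique_id` column rather than from a Python dict.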
