blag commented on code in PR #27745:
URL: https://github.com/apache/airflow/pull/27745#discussion_r1028830891


##########
airflow/models/dag.py:
##########
@@ -2822,6 +2822,11 @@ def bulk_write_to_db(
         all_datasets = outlet_datasets
         all_datasets.update(input_datasets)
 
+        # Save this set of URIs for later, since we del all_datasets before we use this
+        datasets_to_remove = {_[0] for _ in session.query(DatasetModel.uri).all()} - {
+            k.uri for k in all_datasets.keys()

Review Comment:
   Well shoot, you're right. I was using `airflow dags reserialize` to explicitly reload DAGs, which does load all of the DAGs at once and gives a correct overview of all referenced (and, more importantly, unreferenced) datasets. But the DAG processor setup loads individual `.py` files one at a time, and does not have a global view of all datasets. So this approach isn't going to work for that.
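   To make the limitation concrete, here is a minimal sketch (with hypothetical URIs, not real Airflow data) of why the set-difference cleanup only finds true orphans when every DAG's datasets are visible at once:

   ```python
   # URIs currently stored in the DatasetModel table (hypothetical values).
   stored_uris = {"s3://a", "s3://b", "s3://c"}

   # With a global view (e.g. `airflow dags reserialize` loading all DAGs),
   # the set difference yields exactly the unreferenced datasets.
   all_referenced = {"s3://a", "s3://b"}
   orphans_global = stored_uris - all_referenced
   assert orphans_global == {"s3://c"}

   # The DAG processor parses one .py file at a time, so it only sees that
   # file's datasets. The same difference then wrongly flags datasets that
   # other, unparsed files still reference.
   file1_referenced = {"s3://a"}
   orphans_per_file = stored_uris - file1_referenced
   assert orphans_per_file == {"s3://b", "s3://c"}  # "s3://b" is a false positive
   ```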



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
