NBardelot opened a new issue, #41185:
URL: https://github.com/apache/airflow/issues/41185

   ### Apache Airflow version
   
   2.9.3
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   When it starts, the scheduler starts marking Datasets as orphans in 
`airflow/jobs/scheduler_job_runner.py` in the method 
`_orphan_unreferenced_datasets`.
   
   A large number of Datasets (~200K) is queried from the Metadata DB. The 
query does not filter out Datasets that are already orphaned: its HAVING 
clause selects every Dataset with neither a DAG nor a Task reference, so 
Datasets already flagged as orphaned are selected again on every run.
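
A minimal sketch of the reported behaviour, using stdlib `sqlite3` with a toy schema (table and column names here are hypothetical stand-ins, not Airflow's actual models): a query that only checks for missing DAG/Task references keeps re-selecting datasets that were already flagged as orphaned, while adding a filter on the orphan flag skips them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, is_orphaned INTEGER DEFAULT 0);
    CREATE TABLE dag_ref (dataset_id INTEGER);
    CREATE TABLE task_ref (dataset_id INTEGER);
""")
# dataset 1: referenced by a DAG; dataset 2: unreferenced and already
# orphaned; dataset 3: unreferenced, not yet orphaned.
conn.executemany("INSERT INTO dataset (id, is_orphaned) VALUES (?, ?)",
                 [(1, 0), (2, 1), (3, 0)])
conn.execute("INSERT INTO dag_ref VALUES (1)")

# Analogous to the reported query: selects every dataset with neither a
# DAG nor a Task reference, INCLUDING the already-orphaned dataset 2.
unfiltered = conn.execute("""
    SELECT d.id FROM dataset d
    LEFT JOIN dag_ref dr ON dr.dataset_id = d.id
    LEFT JOIN task_ref tr ON tr.dataset_id = d.id
    GROUP BY d.id
    HAVING COUNT(dr.dataset_id) = 0 AND COUNT(tr.dataset_id) = 0
""").fetchall()

# With a filter on the orphan flag, only dataset 3 is selected.
filtered = conn.execute("""
    SELECT d.id FROM dataset d
    LEFT JOIN dag_ref dr ON dr.dataset_id = d.id
    LEFT JOIN task_ref tr ON tr.dataset_id = d.id
    WHERE d.is_orphaned = 0
    GROUP BY d.id
    HAVING COUNT(dr.dataset_id) = 0 AND COUNT(tr.dataset_id) = 0
""").fetchall()
```

With 200K rows, the difference between the two queries is the difference between reprocessing everything on each scheduler start and processing only new unreferenced Datasets.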
   
   As we use Kubernetes, the Scheduler's livenessProbe fails, because the 
Scheduler helplessly tries to orphan everything in a single run (note: 
AIRFLOW__SCHEDULER__PARSING_CLEANUP_INTERVAL is 60s, but changing that value 
would not change this behaviour).
   
   Many `INFO - Orphaning unreferenced dataset` messages are logged in the 
Scheduler output.
   
   Kubernetes then kills the unhealthy pod. The pod starts anew, again fails 
to orphan everything at once, and the cycle repeats.
   
   ### What you think should happen instead?
   
   Several things could be done to improve the situation:
   
   1. Limit the number of Datasets selected in the query in 
`_orphan_unreferenced_datasets`, so that the `_set_orphaned` works by batches 
of Datasets to orphan (new config 
`AIRFLOW__SCHEDULER__UNREFERENCED_DATASETS_BATCH_SIZE` for example)
   2. Do not select already orphaned Datasets, by improving the query in 
`_orphan_unreferenced_datasets`
   3. Add an optional TTL to Datasets in the Metadata DB so that old Datasets 
can be ignored by the query and/or cleaned up without the need to orphan them 
beforehand (if the TTL has passed, then orphan or not is not relevant and the 
computation can be skipped)
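
Suggestion 1 could be sketched roughly as follows, again with stdlib `sqlite3` and hypothetical names (the batch size would come from a new config option such as the proposed `AIRFLOW__SCHEDULER__UNREFERENCED_DATASETS_BATCH_SIZE`): orphan at most `batch_size` Datasets per scheduler loop instead of all of them at once, so the livenessProbe can be answered between batches.

```python
import sqlite3

BATCH_SIZE = 2  # would come from the proposed config option

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset (id INTEGER PRIMARY KEY, is_orphaned INTEGER DEFAULT 0)"
)
conn.executemany("INSERT INTO dataset (id) VALUES (?)",
                 [(i,) for i in range(1, 6)])  # 5 unreferenced datasets

def orphan_one_batch(conn, batch_size):
    """Flag at most batch_size not-yet-orphaned datasets as orphaned.

    Returns the number of rows updated. A real implementation would also
    restrict the SELECT to datasets with no DAG/Task reference."""
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM dataset WHERE is_orphaned = 0 LIMIT ?",
        (batch_size,),
    )]
    conn.executemany("UPDATE dataset SET is_orphaned = 1 WHERE id = ?",
                     [(i,) for i in ids])
    return len(ids)

# Each scheduler loop iteration processes one bounded batch; the work is
# spread over several intervals instead of one unbounded run.
batch_sizes = []
while (n := orphan_one_batch(conn, BATCH_SIZE)) > 0:
    batch_sizes.append(n)
```

Combining suggestions 1 and 2 (batching plus filtering out already-orphaned Datasets) would make each cleanup pass both bounded and non-repeating.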
   
   ### How to reproduce
   
   Create more Datasets than the Scheduler can orphan within the 
livenessProbe timeout. The Scheduler then spends the whole interval orphaning, 
the probe is not answered, and the pod is considered unhealthy.
   
   ### Operating System
   
   Kubernetes
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon
   apache-airflow-providers-common-sql
   apache-airflow-providers-elasticsearch
   apache-airflow-providers-hashicorp
   apache-airflow-providers-http
   apache-airflow-providers-microsoft-winrm
   apache-airflow-providers-microsoft-azure
   apache-airflow-providers-opsgenie
   apache-airflow-providers-postgres
   apache-airflow-providers-redis
   apache-airflow-providers-sftp
   apache-airflow-providers-smtp
   apache-airflow-providers-ssh
   
   with the versions pinned by the Airflow 2.9.3 constraints file.
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   AIRFLOW__SCHEDULER__PARSING_CLEANUP_INTERVAL = 60s (default value)
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

