Hi Devs,
Just wanted to share a production scenario that our team is trying to solve in our organization.

A little background: we have 20+ Airflow clusters (v1.10.2, with the Kubernetes Executor and MySQL as the meta-store), around 10k active workflows, and 100k daily runs. Each solution schedules its workflows differently: some run on a fixed schedule while others are triggered on demand. The smallest configurable scheduling interval is one minute, so a single workflow can accumulate up to 1440 runs/day and about 43200 runs/month. If such a workflow is not deleted for 2-3 months, all of its previous run details remain in the meta-store, and the problem is amplified by ad-hoc triggers, where run counts can grow even larger. That is where we start hitting performance issues on the Airflow UI, and perhaps in scheduling too, because of slow result retrieval from MySQL (a few badly formulated queries end up using large IN clauses).

We are thinking of exposing a policy at the workflow/cluster level that restricts the number of runs preserved in the meta-store for a workflow. A couple of things would be required for this:

1. *Define what counts as an "older run"* - Should it be time-bound (by exposing a variable like maxOlderRunsInDays), or a cap on the number of runs (by exposing a variable like keepMaxTotalRuns), beyond which each new dag_run creation archives the oldest existing run, or a function of both? I guess the cap on the number of runs is the most important control, but the time-bound option could be sensible too, depending on the use case.

2. *Archive the older runs' data*, perhaps in the same meta-store, with archival tables having the same schema as the models but more relaxed constraints. For time-bound archival, say a 30-day history policy, we would need a process within Airflow or outside it (maybe a DAG with periodic runs that archives older runs; a rough sketch is attached below my signature). If we restrict by number of runs instead, it is probably easier to add a provision in the Airflow code to handle this in the dag_run creation path (also sketched below).

Though we haven't formally benchmarked the observed slowness yet, we plan to, since it will help us decide what limits to apply to the system. Would be happy to hear from the community how they feel about this problem.

Regards,
Vardan Gupta
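
A minimal sketch of the number-of-runs idea, just to make point 1 concrete. It assumes a hypothetical archive table dag_run_archive that mirrors the dag_run schema (minus the unique constraints), and keep_max_total_runs stands in for the proposed keepMaxTotalRuns policy variable; nothing here is an existing Airflow facility:

from airflow.models import DagRun
from airflow.utils.db import provide_session
from sqlalchemy import text


@provide_session
def archive_surplus_runs(dag_id, keep_max_total_runs, session=None):
    """Move all but the newest keep_max_total_runs runs of a DAG into
    dag_run_archive. Meant to be invoked from the dag_run creation path."""
    # Runs beyond the newest keep_max_total_runs, i.e. the oldest ones.
    surplus_ids = [
        row.id for row in session.query(DagRun.id)
        .filter(DagRun.dag_id == dag_id)
        .order_by(DagRun.execution_date.desc())
        .offset(keep_max_total_runs)
    ]
    if not surplus_ids:
        return 0
    # Copy each surplus row into the archive table, then drop the originals.
    # Assumes dag_run_archive has the same column order as dag_run.
    session.execute(
        text("INSERT INTO dag_run_archive SELECT * FROM dag_run WHERE id = :id"),
        [{"id": run_id} for run_id in surplus_ids],
    )
    session.query(DagRun) \
        .filter(DagRun.id.in_(surplus_ids)) \
        .delete(synchronize_session=False)
    return len(surplus_ids)

The related task_instance and log rows would need the same treatment; they are omitted here to keep the sketch short.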

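And a minimal sketch of the time-bound variant as a periodic maintenance DAG, assuming the same hypothetical dag_run_archive table; the DAG id, schedule, and the MAX_OLDER_RUNS_IN_DAYS constant (standing in for the proposed maxOlderRunsInDays policy) are all illustrative:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import DagRun
from airflow.operators.python_operator import PythonOperator
from airflow.utils import timezone
from airflow.utils.db import provide_session
from sqlalchemy import text

MAX_OLDER_RUNS_IN_DAYS = 30  # stand-in for the proposed maxOlderRunsInDays


@provide_session
def archive_old_runs(session=None):
    """Move dag_run rows older than the retention window into dag_run_archive."""
    cutoff = timezone.utcnow() - timedelta(days=MAX_OLDER_RUNS_IN_DAYS)
    # Copy the old rows into the archive table first...
    session.execute(
        text("INSERT INTO dag_run_archive "
             "SELECT * FROM dag_run WHERE execution_date < :cutoff"),
        {"cutoff": cutoff},
    )
    # ...then remove them from the live table.
    session.query(DagRun) \
        .filter(DagRun.execution_date < cutoff) \
        .delete(synchronize_session=False)


with DAG(dag_id="metastore_runs_archiver",
         start_date=datetime(2019, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    archive_task = PythonOperator(
        task_id="archive_old_dag_runs",
        python_callable=archive_old_runs,
    )

Since provide_session runs the copy and the delete in one session, the live and archive tables stay consistent with each other if either statement fails.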