Nataneljpwd commented on PR #64294: URL: https://github.com/apache/airflow/pull/64294#issuecomment-4596382759
> I'm not convinced that this is the right fix. Tuning and configuring the scheduler is already nigh on impossible I am wary of adding more.

I understand the concern. We have tried tuning both the scheduler count and the max dagruns per loop to schedule, and each time we hit a different issue. We run this change locally, and it fixed our problem.

The main problem is that dags are created in batches on our clusters, sometimes very large batches. A new dagrun is heavier to process than a running one, mainly because the scheduler has to create its tasks when it starts, whereas a running dagrun only occasionally creates new tasks (once tasks finish). Meanwhile, dagruns are not moved to running because the scheduler is busy processing large batches of new dagruns.

We tried increasing the scheduler count quite a bit, and also increasing the max dagruns per loop to schedule. That caused scheduler heartbeat timeouts, since we had a lot of runs with mapped tasks, so we also had to raise that configuration to a very large value (10 minutes). On top of that, dagruns timed out because they were not being examined, and in the Gantt view we even saw large pauses between tasks during which no task existed. As a result, dagruns could cause other dagruns to miss their SLA. Even if I only create a medium backfill alongside my regular dags, which include mapped tasks, I hit one of the issues above as soon as I increase the number of examined dagruns.

> Additionally, couldn't the already existing max active runs controls be used here? That would keep most of the dagruns in the Queued state, meaning the scheduler only looks at at most 16 (by default I think) newly created runs and massively reduces the impact of "cause the scheduler to stall, as it has to both examine a lot of dagruns, and create new tasks for those dagruns." as it doesn't do that. That is why DagRuns can exist in the queued state.
As stated above, we tried tuning it: we set it to around 300 and even tripled the scheduler count, yet we still hit the issue for both batch-triggered runs and large backfills. We even tried dividing the batch size several times over (spreading the runs more evenly). Either the scheduler received a lot of queued dagruns and never finished the batch, or, when it did manage to finish the batch, it was reset quite often because it stopped emitting a heartbeat and failed the readiness probe, ran out of memory (OOM), or other dagruns timed out (which is why we didn't increase the number beyond 300).

> Did you try this existing tunable first?

As stated above, yes. I am fairly sure we tried all the related configurations, as I have gone over all of the scheduler configurations.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
