Nataneljpwd commented on PR #64294:
URL: https://github.com/apache/airflow/pull/64294#issuecomment-4596382759

   > I'm not convinced that this is the right fix. Tuning and configuring the 
scheduler is already nigh on impossible; I am wary of adding more.
   
   We have tried tuning both the scheduler count and the max dagruns per loop 
to schedule, and each time we hit a different issue. I understand the concern, 
but we run this change locally and it fixed our problem.
   
   The main problem is that dagruns are created in batches on our clusters, 
sometimes very large batches, and a new dagrun is heavier to process than a 
running one. This is mainly because the scheduler has to create the tasks for a 
new dagrun when it starts, whereas a running dagrun only occasionally creates 
new tasks (once tasks finish). Meanwhile, other dagruns are not moved to 
running because the scheduler is busy processing large batches of new dagruns.
   
   We tried increasing the scheduler count quite a bit, in addition to 
increasing the max dagruns per loop to schedule. That caused scheduler 
heartbeat timeouts, as we had a lot of runs with mapped tasks, so we also had 
to increase the heartbeat timeout configuration to a very large value (10 
minutes). On top of that, dagruns timed out because they were not examined, and 
we even saw in the Gantt chart large pauses between tasks where no task 
existed. As a result, dagruns can cause other dagruns to miss their SLA. Even 
if I just create a medium backfill alongside my regular dags, which include 
mapped tasks, I get one of the issues stated above when I increase the number 
of examined dagruns.
   
   > Additionally, couldn't the already existing max active runs controls be 
used here? That would keep most of the dagruns in the Queued state, meaning the 
scheduler only looks at at most 16 (by default, I think) newly created runs and 
massively reduces the impact of "cause the scheduler to stall, as it has to 
both examine a lot of dagruns, and create new tasks for those dagruns." as it 
doesn't do that. That is why DagRuns can exist in the queued state.
   
   As stated above, we tried tuning it: we changed it to around 300 and even 
tripled the scheduler count, yet we still experienced the issue for both 
batch-triggered runs and large backfills. We even tried dividing the batch size 
several times over (to spread the runs more evenly). The scheduler either got a 
lot of queued dagruns and would never finish the batch, or, when it was able to 
finish the batch, it was restarted quite often due to not emitting a heartbeat 
and failing the readiness probe, hitting an OOM, or other dagruns timing out 
(which is why we didn't increase the number beyond 300).
   
   > Did you try this existing tunable first?
   
   As stated above, yes, we have. I am pretty sure we tried all the related 
configurations, as I have gone over all of the scheduler configurations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
