Hey,

This is a heads-up that I am planning to merge #1514, the refactor of process_dag, today. This is the second step in executing on the scheduler roadmap.

It has been running in our production environment for a week now with no functional differences. Scheduler loop times start a bit higher, but have a lower max. The number of connections to the database is around 1/3 of what the previous scheduler used (the test DAG went from 150 connections to 50). Database load is slightly lower.

While the refactor fixes many issues (race conditions), a corner case mentioned by Jeremiah is now present: a task instance (TI) is sent to the executor in SCHEDULED state, and if the executor then fails to load the TI, the TI might be orphaned forever. As fixing this corner case will require further fundamental changes, we discussed that it should be addressed in a follow-up patch.

My planned next steps are:

1) reduce scheduler loop time to around 1s by making task reporting “event driven”
2) auto-align start date
3) add notion of “previous” to dagrun
4) fix the corner case mentioned above

- Bolke
