Hey Bolke,

Thanks for being so diligent with this. I think this work is critical for the project. Looking forward to a much more stable scheduler.

Cheers,
Chris

On Wed, Jun 1, 2016 at 3:13 AM, Jeremiah Lowin <[email protected]> wrote:

> Just to be clear, this is a highly unlikely event. I used to have a unit
> test for it but got rid of it when we closed bugs that made it impossible
> to cause such a crash deterministically. So this situation is possible but
> almost certainly won't manifest.
>
> On Wed, Jun 1, 2016 at 4:00 AM Bolke de Bruin <[email protected]> wrote:
>
> > Hey,
> >
> > This is to give a heads up that I am planning to merge #1514, the
> > refactor of process_dag, today. This is the second step in executing on
> > the scheduler roadmap. It has been running in our production for a week
> > now with no functional differences. Scheduler loop times start a bit
> > higher, but have a lower max. The number of connections to the database
> > is around 1/3 of the previous scheduler's (test dag went from 150
> > connections to 50). Database load is slightly lower.
> >
> > While the refactor fixes many issues (race conditions), a corner case
> > mentioned by Jeremiah is now present: a TI is sent in SCHEDULED state to
> > the executor. If the executor fails to load the TI, the TI might be
> > orphaned forever. As fixing the corner case will require further
> > fundamental changes, we discussed that it should be addressed in a
> > follow-up patch.
> >
> > My planned next steps are:
> > 1) reduce scheduler loop time to around 1s by making task reporting
> >    “event driven”
> > 2) auto-align start date
> > 3) add notion of “previous” to dagrun
> > 4) fix corner case mentioned above
> >
> > - Bolke
