Scheduler loop times are definitely a concern (at least for Airbnb), and +1
for option 2 as well if it can be implemented correctly. What is important
for me is that in the event-based model we should always be able to easily
tell which of the dependencies are met and which aren't.

On Fri, Jun 3, 2016 at 5:53 PM, Chris Riccomini <[email protected]>
wrote:

> Hey Bolke,
>
> > Are scheduler loop times a concern at all?
>
> Yes, I strongly believe that they are. Especially as we add more
> DAGs/tasks.
>
> I am not a fan of (1). Caching is just going to create cache consistency
> issues, and be really annoying to manage, IMO.
>
> I agree that (2) seems more appealing. I can't comment on the feasibility
> of it, as I'm not well acquainted enough with the scheduler yet.
>
> Cheers,
> Chris
>
> On Fri, Jun 3, 2016 at 2:26 PM, Bolke de Bruin <[email protected]> wrote:
>
> > Hi,
> >
> > I am looking at speeding up the scheduler. Currently loop times increase
> > with the number of tasks in a dag. This is due to
> > TaskInstance.are_dependencies_met executing several aggregation functions
> > on the database. These calls are expensive: between 0.05-0.15s per task,
> > and for every scheduler loop this gets called twice. This call is where
> > the scheduler spends around 90% of its time when evaluating dags and is
> > the reason people with a large number of tasks per dag see quite large
> > loop times (north of 600s).
> >
> > I see 2 options to optimize the loop without going to a multiprocessing
> > approach, which would just push the problem down the line (i.e. to the
> > db, or to when you don’t have enough cores anymore).
> >
> > 1. Cache the call to TI.are_dependencies_met, either by caching the
> > result in something like memcache or by removing the need for the double
> > call (update_state and process_dag both make the call to
> > TI.are_dependencies_met). This would more or less cut the time in half.
> > A rough sketch of the caching idea follows below this list.
> >
> > 2. Notify the downstream tasks of a state change of an upstream task.
> > This would remove the need for the aggregation as the task would just
> > ‘know’. It is a bit harder to implement correctly as you need to make
> > sure you stay in a consistent state. Obviously you could still run an
> > integrity check once in a while. This option would make the aggregation
> > event based and significantly reduce the time spent here to around 1-5%
> > of the current scheduler. There is a slight overhead added at a state
> > change of the TaskInstance (managed by the TaskInstance itself). A rough
> > sketch of this event-based bookkeeping also follows below.
> >
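To illustrate option 1, here is a minimal caching sketch in Python. It is not
Airflow code: the DependencyCache wrapper and its method names are
hypothetical, and the real are_dependencies_met call may take extra
arguments; the point is simply that the expensive aggregation runs at most
once per task per scheduler loop.

    # Minimal sketch of option 1 (hypothetical, not actual Airflow code):
    # memoize the expensive dependency check for the duration of a single
    # scheduler loop so that update_state and process_dag only hit the
    # database once per task instead of twice.
    class DependencyCache:
        def __init__(self):
            self._cache = {}  # (dag_id, task_id, execution_date) -> bool

        def are_dependencies_met(self, ti):
            key = (ti.dag_id, ti.task_id, ti.execution_date)
            if key not in self._cache:
                # Fall through to the real (expensive) aggregation query.
                self._cache[key] = ti.are_dependencies_met()
            return self._cache[key]

        def clear(self):
            # Call at the start of every scheduler loop so cached results
            # never outlive one pass over the DAGs.
            self._cache.clear()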
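And a rough sketch of option 2, again hypothetical rather than actual
Airflow code: EventedTaskInstance, the pending_upstream counter and the
registry dict are made-up names used only to show how an upstream state
change could push the information downstream, so that are_dependencies_met
becomes a counter comparison instead of an aggregation query.

    # Minimal sketch of option 2 (hypothetical, not actual Airflow code):
    # each task instance tracks how many upstream dependencies are still
    # unmet; the upstream instance updates that counter on its own state
    # change, and a periodic integrity check can repair any drift.
    class EventedTaskInstance:
        def __init__(self, task_id, upstream_ids, downstream_ids):
            self.task_id = task_id
            self.upstream_ids = set(upstream_ids)
            self.downstream_ids = set(downstream_ids)
            self.pending_upstream = len(self.upstream_ids)
            self.state = None

        def set_state(self, new_state, registry):
            # registry maps task_id -> EventedTaskInstance for one dag run.
            # In a real implementation this update would have to happen in
            # the same database transaction as the state change itself to
            # stay consistent.
            old_state, self.state = self.state, new_state
            if new_state == "success" and old_state != "success":
                for task_id in self.downstream_ids:
                    registry[task_id].pending_upstream -= 1

        def are_dependencies_met(self):
            # No aggregation query: the task already "knows".
            return self.pending_upstream == 0

        def integrity_check(self, registry):
            # The occasional consistency check mentioned in option 2:
            # recompute the counter from scratch and repair any drift.
            recomputed = sum(
                1 for tid in self.upstream_ids
                if registry[tid].state != "success"
            )
            drifted = recomputed != self.pending_upstream
            self.pending_upstream = recomputed
            return drifted

Note the sketch only handles transitions into success; a transition away
from success (e.g. a cleared and rerun task) would have to increment the
counters again, which is exactly the kind of consistency the periodic
integrity check is meant to guard.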
> > What do you think? My preferred option is #2. Am I missing any other
> > options? Are scheduler loop times a concern at all?
> >
> > Thanks
> > Bolke
> >
> >
> >
>
