jm7
[email protected] wrote on 04/28/2009 12:33:56 PM:

> From: Martin <[email protected]>
> To: BOINC Developers List <[email protected]>
> Subject: Re: [boinc_dev] 6.6.20 and work scheduling
>
> >> There appears to be a phenomenal amount of effort both in programming
> >> and in scheduler CPU time in trying to meet exactly all deadlines down
> >> to the last millisecond and for all eventualities.
> >
> > At least not exceed the deadlines in as many cases as possible.
>
> Agreed.
>
> Also, no need to frantically check all scheduling calculations for every
> change of system state.
>
> What does it matter if we work to a granularity of once per TSI period?

Sometimes we do, sometimes we don't. If a task ends 20 seconds into a Task
Scheduling Interval, we would get beaten up if we allowed the CPU to remain
idle for the next hour. The Task Scheduling Interval is user settable, and
there is no upper limit. Of course, if it is too large, unexpected things
start to happen.

> Do projects junk results that are 1 second beyond the deadline? (If so,
> then I'll bet that WU result transfer and validate time isn't allowed
> for...)

That depends on the project. No, they are not (they used to be, but that
part of the safety margin was removed).

> To follow KISS, the deadlines enforced by the project servers must be
> 'soft' and allow for "deadline + 10 * client_default_TSI".

Since the TSI is user settable, and some projects have real-world
deadlines, this is NOT going to happen. Some projects have deadlines that
need to be shorter than 10 * the default, or 10 hours. I believe that Paul
has it set to 6 hours.

> >> In other words, move to a KISS solution?
> >
> > As long as it is not too simple. Sometimes the obvious simple solution
> > does not work.
>
> A good simple solution is designed very cleverly to be inherently robust.

True enough, but it is possible to make it too simple and not take enough
boundary cases into consideration.

> >> New rule:
> >>
> >> If we are going to accept that the project servers are going to be or
> >> can be unreasonable, then the client must have the option to be
> >> equally obnoxious (but completely honest) and reject WUs that are
> >> unworkable, rather than attempting a critical futility (and failing
> >> days later).
> >>
> >> Add a bit of margin and then you can have only the TSI, and user
> >> suspend/release as your scheduler trigger events. The work fetch and
> >> send just independently keeps the WU cache filled but not overfilled.
> >
> > Tasks come in discrete units, they do not come in convenient sized
> > packages. CPDN can run for months on end, and because of this, it was
>
> That's fine. Big WUs have a long deadline.

Usually. Tell that to my computers that are having trouble getting AP done
by the deadline. Sometimes the deadline is not enough longer.

> The only 'problem' there is that a CPDN WU will block all other projects
> on a single CPU core system.

There are still a LOT of single core systems running BOINC. BOINC can also
decide to download one CPDN task per core because CPDN has the highest LTD.
So, even on a multi-core system it is possible to have a long block of
CPDN-only work.
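To make that concrete, here is a minimal sketch of the kind of debt-driven
choice involved. The names and structure are invented for illustration;
this is not the actual client code. The point is simply that the fetchable
project with the highest LTD gets the whole work request, so one project
can fill the buffer by itself:

    // Illustrative sketch only -- invented names, not the real client code.
    #include <vector>

    struct Project {
        double ltd;         // long-term debt: higher means owed more CPU
        bool   can_fetch;   // not backed off, not suspended, etc.
    };

    // The fetchable project with the highest LTD gets the work request.
    // If that project is CPDN, the entire request (possibly one task per
    // core) can be filled by CPDN alone, giving a long CPDN-only block
    // even on a multi-core host.
    Project* choose_project_for_fetch(std::vector<Project>& projects) {
        Project* best = nullptr;
        for (auto& p : projects) {
            if (!p.can_fetch) continue;
            if (!best || p.ltd > best->ltd) best = &p;
        }
        return best;
    }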
> To overcome that, perhaps we need to change the cache semantics from
> that of having one absolute cache into which WU times are accumulated,
> to the semantic of where the cache is proportionately divided amongst
> all the active projects for a host. The cache holds a minimum of work
> for each of the projects in proportion to the user set resource share
> for each project.

Sort of already done. If a project is not overworked (too low an LTD), it
will get some fraction of the total queue. Of course, the total queue has
to be maintained at a minimum level for those with network connections
that are not always on. And there is the fact that tasks come as discrete
packets, so overfilling a project's share of the queue is very common.
Underfilling due to lack of work is also common.

> The TSI period then swaps the work as expected.
>
> >> Immediately junk unstarted WUs for resend elsewhere if deadline
> >> trouble ensues, rather than panic to try to scrape them in late.
> >>
> >> That will also implement a natural feedback loop for project admins to
> [...]
>
> > Servers treat aborted tasks as errors. Your daily quota is reduced by
> > one for each one of these. This leads to the problem:
> >
> > Request a second of work from project A.
> > Run into deadline trouble.
> > Abort the task.
> > Since A is STILL the project with the highest LTD: Request a second of
> > work from project A.
> > Run into deadline trouble.
> > Abort the task.
> > And repeat to show an implementation bug...
>
> This is where the *client must refuse to download the WU* in the first
> place.

The client has ONE connection to contact the server for work. More
connections will cause stress on some project servers. You are right that
there is a bug here. It is, however, a design flaw in your design.

> That is, client requests work, server offers something, client refuses
> and goes into a backoff period before asking again.
>
> Upon later requests, either the server will have something more
> reasonable to offer, or the client will be farther away from deadline
> problems.
>
> (That will also save bandwidth over futilely downloading WUs and then
> junking them in any case.)

Please note that running in EDF starts before a task is absolutely known
to be late, in order to prevent late work. Normally, there is no need to
junk work as it is downloaded.

> > We have already seen this in s...@h where there is a custom application
> > that cherry picks CUDA tasks (rejecting the ones that are known to take
> > a very long time on CUDA). This has driven the daily quota of some
> > people well below what their computer can actually do. We do not want a
> > repeat of that intentionally in the client.
>
> That's where the source design must be fixed before it gets fixed for us
> by the users (participants) in other ways...
>
> Allowing a large granularity in the scheduling eases a lot of the
> deadline criticalities and special cases to be moot.

No, it makes the scheduling criticalities worse, as the computation
deadline has to be moved earlier.

> There also needs to be some form of hysteresis ("2 * TSI period"?) in
> moving between 'relaxed scheduling' and EDF 'panic'.

It is not always possible, and the next TSI should do most of the time; no
need for the time after next.

I would like to see a separation between starting in on EDF and an
immediate pre-empt. Starting in on EDF would typically (but not always) be
done at the next task switch. This would involve having two different
tests: one for "EDF required", and one for "preempt immediately required".
Of course, the second need only be run if the first test is positive.
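Roughly along these lines (a sketch only; the names, structure, and
thresholds are invented here to show the shape of the two tests, not taken
from the client source):

    // Rough sketch of the two-test idea -- invented names and thresholds,
    // not the real client code.
    #include <vector>

    struct Task {
        double deadline;        // seconds from now
        double remaining_cpu;   // estimated CPU seconds still needed
    };

    // Test 1: will some task miss its deadline under relaxed (round-robin)
    // scheduling?  If so, start scheduling in EDF order, normally at the
    // next task switch.
    bool edf_needed(const std::vector<Task>& tasks, double load_factor) {
        for (const auto& t : tasks) {
            // load_factor > 1 accounts for the task only getting a share
            // of the CPU under round-robin
            if (t.remaining_cpu * load_factor > t.deadline) return true;
        }
        return false;
    }

    // Test 2: only evaluated when test 1 is positive.  Is some deadline so
    // tight that even waiting for the next task switch (one TSI) risks
    // missing it?  If so, preempt immediately.
    bool preempt_now(const std::vector<Task>& tasks, double tsi_seconds) {
        for (const auto& t : tasks) {
            if (t.deadline - t.remaining_cpu < tsi_seconds) return true;
        }
        return false;
    }

The cheap "EDF required" check runs routinely; the disruptive "preempt
immediately" check (and the preempt itself) only happens when the first
check says there is trouble.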
> Regards,
> Martin
>
> --
> --------------------
> Martin Lomas
> m_boincdev ml1 co uk.ddSPAM.dd
> --------------------
