jm7
[email protected] wrote on 04/29/2009 04:15:50 PM:

> "Paul D. Buck" <[email protected]>
> Sent by: [email protected]
> 04/29/2009 04:15 PM
>
> To: David Anderson <[email protected]>
> cc: [email protected],
>     BOINC Developers Mailing List <[email protected]>
> Subject: Re: [boinc_dev] 6.6.20 and work scheduling
>
> On Apr 28, 2009, at 12:14 PM, David Anderson wrote:
>
> > At this point I'm interested in reports of bad scheduling
> > or work-fetch decisions, or bad debt calculation.
> > When these are all fixed we can address the efficiency
> > of the scheduling calculations (if it's an issue).
>
> Is the point of the exercise to fix these problems?
> Or are you only looking for a band-aid to cover them up?
>
> There is a ton of bad things happening, and they are exaggerated by
> the wider systems. To correctly address these issues, a fundamentally
> new approach needs to be tried (IMO, for what that is worth). There
> is no single line of code to patch to make it all better; the problem
> is a collision between the original design choices and the scaling up
> of systems' capabilities.
>
> > bad scheduling
>
> Is subject to the following:
>
> 1) We do it too often (event driven)

Exactly what we are not listening to. The rate of tests is NOT the
reason for incorrect switches.

> 2) All currently running tasks are eligible for preemption

Not completely true, and not the problem. Tasks that are not in the
want-to-run list are preemptable. Tasks that are in the want-to-run
list should only be preempted if either that task is past its TSI, or
there is a task in deadline trouble (please work with me on the
definition of deadline trouble).

> 3) TSI is not respected as a limiting factor

It cannot be in all cases. There may be more cases where the TSI could
be honored.

> 4) TSI is used in calculating deadline peril

And it has to be.
Since tasks may (or may not) be re-scheduled at all during a TSI, and
the TSI may line up badly with a connection, the TSI is an important
part of the calculation.

Example: 12 hour TSI. 1 hour of CPU time left on the task. 12 hours
and 1 second left before the deadline. No events for the next 12
hours. Without the TSI in the calculation, there is the distinct
possibility that no deadline trouble is recorded. Wait 12 hours. You
now have 1 second of wall time left and 1 hour of CPU time left. Your
task is now late. With the TSI in the calculation, deadline trouble is
noted at the point 12 hours and 1 second before the deadline (if not
somewhat earlier, depending on other load). The task gets started and
completes before the deadline.

> 5) Work mix is not kept "interesting"
> 6) Resource Share is used in calculating run time allocations

A simulation that tracks what the machine is actually likely to do has
to track what happens based on resource share. Resource share may not
want to be the trigger for instant preemption, though.

> 7) Work "batches" (tasks with roughly similar deadlines) are not
> "bank teller queued"

I really don't understand this one. A bank teller queue means that
tasks come from one queue and are spread across the available
resources as they become available. Are they always run in FIFO order?
No. However, that does not mean that they are not coming from the same
queue.

> 8) History of work scheduling is not preserved and all deadlines are
> calculated fresh each invocation.

Please explain why this is a problem. The history of work scheduling
may have no bearing on what has to happen in the future.

> 9) True deadline peril is rare, but "false positives" are common

Methods that defer leaving RR for a long time will increase true
deadline peril. What is needed is something in between.

> 10) Some of the sources of work peril may be caused by a defective
> work fetch allocation

Please give examples from logs.
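The TSI example and the preemption rule above can be put into a small sketch. This is illustrative only, not the BOINC client's actual code; every name in it (`deadline_trouble`, `preemptable`, `RunningTask` and its fields) is invented for the example.

```cpp
#include <cassert>

// Illustrative sketch only -- not the BOINC client's actual code.
// A task is in deadline trouble if, allowing for the possibility that
// it is not rescheduled again for a full task switch interval (TSI),
// it could miss its deadline.
bool deadline_trouble(double cpu_remaining, double until_deadline,
                      double tsi) {
    return cpu_remaining + tsi >= until_deadline;
}

// A running task keeps its CPU unless it was dropped from the
// want-to-run list, it has run past the TSI, or some other task is in
// deadline trouble and needs the CPU now.
struct RunningTask {
    bool in_want_to_run_list;
    double seconds_running;  // since it was last (re)started
};

bool preemptable(const RunningTask& t, double tsi,
                 bool other_in_trouble) {
    if (!t.in_want_to_run_list) return true;
    if (t.seconds_running >= tsi) return true;
    return other_in_trouble;
}
```

With the example's numbers (1 hour of CPU time left, 12 hours and 1 second of wall time, 12 hour TSI), `deadline_trouble` returns false when the TSI term is omitted and true when it is included, which is exactly the difference the example describes.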
> 11) Other factors either obscured by the above, I forgot them, or
> maybe nothing else ...
>
> > work-fetch decisions
>
> Seems to be related to:
>
> 1) Bad debt calculations
> 2) Asking for inappropriate work loads
> 3) Asking for inappropriate amounts

Please give examples.

> 4) Design of client / server interactions

There are design constraints that limit the transaction to one round
trip.

> > bad debt calculation
>
> Seems to be related to:
>
> 1) Assuming that all projects have CUDA work and asking for it
> 2) Assuming that a CUDA-only project has CPU work and asking for it
> 3) Not necessarily taking into account system width correctly

I don't understand what you mean by system width.

> 4) Not taking into account CUDA capability correctly
>
> > efficiency
> > of the scheduling calculations (if it's an issue)
>
> It is, but you and other nay-sayers don't have systems that
> experience the issues, so you and others denigrate or ignore the
> reports.

Fix the algorithm FIRST, optimize SECOND.

> The worst point is that identifying some of the problems requires
> logging, and because we do, for example, resource scheduling so
> often, the logs get so big they are not usable because of their
> sheer size. We are performing actions that ARE NOT NECESSARY,
> because the assumption is that there is no cost. But here is a cost
> right here: if we do resource scheduling 10 times more often than
> needed, then there is 10 times more data to sift. Which is the main
> reason I have harped on SLOWING THIS DOWN.
>
> It is also why in my pseudo-code proposal I suggested that we do two
> things: one, make it switchable so that we can start with a
> bare-bones "bank teller" style queuing system and only add
> refinements as we see where it does not work adequately. Let us not
> add more rules than needed. Start with the simplest rule set
> possible, run it, find exceptions, figure out why, fix those, move
> on ...

In other words, step back 5 years.
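The bare-bones "bank teller" queue discussed above can be sketched as follows, under the assumption that it means a single deadline-ordered queue dispatched to whichever CPU frees up next. `bank_teller_order` and `Task` are invented names for illustration, not client code.

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Illustrative sketch of a bare-bones "bank teller" queue -- not the
// actual client scheduler. All runnable tasks wait in one queue
// ordered by deadline; whichever CPU frees up next takes the head of
// the queue.
struct Task {
    double cpu_needed;  // estimated CPU seconds remaining
    double deadline;    // seconds from now
};

// Returns the indices of `tasks` in dispatch order across `ncpus`.
std::vector<int> bank_teller_order(const std::vector<Task>& tasks,
                                   int ncpus) {
    std::vector<int> idx(tasks.size());
    for (size_t i = 0; i < tasks.size(); ++i) idx[i] = (int)i;
    // One queue, earliest deadline at the head.
    std::sort(idx.begin(), idx.end(), [&](int a, int b) {
        return tasks[a].deadline < tasks[b].deadline;
    });
    // Min-heap of times at which each CPU next becomes free.
    std::priority_queue<double, std::vector<double>,
                        std::greater<double>> free_at;
    for (int i = 0; i < ncpus; ++i) free_at.push(0.0);
    std::vector<int> order;
    for (int i : idx) {
        double t = free_at.top();  // next CPU to come free
        free_at.pop();
        order.push_back(i);
        free_at.push(t + tasks[i].cpu_needed);
    }
    return order;
}
```

For example, three tasks due in 50, 70, and 100 seconds on two CPUs are dispatched earliest deadline first, even though consecutive tasks land on different CPUs as they come free.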
We were there, and we had to add refinements to get it to work. Let us
not throw the baby out with the bath water.

> SO, are you serious, or just trying to get us to go away?
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
