I have read the code in great detail.  The first loop is an attempt to
initialize a variable to a known state.  The state is changed later as
needed.
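
The shape of it is roughly this (a sketch with stand-in names, not the
actual client code):

#include <cstddef>
#include <vector>

// Minimal stand-ins for the client's structures; the real code and names
// differ, this only shows the shape of the two passes.
enum SchedState { SCHED_PREEMPTED, SCHED_SCHEDULED };

struct Task {
    bool running;
    SchedState next_state;
};

void plan_schedule(std::vector<Task>& tasks,
                   const std::vector<Task*>& run_list) {
    // Pass 1: force every running task into a known default state
    // ("will be preempted unless something below says otherwise").
    for (size_t i = 0; i < tasks.size(); i++) {
        if (tasks[i].running) tasks[i].next_state = SCHED_PREEMPTED;
    }
    // Later passes: the tasks chosen to run have the default overridden.
    for (size_t i = 0; i < run_list.size(); i++) {
        run_list[i]->next_state = SCHED_SCHEDULED;
    }
    // Only tasks still marked SCHED_PREEMPTED at the end are actually
    // preempted.
}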

jm7


                                                                           
             "Paul D. Buck"                                                
             <p.d.b...@comcast                                             
             .net>                                                      To 
                                       BOINC Developers Mailing List       
             04/30/2009 12:05          <[email protected]>        
             PM                                                         cc 
                                       David Anderson                      
                                       <[email protected]>,           
                                       [email protected]              
                                                                   Subject 
                                       Re: [boinc_dev] 6.6.20 and work     
                                       scheduling                          
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           





On Apr 30, 2009, at 5:48 AM, [email protected] wrote:



      jm7


            1) We do it too often (event driven)
      Exactly what we are not listening to.  The rate of tests is NOT the
      reason
      for incorrect switches.

No, but it is the reason we have difficulty finding them.  And it is a
source of instability.  John, you can stick your head in the sand all you
want; it will not make the problems go away because you refuse to see them.

Also, because you calculate "globally", and because you recalculate based
on the situation as it is NOW, the universe is different every time you
recalculate; the situation evolves if for no other reason than that work is
done in the meantime.  If you are doing that 10 times a minute you are going
to get 10 different answers.  Those answers MAY be close enough that no
change is needed under the rules as they stand, but coupled with the other
limitations this is an issue.

I did show a very specific example of this effect: task A completes, B
starts, A's upload completes, B is suspended, C is started; another task D
completes, E is started, D's upload completes, E is suspended and F is
started.

On the last item I will remind you once again: I may not be able to walk
straight anymore, and I sometimes have trouble talking, but I am a trained
and skilled systems engineer.  This is what I used to do.  I know I cannot
put my finger on a line in a log to convince you or anyone else, but this
IS a problem.  It is a problem because it loads up the logs with unneeded
entries, and it is also a cause of some of the instability we see.

Anyone who works with unstable systems knows that bumping an unstable
system causes problems; the more you bump it, the faster those problems
arise.

            2) All currently running tasks are eligible for preemption
      Not completely true, and not the problem.  Tasks that are not in the
      want-to-run list are preemptable; tasks that are in the want-to-run
      list are preemptable too, but should only be preempted if either the
      task is past its TSI, or there is a task with deadline trouble
      (please work with me on the definition of deadline trouble).

Which means you have not looked at the code.  The first loop in the code
marks the next state of ALL running tasks as preempted.  Dr. Anderson made
a change that was supposed to cure that, but it does not.

            3) TSI is not respected as a limiting factor
      It cannot be in all cases.  There may be more cases where the TSI
      could be honored.

For the reason above, it is not honored at all.  I have pointed to the
block of code where all tasks are marked for preemption, and that, my
friend, means that TSI is not considered at all ...

Again, you are thinking in terms of single stream systems, and on those I
agree that this is the case.  On multi-core systems it is much less of an
issue, to the point where it might never be an issue at all.

8 Core system
all tasks running are 8 Hours in length

Average time between task completions: 1 hour

Assuming that the system has been running for a while, that is what
statistics tell me.
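
The arithmetic behind that, spelled out (a toy calculation that assumes
completions are evenly staggered; the widths are just examples):

#include <cstdio>

int main() {
    const double task_hours = 8.0;                 // assume every task runs 8 hours
    const int widths[] = { 1, 2, 4, 8, 16, 24 };   // processing elements in parallel
    // With staggered start times, in steady state something finishes on
    // average every task_hours / width hours.
    for (int i = 0; i < 6; i++) {
        printf("%2d wide: a task completes about every %.2f hours\n",
               widths[i], task_hours / widths[i]);
    }
    return 0;
}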

With the mix of task lengths I see on my systems the situation is usually
much better than that.  See the numbers below.  In one of my first posts I
actually listed the numbers of tasks and the run times ... but the numbers
below are illustrative enough.

            4) TSI is used in calculating deadline peril
      And it has to be.  Since tasks may (or may not) be re-scheduled at all
      during a TSI, and the TSI may line up badly with a connection, the TSI
      is an important part of the calculation.

      Example:
      12 hour TSI.
      1 hour CPU time left on the task.
      12 hours and 1 second left before deadline.
      No events for the next 12 hours.
      Without TSI in the calculation, there is the distinct possibility that
      no deadline trouble is recorded.
      Wait 12 hours.
      You now have 1 second wall time left and 1 hour CPU time left.  Your
      task is now late.

      With TSI in the calculation:
      Deadline trouble is noted at the point 12 hours and 1 second before
      deadline (if not somewhat earlier, depending on other load).  The task
      gets started and completes before deadline.

Proving once again that you are thinking of systems that run a single
processing stream.  I suppose that you forgot my last test, where you did
not want to read the numbers.  Or the test before that.  In the first test
the average time between task completions was 6 minutes (measured over 24
hours); in the other test there were:

Request CPU reschedule (handle_finished_apps):   3    11    14    22    19

Those counts are over a three-hour period, for a 4, 4, 8, 4, and 8 CPU
system respectively.  They mean that the gap between one completed task and
the next was at worst 60 minutes and at best about 8 minutes (6 minutes in
the first test).  Your theory falls apart because when the next task
completes, the pending task can be picked up and scheduled next.
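
Spelled out, the intervals are just the division from the counts above:

#include <cstdio>

int main() {
    const double window_minutes = 180.0;             // the three-hour observation window
    const int reschedules[] = { 3, 11, 14, 22, 19 }; // the per-host counts above
    for (int i = 0; i < 5; i++) {
        printf("host %d: one completion-driven reschedule about every %.0f minutes\n",
               i + 1, window_minutes / reschedules[i]);
    }
    return 0;
}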

We are not talking about scheduling problems on single core systems.  It
would be nice if you would keep that in mind.  We are talking about
parameters for controlling the scheduling that were developed on single
thread systems being inappropriate on multi-core systems.

            5) Work mix is not kept "interesting"
            6) Resource Share is used in calculating run time allocations
      A simulation that tracks what the machine is likely to actually do
      has to track what happens based on resource share.  It may not want
      to be the trigger for instant preemption though.

Sadly it does do that right now: it triggers preemption at the slightest
breeze.  Last night I had 5 uFluids tasks all running in parallel because
the scheduler decided that the deadline of 5/13 could not be met.  It ran
those tasks for several hours before I suspended most of them.  Later it
suspended the one it was still running, and late last night I unsuspended
all of them again.  They are STILL waiting to be restarted.  Because they
have deadlines that are close, the mechanisms used to "globally" calculate
will always select these tasks in batches and screw up the work mix, which
means that my i7 runs in a mode that is significantly less efficient.

This is also why I have proposed other metrics and rules for making these
decisions, to reduce how much Resource Share drives the selection process.

            7) Work "batches" (tasks with roughly similar deadlines) are
            not "bank
            teller queued"
      I really don't understand this one.  A bank teller queue means that
      tasks
      come from one queue and are spread across the available resources as
      they
      become available.  Are they always run in FIFO?  No.  However, that
      does
      not mean that they are not coming from the same queue.

Probably because you keep refusing to read what I write carefully.  See the
example above.  If you schedule "globally", as you so love to do, then tasks
with close deadlines and relatively low Resource Shares will always cause
these panics.  I get them for IBERCIVIS, VTU, and just recently uFluids.

            8) History of work scheduling is not preserved and all deadlines
            are calculated fresh each invocation.
      Please explain why this is a problem.  The history of work scheduling
      may have no bearing on what has to happen in the future.

See above.  It also leads to other instabilities that you don't want to
recognize.  When I re-enabled the uFluids tasks that were such a cause for
panic yesterday, it sure would seem that they should be a cause for panic
today.  I have an NQueens task that was suspended yesterday with 12 minutes
to run and it still has not restarted.  If it was so important to run it up
to that point yesterday, why, 24 hours later, has BOINC been running off
tasks from projects that it has just downloaded work from, work that has
later deadlines?

            9) True deadline peril is rare, but "false positives" are common
      Methods that defer leaving RR for a long time will increase true
      deadline peril.  What is needed is something in between.

Again, the systems of which we speak tend to be completing tasks fast
enough that this argument makes no sense.  With resources coming free in
minutes, on average, there is no chance that this is going to be as common
as you posit.  Again and again, you are thinking of the old slow systems,
and when you refuse to consider the evidence that people like Richard and
me supply, well ...

I know it is harder to see on a 4 core system.  I did notice these issues
in 2005, after I had gotten my first 4 CPU system (the first two systems in
the test above), but you can see it if you watch the patterns of operation.

            10) Some of the sources of work peril may be caused by a
            defective work fetch allocation
      Please give examples from logs.

I don't have to.  You have described over and over again why every
suggested change cannot work because of these very issues.  Go back and
look at your examples.  Virtually all your examples involve BOINC
downloading work that all of a sudden causes this magical situation where I
have to madly start processing the new work because BOINC fetched something
that causes the world to change.  Ergo, if BOINC had not fetched that work,
the problem would not have occurred and the universe would not be ending.

Even so, many of those examples of panics are still modeled on only having
a single stream of work processing.

            11) Other factors either obscured by the above, I forgot them,
            or maybe nothing else ...

                  work-fetch decisions

            Seems to be related to:

            1) Bad debt calculations
            2) Asking for inappropriate work loads
            3) asking for inappropriate amounts
      Please give examples.

I have, any number of times.

I could send you another long log showing that the CUDA debt is slowly
building, and that in another 24 hours or so it is going to be so far out
of whack that the client will stop asking for work from GPU Grid, the only
project from which GPU work can be fetched.  Meanwhile BOINC is still
happily ignoring all evidence to the contrary, trying to get CUDA work from
every other project in the universe and pouting because it cannot get it.
There is the Rosetta guy who cannot get a queue full of Rosetta work
because of the opposite problem (he is only attached to GPU Grid and
Rosetta), and there are Richard's logs where he needs one class of work in
one part and the work fetch asks for the wrong kind of work.
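
To make the shape of that concrete, here is a toy model (the update rule,
the numbers, and the names are mine for illustration, not the client's
actual debt code):

#include <cstdio>

int main() {
    // Five attached projects with equal resource shares, but only
    // project 0 (think GPU Grid) can actually supply CUDA work.
    const int n = 5;
    double debt[5] = { 0, 0, 0, 0, 0 };
    for (int hour = 0; hour < 48; hour++) {
        const double entitlement = 1.0 / n;          // each project's "fair" GPU hour
        for (int p = 0; p < n; p++) {
            double gpu_hours_done = (p == 0) ? 1.0 : 0.0;  // only project 0 runs CUDA
            debt[p] += entitlement - gpu_hours_done;       // the debts drift apart
        }
    }
    // Project 0 ends deeply negative, while projects that can never supply
    // CUDA work keep climbing.
    for (int p = 0; p < n; p++) {
        printf("project %d CUDA debt after 48 hours: %+.1f\n", p, debt[p]);
    }
    return 0;
}

The real calculation is more involved, but the drift is the point: the one
project that can supply CUDA work ends up the least likely to be asked for
it.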

Others have mentioned this before, but the next issue is where I ask for 1
second of work and instead of getting one task I get 10 or more.  This is a
long-standing problem and the issue is on the server end, but it is still a
problem.

            4) Design of client / server interactions
      There are design constraints that limit the transaction to one round
      trip.

Actually they are design choices.  And they may or may not be the best
choices.  One of the recent examples and questions was why we feed the list
of tasks up to the server each time.  Another design choice.  The server is
supposed to use that information to make a good choice about what work to
feed down.  If I understand the other proposal made recently, changes could
be made to this exchange that might be beneficial.  Changes which you have
also rejected out of hand.


                  bad debt calculation

            Seems to be related to:

            1) Assuming that all projects have CUDA work and asking for it
            2) Assuming that a CUDA only project has CPU work and asking
            for it.
            3) Not necessarily taking into account system width correctly
      I don't understand what you mean by system width.

More modern systems are faster; they are also "wider", with more processing
units.  My i7 has 12: 8 virtual CPUs and 4 GPU engines.  I am actively
considering a system with 16 CPUs and room for as many as 6 or 8 GPU cores,
which could bring that number up to 24 elements.  As I have been struggling
to get across that this changes the way work can be processed, I have been
using this term a lot.  Which tells me yet again that you have not actually
been reading carefully what I have been writing.

I know it is a PITA to read things carefully, but I am not wordy out of
spite; I am wordy to be as clear as possible.  Skimming proposals looking
only for reasons to reject them is not actually that helpful.

            4) Not taking into account CUDA capability correctly

                  efficiency of the scheduling calculations (if it's an
                  issue)

            It is, but you and other nay-sayers don't have systems that
            experience the issues, so you and others denigrate or ignore
            the reports.
      Fix the algorithm FIRST, optimize SECOND.

Reducing the hit rate is not intended to optimize anything.  Sadly this is
a point that I know I will never be able to prove to your satisfaction, and
it is apparent that I cannot explain it well, though I have tried very hard
to do so.  But even with a perfect rule set, the system will retain the
characteristic of instability if we keep calling the scheduler at times
when there is no specific need.  I get why some of those calls are made,
but the way we proceed from there is the secondary cause.

And when I suggest that there may not be specific needs, you produce
examples, time and again, where work is downloaded, and you cannot quite
grasp that in most cases we could wait 30 seconds before checking how the
schedule might be affected by this new work; instead you insist that the
world is magically better if I check it instantaneously.  With no evidence,
I might add.  Even your defunct project with 5 minute deadlines would only
be affected if the tasks took 4 minutes and 59 seconds ... which means they
would also blow the deadlines because of the latency in uploads and
downloads.  If the task were a reasonable 1 minute in length then the only
effect of waiting 30 seconds to schedule it would be to trim the margin
slightly.

But the more cogent point is that you are offering a straw-man argument
using a project that essentially collapsed because it had unreasonable
requirements.  So why are we coding BOINC to handle unreasonable
requirements from a project that does not exist anymore?  That is a poser I
cannot fathom.

The fact that reducing the call rate also increases efficiency is a nice
side effect.  But it is not the reason I have proposed it, and I wish you
would stop pretending that it is.

In either case, the two main reasons to reduce the call rate are:

a) to lower the log clutter
b) to reduce the rate of false changes so they are easier to identify
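
To be concrete about what "slowing this down" could look like, here is a
sketch (the 30 second window and every name in it are mine, not existing
client code):

#include <ctime>

// Coalesce reschedule requests: events only mark the schedule dirty, and a
// single deferred pass covers however many events arrived in the window.
// A flagged deadline miss is the one thing allowed to skip the wait.
struct RescheduleDebounce {
    bool   dirty = false;
    time_t last_pass = 0;
    double min_interval = 30.0;      // seconds between full scheduling passes

    void request() { dirty = true; }         // called from every event source

    bool due(time_t now, bool deadline_miss_flagged) const {
        if (!dirty) return false;
        if (deadline_miss_flagged) return true;
        return difftime(now, last_pass) >= min_interval;
    }

    void ran(time_t now) { dirty = false; last_pass = now; }
};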

Your intransigence on this matter is nothing short of amazing.  You
complain about the large logs that obscure the very problems we are hunting
and yet denigrate the one way we can start to get a handle on that very
issue.

            The worse point is that identifying some of the problems
            requires logging, and because we do resource scheduling, for
            example, so often, the logs get so big they are not usable,
            simply because we are performing actions that ARE NOT NECESSARY
            ... because the assumption is that there is no cost.  But here
            is a cost right here.  If we do resource scheduling 10 times
            more often than needed then there is 10 times more data to
            sift.  Which is the main reason I have harped on SLOWING THIS
            DOWN.

            It is also why in my pseudo-code proposal I suggested that we
            do two things: one, make it switchable so that we can start
            with a bare-bones "bank teller" style queuing system, and two,
            only add refinements as we see where it does not work
            adequately.  Let us not add more rules than needed.  Start with
            the simplest rule set possible, run it, find exceptions, figure
            out why, fix those, move on ...
      In other words step back 5 years.  We were there, and we had to add
      refinements to get it to work.

See, that is the way we fixed it then; why are you so resistant to this
approach now?  Back then the most common system was single core, with some
duals.  And, as I point out, that was the time I started to notice these
issues on my 4 core system.  Those issues were not handled back then and
they are worse now ...

So let's try a new mechanism for the wide systems, with as few rules as
possible, and see if it works.  If we can create situations where it starts
to fail, well, then we add complexity.
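
For instance, the bare-bones version need not be much more than this (a
sketch with invented names; refinements only get added when it demonstrably
fails):

#include <deque>

// Bare-bones "bank teller" scheduling: one FIFO queue of ready tasks, fed
// to whichever processing element frees up next.  The only rule beyond
// FIFO is that a task in genuine deadline trouble jumps the line.
struct Task { int id; bool deadline_trouble; };

struct TellerQueue {
    std::deque<Task> ready;

    void add(const Task& t) {
        if (t.deadline_trouble) ready.push_front(t);   // jump the line
        else                    ready.push_back(t);    // otherwise strict FIFO
    }

    // Called whenever a CPU core or GPU engine becomes free.
    bool next(Task& out) {
        if (ready.empty()) return false;
        out = ready.front();
        ready.pop_front();
        return true;
    }
};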

I suspect that many of the rules we have now will not be needed at all.  In
fact, I think that much of the complexity can go away because now we can
make choices that are not at all possible on single processing thread
machines.

      Let us not throw the baby out with the bath water.

If the baby is dead, why not?

The problem is fundamentally that we developed elaborate rules to handle
scheduling on single processing thread machines.  Duals made some of those
rules passé, but the effects were almost unnoticeable.  The effects started
to become visible on 4 core systems and are now quite obvious on wider
systems.

This is one reason why, in my pseudo-code, I suggested that, at least for
the time being, we keep the current scheduler for systems of fewer than 4
cores and try something new on the 4 core and wider systems.
