Re: [boinc_dev] 6.6.20 and work scheduling

John Sanborn Mon, 27 Apr 2009 15:48:07 -0700

Just trying to do some forward thinking here in conceptual terms.  Let's assume 
both processing speed and CPUs are doubling in time factor X.

Start with an average machine from the past: a single processor that hit a 
'testable event' (defined any way you want) every ten minutes (600 seconds).
A current machine with dual processors would now hit an event on average every 
2.5 minutes (150 seconds) (2x as fast processor = 5 min / 2 processors = 2.5 min)
A future system with quad processors would hit an event on average every 37.5 
seconds (4x as fast processor = 2.5 min / 4 processors = 0.625 min)
And on into the future ... (and that doesn't even bring in those who like to be 
on the bleeding edge of the latest & greatest ... who BTW would be the first to 
see what us average Joes will be seeing the near future)

So if my concepts are right, there is an exponential shrinkage in the time 
between events, and at some point the law of diminishing returns begins to kick 
in.  Even if the check takes only a fraction of a second, at some point there 
will be machines where at least one of its processors will be hitting it all 
the time.
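
The arithmetic above can be sketched in a few lines. This is only an
illustration of the reasoning in this message: the function names, the
600-second baseline, and the 0.1-second cost per check are assumptions chosen
for the example, not measurements from any BOINC client.

```python
def event_interval(base_seconds, generation):
    """Average seconds between events for the whole machine.

    Assumes, as in the message above, that per-processor speed and
    processor count both double each generation, starting from one
    processor at generation 0.
    """
    speedup = 2 ** generation      # each processor is this many times faster
    processors = 2 ** generation   # and there are this many of them
    return base_seconds / (speedup * processors)


def first_saturated_generation(base_seconds, check_cost_seconds):
    """First generation where events arrive at least as fast as one check runs."""
    generation = 0
    while event_interval(base_seconds, generation) > check_cost_seconds:
        generation += 1
    return generation


# Single, dual, and quad processor machines from the example above.
intervals = [event_interval(600, g) for g in range(3)]
# intervals is [600.0, 150.0, 37.5]
```

Under these assumptions the interval shrinks by a factor of four per
generation, so even a check costing a tenth of a second is reached within a
handful of doublings.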

How best to deal with it?  I don't know, I'm no programmer or systems designer, 
but fixing the problem with the checking routine will certainly help.  But I 
think that will only delay the point at which we have to skip checks and 
'assume' that nothing has changed in the last X minutes/ seconds/ nano-seconds 
(pick your own time-frame).

John

----- Original Message ----
From: "[email protected]" <[email protected]>
To: Paul D. Buck <[email protected]>
Cc: TarotApprentice <[email protected]>; BOINC dev 
<[email protected]>; [email protected]
Sent: Monday, April 27, 2009 12:33:51 PM
Subject: Re: [boinc_dev] 6.6.20 and work scheduling

There is a history of the messages kept in stdoutdae.txt.  And you can
increase or decrease the size of the history by using flags in
cc_config.xml.

You have yet to come up with a single good reason why the frequency of
calls to the check is a problem.  You keep complaining about the task
switches that happen too frequently, and you keep stating that if the test
were slowed down that would fix the problem.  This is a complete
non-sequitur as far as everyone else can determine.  You may be able to see
this as a solution, nobody else can with the information you have provided.

I believe that part of the confusion is that there are two distinct tests
that are run at different times.

1)  What should we be working on if we did a task switch now? This detects
cases where we need to trigger a task switch immediately because of a
potential missed deadline.  This has many triggers that are typically
spaced well apart.  File Download Complete, Server RPC complete, Detach,
Project Suspend, Project Resume, Task Suspend, Task Resume, Task Complete,
Task Abort, X time after the last previous event...  None of these happen
that often.
2)  Should we do a task switch now?  This one gets run extremely frequently
as one of the triggers is a checkpoint.  The enforcement routine is then
supposed to check to see if there are any tasks that have gone over their
time segment and can be swapped out normally.  The checkpoint trigger was
put in place so that tasks would not lose a large amount of processing time
due to being swapped out just before a checkpoint was to occur.  Another
trigger is the "what should we be working on now" detecting a deadline
problem.
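
As a rough illustration of the split described above, the two tests might be
modelled like this. It is a sketch under stated assumptions, not the BOINC
client's code: the class, function, and trigger names are all hypothetical, and
the deadline check is reduced to a simple earliest-deadline-first ordering.

```python
from dataclasses import dataclass

# Hypothetical trigger names based on the list above; not BOINC identifiers.
RESCHEDULE_TRIGGERS = {
    "file_download_complete", "server_rpc_complete", "detach",
    "project_suspend", "project_resume", "task_suspend", "task_resume",
    "task_complete", "task_abort", "periodic_timeout",
}
ENFORCE_TRIGGERS = {"checkpoint", "deadline_problem"}


@dataclass
class Task:
    name: str
    deadline: float              # seconds from now
    remaining: float             # estimated seconds of work left
    ran_this_slice: float = 0.0  # seconds run in the current time segment


def what_should_run(tasks, now=0.0):
    """Test 1: order tasks and report whether any deadline is at risk."""
    ordered = sorted(tasks, key=lambda t: t.deadline)
    at_risk = any(now + t.remaining > t.deadline for t in ordered)
    return ordered, at_risk


def should_switch_now(running, time_slice, deadline_problem):
    """Test 2: switch if a deadline is at risk or the time segment is used up."""
    return deadline_problem or running.ran_this_slice >= time_slice


def handle_event(event, tasks, running, time_slice):
    """Dispatch one event to whichever test it triggers."""
    if event in RESCHEDULE_TRIGGERS:
        _, at_risk = what_should_run(tasks)
        return should_switch_now(running, time_slice, at_risk)
    if event in ENFORCE_TRIGGERS:
        return should_switch_now(running, time_slice, event == "deadline_problem")
    return False
```

In this model a checkpoint only causes a switch when the running task has
already used its time segment, while a rescheduling event can force a switch
early if a deadline looks likely to be missed.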

What we need to do is to figure out which of these two is causing the
problem.

I am open to hearing about problems with the algorithm; however, you keep
hammering on about one place that makes no sense at all.

jm7

From:     "Paul D. Buck" <p.d.b...@comcast.net>
Sent by:  boinc_dev-bounces@ssl.berkeley.edu
Date:     04/27/2009 02:30 PM
To:       [email protected]
cc:       TarotApprentice <[email protected]>, BOINC dev
          <[email protected]>, [email protected]
Subject:  Re: [boinc_dev] 6.6.20 and work scheduling

On Apr 27, 2009, at 10:14 AM, [email protected] wrote:

> As long as you insist on talking about the frequency of the test,
> people
> are going to be ignoring you.  Please start talking about what is
> wrong
> with the test itself.  Fixing the test will fix the problem.  No
> amount of
> tinkering with the frequency of the test is going to fix the problem.
>
> The project came and went already.  It was doing document indexing.
> The
> runtimes were very short, but the transfer times were killing it.

Which proves my point.  The deadlines were unrealistic.

Yes, most of the real issue is the rules that make the test
bad.  But that is not the sole problem here.

The frequency means that I cannot help you troubleshoot the rules
because I have hundreds to thousands of calls to the routines that are
just so much wasted time.  This buries the bad calls in so much
garbage that I cannot find the needle that is needed to fix the tests.

And just because you, or even Dr. Anderson, do not think the frequency
of the test is a problem does not mean that it is not a problem.

Which *IS* also one of the problems in the BOINC world.  We ignore
people that ask questions we don't want asked.  We avoid opinions that
don't comport with ours ...

So, we had one project with an unrealistic deadline, we put this rule
into place for it, and because that one project had a mythical need
we now cannot change BOINC for the better?

What is wrong with the test is that we do it too often.  We also use
the wrong driving parameters.  Because we do it so often, and keep no
history, we have instability in the scheduling system, and
pretending that the frequency does not matter is not going to make it
more stable.  Even if you fix the rules, the fact that the client is
recalculating the deadlines every 10 seconds (or less) means that
BOINC is going to change its mind as to what to run.  Because we also
don't enforce TSI ...  This is not a case of fixing one minor buglet
and we are done ...

You cannot, or will not, see the frequency-caused instability unless
you have a system that is both fast and wide.  As best as I can tell
you have neither, nor does UCB, though they are welcome to drive over
anytime to look at mine (2 hours or so from UCB, and I will buy lunch
and pay for the gas).

Ok, say we fix all the other problems but still check every 10 seconds on
which tasks to run.  Suppose we do not enforce TSI (enforcing it would
mean that you cannot switch a task out until it has completed its TSI or
ended, a rule you also say should not be enforced).  Now assume that I
have a batch of tasks from one project, all with roughly the same
deadlines.  Well, are we not going to enforce "keep work mix
interesting"?  Then I am going to run that as a big batch, which will
cause task abandonment ... oh, and because we are keeping the
event-driven basis, the task I started because a task ended
is still going to be superseded by another task when the upload
ends ... leading to more tasks abandoned partly done ...
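
To make the TSI argument concrete, here is a minimal sketch of the rule as it
is stated in this message (a task cannot be switched out until it has
completed its TSI or ended). The names `RunningTask`, `may_preempt`, and
`events_causing_switch` are hypothetical, and the sketch deliberately ignores
deadline emergencies, which is the case the other side of this thread wants to
keep.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    name: str
    started_at: float      # when the current time slice began, in seconds
    finished: bool = False


def may_preempt(task, now, tsi_seconds):
    """Strict TSI rule: a task may be replaced only once it has ended
    or has run for at least one full task switch interval."""
    return task.finished or (now - task.started_at) >= tsi_seconds


def events_causing_switch(task, event_times, tsi_seconds):
    """Return the event times at which a switch would be allowed."""
    return [t for t in event_times if may_preempt(task, t, tsi_seconds)]
```

With a 3600-second TSI, events arriving at 10, 20, or 3599 seconds into a
slice would leave the running task alone; only events at or after 3600
seconds could replace it.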

Essentially you want to fix the problem without changing any of the
drivers of the problem... one of which is the event-based triggers ...
which happen far too often... and pretending that they don't won't
make it less of a problem.

So, ignore me some more if you want, why not, everybody else does...
that still does not mean that I am wrong ... I first reported this problem
in 2005 or thereabouts ... it is still a problem ... and it will
continue to be a problem unless you stop clinging to the "I don't think
doing it once a second is a problem, so it cannot be a problem"
mindset.  Even if we change the rules to better ones, fast systems that
run the tests so often are still going to be unstable.
BOINC ignores history ... that and fast repeats of any test are a
recipe for instability ...

I agree that changing the frequency is not going to solve this, but
maybe it will allow me to help provide the data so we can solve the
rest of the problems.  And changing the frequency of the tests will
make the system a little less unstable.  Oh, and save compute time.
Oh, one more thing, running the test every 60 seconds with a 2 minute
task means I would still make the deadlines... no need to run the test
RIGHT NOW ... 30 seconds later would not be a killer ... even for a
mythical need ...
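
The proposal above amounts to rate-limiting the test. A minimal sketch of that
idea follows, assuming a hypothetical `ThrottledCheck` wrapper and a stand-in
check function; none of these names come from the BOINC source, and the
60-second limit is simply the figure used in the paragraph above.

```python
class ThrottledCheck:
    """Run a check at most once per `min_interval` seconds."""

    def __init__(self, check, min_interval):
        self.check = check
        self.min_interval = min_interval
        self.last_run = None       # time of the last check actually run
        self.calls_run = 0
        self.calls_skipped = 0

    def on_event(self, now):
        """Called on every trigger; runs the check only if enough time passed."""
        if self.last_run is not None and now - self.last_run < self.min_interval:
            self.calls_skipped += 1
            return None
        self.last_run = now
        self.calls_run += 1
        return self.check(now)


# One trigger every 10 seconds for 2 minutes, check limited to once a minute.
throttle = ThrottledCheck(check=lambda now: f"checked at {now}", min_interval=60)
results = [throttle.on_event(t) for t in range(0, 120, 10)]
```

In this run twelve triggers produce two actual checks, at 0 and 60 seconds.
Whether skipping the other ten is safe depends on the deadlines involved,
which is the point under dispute in this thread.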
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.

