Log files with no comment and no attempt at finding the problem are not
useful. I did read through one of your log files in excruciating detail,
only to discover that the only problem area fell in the period when debug
logs were turned off. I am NOT spending a couple of hours doing that
again.
Yes, you keep giving reasons for increasing the time between the tests.
NOBODY ELSE CAN UNDERSTAND THEM.
jm7
"Paul D. Buck"
<p.d.b...@comcast.net>
Sent by: boinc_dev-bounces@ssl.berkeley.edu
To: [email protected]
cc: TarotApprentice <[email protected]>, BOINC dev <[email protected]>
Date: 04/27/2009 05:20 PM
Subject: Re: [boinc_dev] 6.6.20 and work scheduling
On Apr 27, 2009, at 12:33 PM, [email protected] wrote:
> There is a history of the messages kept in stdoutdae.txt. And you can
> increase or decrease the size of the history by using flags in
> cc_config.xml.
And you don't like 8 M log files... I have not changed the limits nor
needed to yet. Does not matter, you won't look in the logs either
because you know it is a waste of time.
> You have yet to come up with a single good reason why the frequency of
> calls to the check is a problem. You keep complaining about the task
> switches that happen too frequently, and you keep stating that if the test
> were slowed down that would fix the problem. This is a complete
> non-sequitur as far as everyone else can determine. You may be able to see
> this as a solution, nobody else can with the information you have
> provided.
Actually I have given several. But, if you don't READ what I write
you won't see them.
I have NEVER said that making this one change WILL CURE ALL OF THE
PROBLEM. On that we agree.
From the message you replied to, I quote:
> I agree that changing the frequency is not going to solve this, but
> maybe it will allow me to help provide the data so we can solve the
> rest of the problems.
READ the whole message.
Not just the parts you agree with, or want to dispute.
> I believe that part of the confusion is that there are two distinct tests
> that are run at different times.
>
> 1) What should we be working on if we did a task switch now? This detects
> cases where we need to trigger a task switch immediately because of a
> potential missed deadline. This has many triggers that are typically
> spaced well apart. File Download Complete, Server RPC complete, Detach,
> Project Suspend, Project Resume, Task Suspend, Task Resume, Task Complete,
> Task Abort, X time after the last previous event... None of these happen
> that often.
WRONG!
11:35 4-26 to 1:40 4-27, is what, 26 hours (1560 minutes):
File download complete: 328 occurrences
Server RPC complete: 871 occurrences
Task Resume: 149 Occurrences
Starting Task: 120 Occurrences
Computation Completed: 117
I add just those up to 1585 ... or more than one a minute. And I will
note that my system is running quiet without FreeHAL, VP, or MW or
actually a couple other projects that would raise those numbers
significantly. Add in checkpoint triggers and I am back to insanity.
What I see is that you are looking at a slower system and extending
your concepts. I am looking at a real system. Please stop telling me
to not believe the report of my lying eyes because you can see my
system better than I can.
I am sure that I missed some triggers because I only had on sched op
debug and only for part of the time. I will save the log, though
there is no way I am going to attempt to trim it. If you don't want
to believe my numbers I will send the log and you can count them. If
you don't like long logs, well, let's cut out the unneeded tests so we
are doing real work and the logs contain real information and not
spurious nonsense.
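
For anyone who does want to count the events in a saved log, a minimal
sketch of a tally follows. The substrings in PATTERNS are assumptions made
for illustration only; the client's actual message wording may differ, so
they would need to be adjusted against a real stdoutdae.txt.

```python
from collections import Counter

# Hypothetical substrings for each event type. These are illustrative
# guesses, not confirmed client messages; adjust them to the real log.
PATTERNS = {
    "download_complete": "Finished download",
    "rpc_complete": "Scheduler request completed",
    "task_resume": "Resuming task",
    "task_start": "Starting task",
    "task_done": "Computation for task",
}


def count_events(lines, patterns=PATTERNS):
    """Count how many lines contain each pattern's substring."""
    counts = Counter({label: 0 for label in patterns})
    for line in lines:
        for label, needle in patterns.items():
            if needle in line:
                counts[label] += 1
    return counts


# Usage, assuming the log has been saved as stdoutdae.txt:
#     with open("stdoutdae.txt", errors="replace") as log:
#         print(count_events(log))
```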
By the way, the 1585 added to the 1560 means that I am doing this test
just based on these numbers once every 30 seconds. Add in the
checkpoints and I am back to my earlier reports of once every 10
seconds or so.
And, I think that it is also triggered on file uploads... if so,
another 293 events.
And were I running SaH instead of GPU Grid it would be worse by as much as
an order of magnitude. In place of 20 tasks I would have completed 416
more tasks as a conservative estimate (416 starts, 416 ends, 416 more
downloads, how many more RPCs?), or another 1248 events.
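
The arithmetic above can be checked with a short sketch. The event counts
are the ones quoted in this message. Reading "the 1585 added to the 1560" as
a once-a-minute timed check counted on top of the event triggers is an
assumption about what was meant, and the variable names are only for
illustration.

```python
WINDOW_MINUTES = 26 * 60  # the 26-hour window, 1560 minutes

# Event counts as reported in the message.
events = {
    "file download complete": 328,
    "server RPC complete": 871,
    "task resume": 149,
    "starting task": 120,
    "computation completed": 117,
}

total_events = sum(events.values())             # 1585
events_per_minute = total_events / WINDOW_MINUTES

# Assumption: one timed check per minute, on top of the event triggers.
periodic_checks = WINDOW_MINUTES                # 1560
total_tests = total_events + periodic_checks    # 3145
seconds_between_tests = WINDOW_MINUTES * 60 / total_tests

# If file uploads also trigger the test, add the reported 293 events.
uploads = 293
seconds_with_uploads = WINDOW_MINUTES * 60 / (total_tests + uploads)

# SaH estimate: a start, an end and a download for each of 416 extra tasks.
extra_sah_events = 3 * 416                      # 1248

print(f"{total_events} events, {events_per_minute:.2f} per minute")
print(f"one test every {seconds_between_tests:.1f} s")
print(f"one test every {seconds_with_uploads:.1f} s counting uploads")
```

Checkpoint triggers are not included because no count for them is given.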
> 2) Should we do a task switch now? This one gets run extremely frequently
> as one of the triggers is a checkpoint. The enforcement routine is then
> supposed to check to see if there are any tasks that have gone over their
> time segment and can be swapped out normally. The checkpoint trigger was
> put in place so that tasks would not lose a large amount of processing time
> due to being swapped out just before a checkpoint was to occur. Another
> trigger is the "what should we be working on now" detecting a deadline
> problem.
>
> What we need to do is to figure out which of these two is causing the
> problem.
>
> I am open to hearing about problems with the algorithm, however, you keep
> hammering on about one place that makes no sense at all.
And you keep ignoring what I am saying.
Changing the frequency in and of itself will not make the systems more
stable. Not changing it will not eliminate the instability either.
Or to put it another way, fix all the rules and run this nonsense
this often and the systems like mine will still be unstable.
But if we reduce the unneeded work maybe I can start to find things of
meaning in the logs.
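
To make "reduce the unneeded work" concrete, here is a minimal sketch of
one way to cut the frequency without dropping any trigger: events only mark
the test as pending, and the test itself runs at most once per minimum
interval. This is not BOINC code. The class, its method names and the
60-second default are hypothetical, chosen only to illustrate the idea.

```python
class ThrottledCheck:
    """Coalesce triggers so the check runs at most once per min_interval."""

    def __init__(self, check, min_interval=60.0):
        self.check = check                # callable that performs the test
        self.min_interval = min_interval  # seconds between runs, at minimum
        self.pending = False              # a trigger arrived since last run
        self.last_run = None              # time of last run, None if never

    def trigger(self):
        """Record that an event happened; does not run the check itself."""
        self.pending = True

    def poll(self, now):
        """Run the check if one is pending and the interval has elapsed.

        Returns True if the check ran on this call.
        """
        if not self.pending:
            return False
        if self.last_run is not None and now - self.last_run < self.min_interval:
            return False
        self.pending = False
        self.last_run = now
        self.check()
        return True
```

With this shape, a burst of downloads, RPCs and checkpoints inside one
minute produces a single run and a single block of log output, and a
trigger is never lost, only delayed by up to min_interval seconds.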
I have a more formal proposal as to the contributing factors and the
basic approach I would take.
And just because we can do something does not mean we SHOULD do
something. Even if we are doing it correctly. As I recall you don't
want to run tests I think are needed because you don't want to waste
time. So, why are you so eager to waste time here? Regardless of how
small or large, it is still unnecessary and a waste.
> On Apr 27, 2009, at 10:14 AM, [email protected] wrote:
>
>> As long as you insist on talking about the frequency of the test, people
>> are going to be ignoring you. Please start talking about what is wrong
>> with the test itself. Fixing the test will fix the problem. No amount of
>> tinkering with the frequency of the test is going to fix the problem.
>>
>> The project came and went already. It was doing document indexing. The
>> runtimes were very short, but the transfer times were killing it.
>
> Which proves my point. The deadlines were unrealistic.
>
> Yes, most of the real issue is the rules that make the test
> bad. But that is not the sole problem here.
>
> The frequency means that I cannot help you troubleshoot the rules
> because I have hundreds to thousands of calls to the routines that are
> just so much wasted time. This buries the bad calls in so much
> garbage that I cannot find that needle that is needed to fix the
> tests.
>
> And, just because you, or even Dr. Anderson, do not think the
> frequency of the tests is a problem does not mean that it is not a
> problem.
>
> Which *IS* also one of the problems in the BOINC world. We ignore
> people that ask questions we don't want asked. We avoid opinions that
> don't comport with ours ...
>
> So, we had a project with an unrealistic deadline, we put this rule
> into place because that one project had a mythical need, and that
> means we now cannot change BOINC for the better?
>
> What is wrong with the test is that we do it too often. We also use
> the wrong driving parameters. Because we do it so often, and keep no
> history, we have instability in the scheduling system and no,
> pretending that the frequency does not matter is not going to make it
> more stable. Even if you fix the rules the fact that the client is
> recalculating the deadlines every 10 seconds (or less) means that
> BOINC is going to change its mind as to what to run. Because we also
> don't enforce TSI ... This is not a simple one minor buglet and we
> are done ...
>
> You cannot, or will not, see the frequency-caused instability unless
> you have a system that is both fast and wide. As best as I can tell
> you have neither, nor does UCB, though they are welcome to drive over
> anytime to look at mine (2 hours or so from UCB, and I will buy lunch
> and pay for the gas).
>
>
>
> Ok, we fix all other problems but still check every 10 seconds on
> which tasks to run. If we do not enforce TSI, meaning, you cannot
> switch a task out until it has completed its TSI or ended (a rule you
> also say should not be enforced), that means that assuming that I have
> a batch of tasks that are from a project, all have roughly the same
> deadlines, well are we not going to enforce "keep work mix
> interesting"? Then I am going to run that as a big batch which will
> cause task abandonment ... oh, and because we are keeping the event
> driven basis that means that the task I started because a task ended
> is still going to be superseded by another task when the upload
> ends ... leading to more tasks abandoned partly done ...
>
> Essentially you want to fix the problem without changing any of the
> drivers of the problem... one of which is the event-based triggers ...
> which happen far too often... and pretending that they don't won't
> make it less of a problem.
>
> So, ignore me some more if you want, why not, everybody else does...
> still does not mean that I am wrong ... I first reported this problem
> in 2005 or thereabouts ... it is still a problem ... and it will
> continue to be a problem unless you stop clinging to "I don't think
> doing it once a second is a problem, so it cannot be a problem"
> mindset. Even if we change the rules to better ones, fast systems that
> run the tests so often are still going to be unstable.
> BOINC ignores history ... that and fast repeats of any test is a
> recipe for instability ...
>
> I agree that changing the frequency is not going to solve this, but
> maybe it will allow me to help provide the data so we can solve the
> rest of the problems. And changing the frequency of the tests will
> make the system a little less unstable. Oh, and save compute time.
> Oh, one more thing, running the test every 60 seconds with a 2 minute
> task means I would still make the deadlines... no need to run the test
> RIGHT NOW ... 30 seconds later would not be a killer ... even for a
> mythical need ...
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
>
>
>