In the case of big downloads the client already knows that the files 
required for BIG_WORKUNIT will not be ready for another hundred years 
or so ("forever" here could be a large constant, or simply "more than 
x% of the client's work-fetch period"). If it then runs dry it could 
easily fetch work from another project to fill the waiting time.

So...:

if (out of work && current downloads are big and will take forever) {
     fetch work from a project that is not already downloading
} else {
     be happy
}
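To make the pseudocode a little more concrete, here is a minimal sketch. These names do not exist in the actual BOINC client; the function, parameters, and the "longer than the whole work-fetch period" threshold are all illustrative assumptions:

```cpp
#include <cassert>

// Illustrative sketch only -- not real BOINC client code.  The idea:
// an in-flight download only counts as "work on hand" if its estimated
// remaining transfer time fits within the work-fetch period; otherwise,
// with nothing runnable, the client should be allowed to ask another
// project for work.
bool should_fetch_elsewhere(
    double buffered_runnable_secs,   // runnable work already on hand
    double download_remaining_secs,  // estimate for the slow download
    double work_fetch_period_secs    // the cache setting, in seconds
) {
    bool out_of_work = buffered_runnable_secs <= 0;
    // "takes forever": longer than the whole work-fetch period
    bool download_too_slow =
        download_remaining_secs > work_fetch_period_secs;
    return out_of_work && download_too_slow;
}
```

The threshold could just as well be the "x% of the work-fetch period" variant mentioned above; the point is only that the decision needs an estimate of remaining download time, which the client can derive from transfer progress.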

 > This prevents infinite work fetch,
Good point - the same safeguard holds for the constraint that, when it 
runs dry, the client should only fetch work from projects that aren't 
already downloading.
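The fallback selection under that constraint might look something like the sketch below. Project is an illustrative struct, not the client's real class, and picking by resource share is just one plausible tie-breaker, not the actual policy:

```cpp
#include <vector>

// Illustrative only: pick a fallback project that has no file
// transfers in flight, so one stalled download can't pile up behind
// another.  Highest resource share wins; that choice is an assumption,
// not the client's actual policy.
struct Project {
    bool downloading;       // transfers currently in flight?
    double resource_share;
};

const Project* pick_fallback(const std::vector<Project>& projects) {
    const Project* best = nullptr;
    for (const auto& p : projects) {
        if (p.downloading) continue;  // skip: already downloading
        if (!best || p.resource_share > best->resource_share) best = &p;
    }
    return best;  // nullptr if every project is mid-download
}
```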

This will be especially relevant for the initial download of large VMs 
or similar systems that need a lot of data before they can start up.

As we don't want to harm other projects by keeping clients busy with 
slow downloads, I suggest that we deploy some kind of simple, immediate 
short-term fix, later followed by the in-depth theoretical analysis 
that Paul suggests.

David, you were talking about ways of relaxing the policy?

-- Janus



On 2010-05-11 05:38, Paul D. Buck wrote:
> To my mind one of the first steps might be to consider the discussion 
> JM VII and I had just a week or so ago about the fact that work fetch 
> does not consider "packing" the processing "lanes" when assessing 
> whether there is enough work on hand.  With the multi-core systems we 
> have now, in almost all cases the current mode of linearly adding the 
> task times on hand is clearly not adequate.  For example, I have 
> several CPDN tasks on hand on my Mac, and this bias leads BOINC to 
> think it has enough work when, on an 8-core system, 3 or even 5 CPDN 
> tasks are not sufficient to keep all 8 cores busy.  The essentially 
> linear addition of task times gives the system a false sense of 
> security that it has sufficient work on hand, time-wise, to keep the 
> system busy ... when in fact it does not ...
>
> At any rate, the discussion thread was "Queue Size" of April 17 to 19 
> with John's answer of the 19th probably the best summation of the 
> problem and a solution.
>
> The case of long downloads I have seen recently myself with BURP, 
> where the DL was inching along for several days before failing.  I 
> was fortunate enough that I did not run out of work, but if the queue 
> is too small it is easy to get into trouble.
>
> However, when you add in GPU processing you have a whole 'nother set 
> of issues... especially in "mixed" systems where, as I have, there can 
> be CPU, CUDA, *AND* ATI resources in the same system.  I reported some 
> while ago the issue I saw similar to this where the system would run 
> my GPU queue out and only then begin to search for new work.  When the 
> new queue was filled I would be good to go until the last batch 
> obtained was drained.  **THIS** version of the problem reported below 
> and in my prior report "6.10.32 Idle GPU in dual GPU system" Feb 18 
> may be one and the same...
>
> In the face of multi-project GPU usage we also have the collision of 
> the Strict FIFO rule with project server side instability invalidating 
> effective RS balancing and proper queuing of tasks for server side 
> outages.  I don't think Richard is seeing this because he runs with a 
> fairly large queue size and against projects that normally supply a 
> generous supply of work (SaH and GPU Grid where the problem is more 
> commonly seen with MW vs. Collatz). And I am not sure the system 
> adequately addresses the issues with multi-resource use tasks 
> (Einstein for example and SaH AP tasks for another) where there is 
> simultaneous high usage of a GPU and a CPU core...
>
> But the real solution I feel is otherwise indicated...
>
> I know it is not a popular take on the issue(s), but for the past year 
> or so quite a few of us (myself, Richard, JM VII, and others) have 
> been hitting on the deficiencies of the triad of RR Sim, Resource 
> Scheduler, and Work Fetch modules, attempting to highlight all of the 
> different ways that they are failing to operate correctly.  This 
> latest report by Janus is merely one more issue on that whole pile.
>
> As is Richard's recent post on STD "leaking" and a probable imbalance 
> in RS calculation ... and the many, many others of the past few months ...
>
> Perhaps the real solution would be to go back and review this history 
> and then to attempt to devise a more comprehensive repair strategy and 
> then to do a more fundamental tuning of the triad.  And, in the mean 
> time, try to fix the reporting so that the debug dumps would be more 
> comprehensible ... I, for one, can hardly understand the debug outputs 
> both because of the basic formatting (for which I submitted a 
> suggested change, not yet implemented, my thread "Changeset 21335" on 
> the Alpha list) and for the data content which seems to become more 
> confusing with each revision adding data outputs to each debug 
> statement...
>
> With a more general overhaul of these modules including an update of 
> the reporting of debug data we could then go back to the prior cases 
> and use them to prove that we have in fact cured these issues...  Of 
> course for this to work there has to be a general agreement that 
> fundamental change may be the order of the day, otherwise there may be 
> little point to starting ...
>
> On May 10, 2010, at 9:17 AM, David Anderson wrote:
>
>> What you describe is how things work currently;
>> the work fetch policy treats downloading jobs as if they were downloaded.
>> This prevents infinite work fetch,
>> but as you point out it can lead to processor starvation.
>> If BOINC is to be used for truly data-intensive problems we'll need
>> to address this issue.
>>
>> I can think of some ways to relax the policy,
>> but they're a little tricky.
>> If anyone has a good idea let me know.
>>
>> -- David
>>
>> Janus Kristensen wrote:
>>> Hey all
>>>
>>> I'm forwarding a report concerning the client and possible issues with
>>> lengthy or failing downloads. I haven't had the time to verify it so
>>> just ignore it if this is not (or no longer is) the case.
>>>
>>> Situation:
>>> 1) Client attached to multiple projects
>>> 2) Cache set to X days of work
>>> 3) A project releases workunits with downloads that take around X days
>>> to get (slow, big or failing, the issue is naturally more pressing for
>>> clients with smaller values of X.)
>>> 4) Client is registered to some WUs and starts downloading - the
>>> scheduling mechanism is satisfied
>>> 5) Client eventually runs out of work and stalls until the download
>>> completes or fails entirely.
>>>
>>> Problem:
>>> The client isn't doing anything. CPU utilization is 0%.
>>>
>>> Expected behaviour:
>>> The client would fetch and work on something else while completing the
>>> lengthy download.
>>>
>>> Difficulties:
>>> Detecting the situation and determining the course of action (finding
>>> alternative work).
>>>
>>>
>>> -- Janus
>>>
>>>
>>> _______________________________________________
>>> boinc_dev mailing list
>>> [email protected] <mailto:[email protected]>
>>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>> To unsubscribe, visit the above URL and
>>> (near bottom of page) enter your email address.
>>
>

