[boinc_dev] BOINC servers - scalability under heavy load

Richard Haselgrove Sun, 04 Nov 2012 03:53:04 -0800
The SETI@home servers are undertaking one of their periodic explorations of 
BOINC's boundary conditions, and appear to have discovered a previously-unknown 
positive feedback zone.
 
Taking my host 5828732 as an example:
 
http://setiathome.berkeley.edu/results.php?hostid=5828732
 
At the time of writing, that link shows 228 tasks in progress, but the computer 
beside me shows no SETI tasks at all. Every one of the 228 has been lost in 
transmission, with all recent RPCs (except 'report only') having ended in a 
timeout.
 
Whether this is due to network congestion or slow server assembly of the reply 
message, I'll leave to the forensic analysts to discover.
 
I'm more worried about the positive feedback loop - or vicious circle, as it is 
otherwise known.
 
Looking at the list of 228 tasks notionally 'in progress', the final 20 are 
timestamped - out of sequence - 4 Nov 2012 | 8:30:48 UTC. That's what I would 
expect to see after a 'resend lost results' event, and I would expect that 
datestamp to increment every time the host attempts a work fetch, with the 
resending of lost tasks taking precedence over the allocation of new work.
 
But since 08:30, the host has been allocated
 
08:59:01 UTC - 40 tasks
09:36:57 UTC - 19 tasks
09:44:17 UTC - 44 tasks
10:10:39 UTC - 48 tasks
 
The vast majority of these tasks appear to have been created by the workunit 
generator just seconds before being allocated to my host.
 
SETI's workunit generators ('splitters') are normally inhibited at a high water 
mark of around 300,000 'Results ready to send'. But with extra results being 
allocated to hosts, we are way below inhibition levels. Work continues to be 
generated at ~30 tasks per second.
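
To make the inhibition rule concrete, here is a minimal sketch of a high water mark gate. It is not taken from the BOINC or SETI@home server code: the function names, and the idea that the gate looks only at the 'Results ready to send' count, are assumptions for illustration. The 300,000 and ~30 tasks per second figures are the approximate ones quoted above.

```python
# Hypothetical sketch of a high-water-mark gate; not actual BOINC code.
HIGH_WATER_MARK = 300_000  # approximate 'Results ready to send' limit


def splitter_may_run(ready_to_send: int,
                     high_water_mark: int = HIGH_WATER_MARK) -> bool:
    """Return True while the ready-to-send queue is below the limit."""
    return ready_to_send < high_water_mark


def step(ready_to_send: int, create_rate: int, alloc_rate: int) -> int:
    """Advance the ready-to-send queue by one second."""
    created = create_rate if splitter_may_run(ready_to_send) else 0
    # Allocated results leave the queue whether or not the reply
    # actually reaches the host.
    return max(0, ready_to_send + created - alloc_rate)
```

With creation and allocation both at 30 per second, `step(100, 30, 30)` returns 100: the queue never approaches the limit, so a gate of this kind never closes.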
 
With the results being allocated to hosts, the nominal 
'Results out in the field' has grown above 10,500,000 - 50% higher than any 
normal 'steady state' level. Yet volunteers report that their hosts, like mine, 
are receiving few or zero new task allocations.
 
Unless some way can be found to inhibit work generation when task allocation 
messages fail to reach their intended recipients - which the 'lost task' 
mechanism seems to be failing to do, just at the moment - the database is going 
to grow unboundedly, server RPC response times will increase (causing even more 
host requests to time out), and the whole system will eventually fall over.
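
The loop described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of the real scheduler: `simulate`, `loss_percent` and `in_field_cap` are hypothetical names, delivered tasks are assumed to be returned as fast as they are received, and lost tasks are assumed to stay in the database for the whole run. The optional cap stands in for "some way to inhibit work generation"; it is not an existing BOINC option.

```python
# Toy model of the feedback loop; all names and the cap are hypothetical.
def simulate(seconds, loss_percent, in_field_cap=None,
             create_rate=30, alloc_rate=30, high_water=300_000):
    """Return (ready_to_send, excess_in_field) after `seconds` steps."""
    ready = 0             # 'Results ready to send'
    excess_in_field = 0   # growth of 'Results out in the field' above baseline
    for _ in range(seconds):
        below_high_water = ready < high_water
        below_cap = in_field_cap is None or excess_in_field < in_field_cap
        if below_high_water and below_cap:
            ready += create_rate          # splitter is not inhibited
        allocated = min(ready, alloc_rate)
        ready -= allocated
        # Replies that time out leave tasks recorded against the host
        # although the host never received them.
        lost = allocated * loss_percent // 100
        excess_in_field += lost
    return ready, excess_in_field


no_loss = simulate(3600, loss_percent=0)
heavy_loss = simulate(3600, loss_percent=90)
capped = simulate(3600, loss_percent=90, in_field_cap=10_000)
```

In this model the ready-to-send queue stays at zero in every case, so the high water mark alone never stops generation. With 90% of replies lost, the excess grows by 27 tasks per second without limit, and it stops growing only when something other than the ready-to-send count gates the splitter.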