The problem, stated simply, is that BOINC is (unintentionally) a BotNet.
It is a set of clients installed on thousands and thousands of machines,
under the control of a single individual. It can do whatever the BotNet
owner tells it to do (through commands to the command-and-control
server). The trouble is, the only command is "do this work, and return
the result", and at some future time that work has to come back. When it
does, the BOINC client connects back to the server, and if enough clients
connect back all at once, we have a Distributed Denial of Service attack
-- against the project's own server.

I think the solution is to finish the BotNet command set, so that the
BotNet owner can tell the bots to turn off the DoS attack. I think the
best way is to limit connections, on the CLIENT SIDE, by telling the bots
how often they should try to connect. If you have 25 simultaneous
connections, you'll get a certain data rate in and out of the servers. If
you have 50 simultaneous connections, the data rate will either be
proportionally higher or the same. If you have 10,000 simultaneous
connections, the data rate might be effectively zero.

... and if I can get a development environment going, I'll be happy to
contribute the code.
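
Roughly what I have in mind -- just a sketch, with made-up names
(min_connect_interval and so on); the real client and scheduler reply
would carry this differently:

    // Hypothetical sketch: the scheduler reply carries a minimum interval
    // between connection attempts, and the client refuses to contact the
    // project again before that interval has elapsed.

    #include <ctime>

    struct ProjectState {
        double min_connect_interval;   // seconds, tuned by the project server
        double next_connect_time;      // earliest moment we may connect again
    };

    // Called whenever a scheduler reply is parsed.
    void on_scheduler_reply(ProjectState& p, double interval_from_server) {
        p.min_connect_interval = interval_from_server;
    }

    // Called before any upload or report attempt for this project.
    bool may_connect_now(ProjectState& p) {
        double now = (double)time(nullptr);
        if (now < p.next_connect_time) return false;  // stay quiet, don't pile on
        p.next_connect_time = now + p.min_connect_interval;
        return true;
    }

The point is that the spacing is enforced on the client, and the server
gets to choose the interval, so a struggling project can spread its
reconnecting hosts over hours instead of seconds.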

On the upload optimization: on a smaller scale, this is one place where
reshuffling the priorities can bring an automatic reduction in the total
number of connections hitting the server, without having to create a
"command channel" from the servers to the BotNet. The only danger is when
you start skipping uploads -- Murphy's Law says that, left to chance, the
only successful upload will be the one with the farthest deadline.

--
Lynn

Martin wrote:
> In summary:
>
> Always use upload ordering, EDF or RR dependent on whether the last
> _transfer_ was successful;
>
> The present Boinc exponential backoff *maintains* the *overload*
> problem and creates a DDoS;
>
> Possible improvements to the existing backoff scheme are suggested,
> BUT...;
>
> Must /also/ add dynamic data rate limiting client-side to avoid
> saturating a project link's bandwidth;
>
> Overload wastes available bandwidth and worsens/lengthens the "DDoS";
> the overload causes a 'disgraceful' degradation of service.
>
> More detailed comments inlined below:
>
> Regards,
> Martin
>
>
> Lynn W. Taylor wrote:
>> If uploads are working, any order will do. No reason to sort.
>
> Indeed so, assuming no prior interruption and infinitely fast uploads.
>
> However, note that for large uploads over a slow uplink, the upload can
> take many minutes. I have various CPDN tasks with the Boinc upload
> deliberately restricted so as not to interfere with home internet
> usage. A CPDN upload can take well over an hour.
>
> If you have programmed in upload order sorting, you may as well have it
> always enabled. There is then no need for additional checks to decide
> whether or not to use a sorted upload order. Just always do an ordered
> upload.
>
>
>> The only time you need to sort is if you're going to slow down the
>> retries (so you don't keep hammering the servers to death).
>
> Avoid the test for whether to sort or not. Simply always assume the
> worst, and so always use the best possible upload order.
>
>
>> Stop hammering servers, reduce the load, and things will go much more
>> smoothly. Recovery after an outage will be faster.
>
> That is what the exponential backoff *does not* attempt to do.
>
> There is no feedback to dynamically avoid a DDoS. Also, the exponential
> backoff only takes effect long after there are problems, and then it
> *maintains* a level of *overload*! Note that the present backoff
> mechanism will immediately reinstate a DDoS as soon as any overload
> condition is seen to ease, and thereby perpetually re-establishes the
> overload.
>
> I guess this is because the intention of the exponential backoff was to
> allow for complete loss of service when a project goes completely
> offline. The present scheme is not actually designed for network load
> management or service load management.
>
> (I guess Matt at s...@h has been chasing ghosts in the server settings
> when in fact the real problem and control is in the routers at either
> end of their 100Mb/s bottleneck. Lose the ACKs/ICMP on a link and you
> see all manner of weird 'strangeness' that by its very nature you have
> no control over. It all goes 'random'...)
>
>
> Rather than an exponential backoff that gets reset to zero immediately
> upon a successful connection, a bodge-fix could be to keep the present
> exponential backoff but, when a connection succeeds, start a slow
> exponential decay from that backoff time. (Hence the backoff, and the
> low connection-attempt rate, are held for some time to let the overload
> subside.) Also, upload/download only one WU at a time until no further
> connection failures are seen, again to avoid recreating an overload.
>
> (And yes, singular uploads are more expensive than multiple uploads,
> but just one mega-cruncher would swamp and DoS the project otherwise.)
>
> Then eventually return to the normal scheme of uploading as much as
> possible in "one big bang" once the backoff has successfully decayed
> back to zero.
>
> The idea there is that the backoff time is maintained during, and for
> some time after, an overload, so that a project server overload can
> clear without creating a perpetual re-DDoSing.
>
>
> Really, the Boinc server should dynamically retune the client backoff
> parameters for each project to match the expected network load against
> the network bandwidth available.
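
[Sketching what I understand Martin to be suggesting here -- a backoff
that decays after a success instead of snapping back to zero. The names
and constants are invented for illustration, not taken from the real
client:]

    // Hypothetical sketch: hold the backoff after a success and let it
    // decay, instead of resetting straight to "retry as fast as possible".

    struct TransferBackoff {
        double backoff = 0;   // current delay between attempts, in seconds

        void on_failure() {
            // the usual exponential growth, clamped to an arbitrary ceiling
            backoff = (backoff < 60) ? 60 : backoff * 2;
            if (backoff > 4 * 3600) backoff = 4 * 3600;
        }

        void on_success() {
            // decay gently rather than resetting, so a recovering server
            // is not immediately hit by the full retry load again
            backoff *= 0.5;
            if (backoff < 60) backoff = 0;   // fully recovered
        }

        double delay_before_next_attempt() const { return backoff; }
    };

[The halving factor and the ceiling are exactly the sort of knobs the
server could retune per project, as Martin says just above.]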

>> But, if you do slow the client way down, you really don't want to get
>> lucky, hit that one-in-fifty successful upload, and have it upload
>> work that isn't due for weeks, when you've got something that will
>> expire in a few hours.
>>
>> I do see your point. This would work:
>
> Thanks.
>
>
>> Keep a count of failed uploads. A successful retry sets the count to
>> zero, a failure increments it.
>>
>> If the counter reaches two or three, re-sort the work units by due
>> date, and reset the timers so that they run out in deadline order.
>
> Overly complicated. Just always assume the worst: you can't get any
> better than the best ordering. Choose ordered (EDF) if the last
> _transfer_ was a success, round-robin if the last _transfer_ failed.
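
[Again, a sketch only, with hypothetical types and names rather than the
real upload scheduler, to make that rule concrete:]

    // Hypothetical sketch of the ordering rule: earliest-deadline-first
    // while transfers are getting through, plain round-robin while they
    // are failing (so one broken WU cannot starve all the others).

    #include <algorithm>
    #include <vector>

    struct PendingUpload {
        double deadline;     // report deadline, seconds since the epoch
        double last_tried;   // time of the last attempt (round-robin key)
    };

    // last_transfer_ok: did the most recent transfer attempt succeed?
    PendingUpload* pick_next_upload(std::vector<PendingUpload>& q,
                                    bool last_transfer_ok) {
        if (q.empty()) return nullptr;
        if (last_transfer_ok) {
            // EDF: the WU at greatest risk of missing its deadline goes first
            return &*std::min_element(q.begin(), q.end(),
                [](const PendingUpload& a, const PendingUpload& b) {
                    return a.deadline < b.deadline;
                });
        }
        // Round-robin: the least recently attempted WU goes first
        return &*std::min_element(q.begin(), q.end(),
            [](const PendingUpload& a, const PendingUpload& b) {
                return a.last_tried < b.last_tried;
            });
    }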

>> I agree that a saturated link has a much lower data rate, but instead
>> of slowing down after you get a connection, how about slowing down the
>> connections so the link isn't saturated to start with?
>
> That doesn't work in the same way, and that is the problem.
>
> A single connection request can get through on an uncongested link, and
> then be the cause of another +10Mbit/s of uplink data that then
> saturates the link. Note that if I unrestrict the Boinc uplink limit, I
> can overload (DoS) the s...@h link with just one machine!
>
> The s...@h servers can easily add traffic management for themselves for
> their /outgoing/ data to avoid saturating their link (but do Boinc or
> the upload/download servers do that at the moment?). However, they
> cannot directly traffic-manage what gets delayed or dropped for the
> incoming traffic (uploads from clients) to avoid the logjam on the ISP
> side of the link.
>
> Hence, you need to rate-limit the data rate for each individual client
> rather than just hope that you never get more than five or so clients
> connecting simultaneously.
>
> For example, if you have transfers in progress from 10 clients
> simultaneously, you must restrict their data rates to one tenth of the
> link speed. Otherwise, you suffer a 'disgraceful' degradation as data
> packets get dropped/lost and resent, multiple times over...
>
>
> Dropping data packets is always a bad 'last ditch' bodge. It's just
> that a TCP connection hides all the mess apart from the resulting data
> rate slowdown.
>
> Regards,
> Martin
>
>
>> Martin wrote:
>>> David Anderson wrote:
>>>> If you have a build environment, check out the trunk and build.
>>>> Otherwise we'll have to wait for Rom to either backport this to 6.6
>>>> or 6.8.
>>>>
>>>> -- David
>>>>
>>>> Lynn W. Taylor wrote:
>>> [---]
>>>>>>> So, when any retry timer runs out, instead of retrying that WU,
>>>>>>> retry the one with the earliest deadline -- the one at the
>>>>>>> highest risk.
>>>
>>> Beware the case where an upload fails because of some db or storage
>>> problem for a singular WU... One problem WU upload shouldn't cause
>>> all the others to fail.
>>>
>>> I suggest using round-robin ordering when any upload fails (after
>>> all, no WUs are getting through in any case) and earliest deadline
>>> first order only when an upload has succeeded.
>>>
>>>
>>> A second idea is to have the Boinc *client* dynamically reduce its
>>> upload data rate in real time if it detects any lost data packets
>>> (detects TCP resends). The upload is abandoned for a backoff time if
>>> the upload rate drops too low. This is a more sensitive way to avoid
>>> a DDoS on the project servers than just using an arbitrary backoff
>>> mechanism.
>>>
>>> Note that a saturated link has a much LOWER effective bandwidth than
>>> a maximally utilised link.
>>>
>>> Regards,
>>> Martin
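
[One last sketch, of the client-side dynamic rate limiting: cut the
upload's target rate when the measured throughput falls well short of
what we are asking for (a crude stand-in for "we are seeing TCP resends"
-- on Linux the real signal could come from the TCP_INFO retransmit
counters), and abandon the upload into a backoff once the usable rate
gets too low. Names and thresholds are invented:]

    // Hypothetical sketch: per-upload rate limiter that halves its target
    // rate when the link looks congested and only creeps back up slowly.

    struct UploadThrottle {
        double target_bps;   // rate we currently allow ourselves
        double floor_bps;    // below this, give up and back off instead

        UploadThrottle(double start_bps, double min_bps)
            : target_bps(start_bps), floor_bps(min_bps) {}

        // Feed in the throughput measured over the last few seconds.
        // Returns false when the transfer should be abandoned for a backoff.
        bool on_progress(double measured_bps) {
            if (measured_bps < 0.5 * target_bps) {
                target_bps *= 0.5;    // congested: back off our demand sharply
            } else {
                target_bps *= 1.05;   // clear: probe upward only gently
            }
            return target_bps >= floor_bps;
        }
    };

The halve-then-creep shape mirrors TCP's own congestion response, so many
clients sharing one project link should settle toward a fair share
instead of fighting each other into the 'disgraceful' degradation
described above.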
