In summary:

- Always use upload ordering, EDF or RR depending on whether the last
  _transfer_ was successful;
- The present Boinc exponential backoff *maintains* the *overload*
  problem and creates a DDOS;
- Possible improvements to the existing backoff scheme are suggested,
  BUT...;
- Must /also/ add dynamic data rate limiting client-side to avoid
  saturating a project's link bandwidth;
- Overload wastes available bandwidth and worsens/lengthens the "DDOS";
  the overload causes a 'disgraceful' degradation of service.

More detailed comments inlined below:

Regards,
Martin


Lynn W. Taylor wrote:
> If uploads are working, any order will do. No reason to sort.

Indeed so, assuming no prior interruption and infinitely fast uploads.

However, note that for large uploads over a slow uplink, the upload can
take many minutes. I have various CPDN tasks with the Boinc upload
deliberately restricted so as not to interfere with home internet usage.
A CPDN upload can take well over an hour.

If you have programmed in upload order sorting, you may as well have it
always enabled. There is then no need for additional checks to decide
whether or not to use a sorted upload order. Just always do an ordered
upload.

> The only time you need to sort is if you're going to slow down the
> retries (so you don't keep hammering the servers to death).

Avoid the test for whether to sort or not. Simply always assume the
worst and so always do the best possible upload order.

> Stop hammering servers, reduce the load, and things will go much more
> smoothly. Recovery after an outage will be faster.

That is what the exponential backoff *does not* attempt to do. There is
no feedback to dynamically avoid a DDOS. Also, the exponential backoff
only takes effect long, long after there are problems, and then it
*maintains* a level of *overload*!

Note that the present backoff mechanism will immediately reinstate a
DDOS as soon as any overload condition is seen to ease, and so it
perpetually re-establishes the overload. I guess this is because the
intention of the exponential backoff was to allow for complete loss of
service when a project goes completely offline. The present scheme is
not actually designed for network load management or service load
management.

(I guess Matt at s...@h has been chasing ghosts on the server settings
when in fact the real problem and control is in the routers at either
end of their 100Mb/s bottleneck. Lose the ACKs/ICMP on a link and you
see all manner of weird 'strangeness' that by its very nature you have
no control over. It all goes 'random'...)

Rather than an exponential backoff that gets reset to zero immediately
upon a successful connection, a bodge-fix could be to keep the present
exponential backoff but, when a connection succeeds, start a slow-start
exponential decay from that backoff time. (Hence holding the backoff,
and so a low connection-attempt rate, for some time to let the overload
subside.) Also, upload/download only one WU at a time until no further
connection failures are seen, again to avoid recreating an overload.
(And yes, singular uploads are more expensive than multiple uploads,
but just one mega-cruncher would swamp and DOS the project otherwise.)
Then eventually return to the normal scheme of uploading as much as
possible in "one big bang" once the backoff has successfully decayed
back to zero. The idea is that the backoff is maintained during, and
for some time after, the overload so that a project server overload can
clear without a perpetual re-DDOSing.
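Very roughly, something along these lines. This is a toy sketch only,
with made-up names, numbers and thresholds rather than anything from
the real Boinc client; the only point is the shape of the on_success()
path and the one-transfer-at-a-time restriction while any backoff
remains:

  // Toy sketch only -- invented names and numbers, not the real Boinc
  // client code.  Shows the idea of decaying the backoff after a
  // success instead of resetting it to zero, and of only risking one
  // transfer at a time until the backoff has decayed away.

  #include <algorithm>
  #include <cstdio>

  constexpr double MIN_BACKOFF_S = 60;        // first backoff step
  constexpr double MAX_BACKOFF_S = 4 * 3600;  // cap the backoff at 4 hours
  constexpr double DECAY_FACTOR  = 0.5;       // halve the backoff per success

  struct ProjectBackoff {
      double backoff_s = 0;   // current backoff interval, seconds

      // A transfer failed: grow the backoff exponentially (as now).
      void on_failure() {
          backoff_s = std::min(MAX_BACKOFF_S,
                               std::max(MIN_BACKOFF_S, backoff_s * 2));
      }

      // A transfer succeeded: instead of jumping straight back to zero,
      // decay the backoff gradually so the connection rate ramps up
      // slowly and the server overload has time to clear.
      void on_success() {
          backoff_s *= DECAY_FACTOR;
          if (backoff_s < MIN_BACKOFF_S) backoff_s = 0;  // fully recovered
      }

      // While any backoff remains, only allow one file transfer at a
      // time; once it has decayed to zero, go back to "one big bang".
      int max_concurrent_transfers() const {
          return backoff_s > 0 ? 1 : 8;   // 8 is a stand-in for "many"
      }
  };

  int main() {
      ProjectBackoff b;
      for (int i = 0; i < 5; i++) b.on_failure();    // outage: backoff grows
      for (int i = 0; i < 12; i++) {                 // recovery: slow decay
          b.on_success();
          std::printf("backoff %.0fs, %d transfer(s) allowed\n",
                      b.backoff_s, b.max_concurrent_transfers());
      }
      return 0;
  }

The exact decay factor and cap don't much matter. What matters is that
the connection attempt rate ramps back up gradually instead of snapping
straight back to the pre-outage rate.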
Really, the Boinc server should dynamically retune the client backoff
parameters for each project to match the expected network load to the
network bandwidth available.

> But, if you do slow the client way down, you really don't want to get
> lucky, hit that one-in-fifty successful upload, and have it upload work
> that isn't due for weeks, when you've got something that will expire in
> a few hours.
>
> I do see your point. This would work:

Thanks.

> Keep a count of failed uploads. A successful retry sets the count to
> zero, a failure increments.
>
> If the counter reaches two or three, re-sort the work units by due-date,
> and reset the timers so that they run out in deadline-order.

Overly complicated. Just always assume the worst. You can't get any
better than the best ordering. Choose ordered if the last _transfer_
was a success, round-robin if the last _transfer_ failed. (A tiny
sketch of that rule is appended below my sig, the first of two sketches
at the end of this mail.)

> I agree that a saturated link has a much lower data rate, but instead of
> slowing down after you get a connection, how about slowing down the
> connections so the link isn't saturated to start with?

That doesn't work in the same way, and that is the problem. A single
connection request can get through on an uncongested link, and then be
the cause of another +10Mbit/s uplink of data that then saturates the
link. Note that if I unrestrict the Boinc uplink limit, I can overload
(DOS) the s...@h link with just one machine!

The s...@h servers can easily add traffic management for themselves for
their /outgoing/ data to avoid saturating their link (but does Boinc or
the upload/download servers do that at the moment?). However, they
cannot directly traffic manage what gets delayed or dropped for the
incoming traffic (uploads from clients) to avoid the logjam on the ISP
side of the link.

Hence, you need to rate limit the data rate for each individual client
rather than just hope that you never get more than five or so clients
connecting simultaneously. For example, if you have transfers in
progress for 10 clients simultaneously, you must restrict their data
rates to one tenth of the link speed. Otherwise, you suffer a
'disgraceful' degradation as data packets get dropped/lost and resent,
multiple times over... Dropping data packets is always a bad 'last
ditch' bodge mess. It's just that a TCP connection hides all the mess
apart from the resultant data rate slowdown. (The second sketch at the
end of this mail shows roughly the sort of client-side throttling I
have in mind.)

Regards,
Martin


> Martin wrote:
>> David Anderson wrote:
>>> If you have a build environment, check out the trunk and build.
>>> Otherwise we'll have to wait for Rom to either backport this to 6.6
>>> or 6.8.
>>>
>>> -- David
>>>
>>> Lynn W. Taylor wrote:
>> [---]
>>>>>> So, when any retry timer runs out, instead of retrying that WU,
>>>>>> retry the one with the earliest deadline -- the one at the highest
>>>>>> risk.
>>
>> Beware the case where an upload fails because of some db or storage
>> problem for a singular WU... One problem WU upload shouldn't cause all
>> others to fail.
>>
>> I suggest use round-robin ordering when any upload fails (after all,
>> no WUs are getting through in any case) and use earliest deadline
>> first order only when any upload has succeeded.
>>
>>
>> A second idea is to have the Boinc *client* dynamically reduce its
>> upload data rate in real time if it detects any lost data packets
>> (detects TCP resends). The upload is abandoned for a backoff time if
>> the upload rate reduces to too low a rate. This is a more sensitive
>> attempt to avoid a DDOS on the project servers than just using an
>> arbitrary backoff mechanism.
>>
>> Note that a saturated link has a much LOWER effective bandwidth than
>> a maximally utilised link.
>>
>> Regards,
>> Martin

--
--------------------
Martin Lomas
m_boincdev ml1 co uk.ddSPAM.dd
--------------------
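PS: The first sketch promised above, the EDF/round-robin ordering rule.
Again a toy illustration only, with an invented PendingUpload type and
invented names rather than the real Boinc transfer code; it just shows
how little logic the "always assume the worst" rule needs:

  // Toy sketch of the ordering rule only -- invented types, not Boinc
  // code.  Earliest-deadline-first while transfers are getting
  // through, a plain round-robin rotation while they are not, so that
  // one stuck WU cannot hog every retry slot.

  #include <algorithm>
  #include <cstdio>
  #include <ctime>
  #include <vector>

  struct PendingUpload {
      const char* name;
      std::time_t deadline;   // report deadline of the result
  };

  void order_uploads(std::vector<PendingUpload>& q, bool last_transfer_ok) {
      if (last_transfer_ok) {
          // Uploads are getting through: send the most urgent work first.
          std::stable_sort(q.begin(), q.end(),
              [](const PendingUpload& a, const PendingUpload& b) {
                  return a.deadline < b.deadline;
              });
      } else if (q.size() > 1) {
          // Uploads are failing: rotate the queue so a single WU with a
          // server-side db/storage problem doesn't block all the others.
          std::rotate(q.begin(), q.begin() + 1, q.end());
      }
  }

  int main() {
      std::time_t now = std::time(nullptr);
      std::vector<PendingUpload> q = {
          {"wu_c", now + 7 * 86400},
          {"wu_a", now + 3600},
          {"wu_b", now + 86400},
      };

      order_uploads(q, true);    // EDF order: wu_a, wu_b, wu_c
      for (const auto& u : q) std::printf("%s ", u.name);
      std::printf("\n");

      order_uploads(q, false);   // round-robin: wu_b, wu_c, wu_a
      for (const auto& u : q) std::printf("%s ", u.name);
      std::printf("\n");
      return 0;
  }

The rotation keeps every WU getting a turn while nothing is getting
through, so one broken upload cannot starve the rest.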

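And the second sketch, for the client-side throttling. A client cannot
normally see TCP resends directly, so this uses the achieved rate
falling well short of the attempted rate as a stand-in for "the link is
saturated and packets are being dropped". All the numbers are plucked
out of the air and none of it is real Boinc code:

  // Very rough sketch of the client-side idea only -- invented numbers
  // and names, nothing to do with the real Boinc transfer code.  Watch
  // the upload rate actually achieved; if it falls well below the rate
  // we are trying to send at (which, on a saturated link, means packets
  // are being dropped and resent), cut our own sending rate instead of
  // continuing to hammer the link.  Give up and back off if the usable
  // rate collapses altogether.

  #include <cstdio>

  constexpr double FLOOR_BPS = 2048;   // below ~2 KiB/s, abandon and back off

  struct UploadThrottle {
      double target_Bps;   // rate we are currently trying to send at

      // Call once per measurement interval with the bytes/s actually
      // acknowledged.  Returns false if the transfer should be
      // abandoned for a backoff period.
      bool adjust(double achieved_Bps) {
          if (achieved_Bps < 0.75 * target_Bps) {
              // Link looks saturated (resends are eating the bandwidth):
              // halve our own rate so the shared uplink drops back below
              // saturation.
              target_Bps *= 0.5;
          } else if (achieved_Bps > 0.95 * target_Bps) {
              // Everything is getting through: probe gently upwards.
              target_Bps *= 1.1;
          }
          return target_Bps >= FLOOR_BPS;
      }
  };

  int main() {
      UploadThrottle t{100 * 1024.0};   // start at ~100 KiB/s
      const double achieved[] = {       // a pretend worsening link
          95 * 1024.0, 20 * 1024.0, 5 * 1024.0, 1024.0, 256.0, 64.0, 16.0
      };
      for (double a : achieved) {
          bool keep_going = t.adjust(a);
          std::printf("achieved %.0f B/s -> new target %.0f B/s -> %s\n",
                      a, t.target_Bps,
                      keep_going ? "continue" : "abandon and back off");
          if (!keep_going) break;
      }
      return 0;
  }

On the server side the equivalent is simpler still: divide the usable
link speed by the number of transfers currently in progress and cap
each connection at that share, as described above.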