In summary:

Always use upload ordering: EDF or RR, dependent on whether the last 
_transfer_ was successful;

The present Boinc exponential backoff *maintains* the *overload* problem 
and creates a DDOS;

Possible improvements to the existing backoff scheme are suggested, 
BUT... ;

Must /also/ add dynamic data rate limiting client-side to avoid 
saturating a project's link bandwidth;

Overload wastes available bandwidth and worsens/lengthens the "DDOS"; 
the overload causes a 'disgraceful' degradation of service.


More detailed comments inlined below:

Regards,
Martin



Lynn W. Taylor wrote:
> If uploads are working, any order will do.  No reason to sort.

Indeed so, assuming no prior interruption and infinitely fast uploads.

However, note that for large uploads over a slow uplink, the upload can 
take many minutes. I have various CPDN tasks with the Boinc upload rate 
deliberately restricted so as not to interfere with home internet usage. 
A CPDN upload can take well over an hour.

If you have programmed in upload order sorting, you may as well have it 
always enabled. There is then no need for additional checks to decide 
whether or not to use a sorted upload order. Just always do an ordered 
upload.


> The only time you need to sort is if you're going to slow down the 
> retries (so you don't keep hammering the servers to death).

Avoid the test for whether to sort or not. Simply always assume the 
worst, and so always use the best possible upload order.


> Stop hammering servers, reduce the load, and things will go much more 
> smoothly.  Recovery after an outage will be faster.

That is what the exponential backoff *does not* attempt to do.

There is no feedback to dynamically avoid a DDOS. Also, the exponential 
backoff only takes effect long, long after there are problems, and then 
it *maintains* a level of *overload*! Note that the present backoff 
mechanism will immediately reinstate a DDOS as soon as any overload 
condition is seen to ease, and so it perpetually re-establishes the 
overload.

I guess this is because the intention of the exponential backoff was to 
allow for complete loss of service when a project goes completely 
offline. The present scheme is not actually designed for network load 
management or service load management.

(I guess Matt at s...@h has been chasing ghosts on the server settings when 
in fact the real problem and control is in the routers at either end of 
their 100Mb/s bottleneck. Lose the ACKs/ICMP on a link and you see all 
manner of weird 'strangeness' that by its very nature you have no 
control over. It all goes 'random'...)


Rather than an exponential backoff that gets reset to zero immediately 
upon a successful connection, a bodge-fix could be to keep the present 
exponential backoff but, when a connection succeeds, start a slow-start 
exponential decay from that backoff time. (Hence holding the backoff, 
and a low connection attempt rate, for some time to let the overload 
subside.) Also, upload/download only one WU at a time until no further 
connection failures are seen, again to avoid recreating an overload.

(And yes, singular uploads are more expensive than batched uploads, but 
otherwise just one mega-cruncher could swamp and DOS the project.)

Then eventually return to the normal scheme of uploading as much as 
possible in "one big bang" once the backoff has successfully decayed 
back to zero.

The idea there is that the backoff time is maintained during the 
overload, and for some time afterwards, so that a project server 
overload can clear without creating a perpetual re-DDOSing.


Really, the Boinc server should dynamically retune the client backoff 
parameters for each project to match the expected network load vs the 
network bandwidth available.
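
Something along those lines, server-side, might look like this (a 
sketch with invented names and fields, not an existing scheduler 
feature):

#include <algorithm>

// Hypothetical scheduler-side calculation: scale the client backoff so
// that one full round of retries from the active client population can
// drain through the project's link.
struct BackoffTuning {
    double min_backoff_s;
    double max_backoff_s;
};

BackoffTuning tune_backoff(double active_clients,
                           double avg_upload_bytes,
                           double link_bytes_per_s)
{
    // Time the link needs to drain one upload from every active client.
    double drain_s = active_clients * avg_upload_bytes / link_bytes_per_s;
    BackoffTuning t;
    t.min_backoff_s = std::max(60.0, drain_s);
    t.max_backoff_s = std::max(3600.0, t.min_backoff_s * 16);
    return t;
}

The tuning values would then be sent to the clients in the scheduler 
reply and used in place of the current hard-coded backoff limits.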


> But, if you do slow the client way down, you really don't want to get 
> lucky, hit that one-in-fifty successful upload, and have it upload work 
> that isn't due for weeks, when you've got something that will expire in 
> a few hours.
> 
> I do see your point.  This would work:

Thanks.


> Keep a count of failed uploads.  A successful retry sets the count to 
> zero, a failure increments.
> 
> If the counter reaches two or three, re-sort the work units by due-date, 
> and reset the timers so that they run out in deadline-order.

Overly complicated. Just always assume the worst: you can't get any 
better than the best ordering. Choose ordered (EDF) if the last 
_transfer_ was a success, round-robin if the last _transfer_ failed.
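
For illustration, that selection rule in C++ (the types and names are 
made up, not the real Boinc client structures):

#include <algorithm>
#include <vector>

struct PendingUpload {
    double deadline;   // report deadline, seconds since the epoch
    // ... other per-file transfer state ...
};

// Decide the order in which to attempt pending uploads, based only on
// whether the last transfer attempt succeeded.
std::vector<PendingUpload> order_uploads(
    std::vector<PendingUpload> uploads, bool last_transfer_ok)
{
    if (last_transfer_ok) {
        // Last transfer worked: earliest deadline first, so the WU at
        // most risk of missing its deadline goes out first.
        std::sort(uploads.begin(), uploads.end(),
            [](const PendingUpload& a, const PendingUpload& b) {
                return a.deadline < b.deadline;
            });
    } else if (!uploads.empty()) {
        // Last transfer failed: rotate round-robin so one problem WU
        // cannot permanently block all the others.
        std::rotate(uploads.begin(), uploads.begin() + 1, uploads.end());
    }
    return uploads;
}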


> I agree that a saturated link has a much lower data rate, but instead of 
> slowing down after you get a connection, how about slowing down the 
> connections so the link isn't saturated to start with?

That doesn't work in the same way, and that is the problem.

A single connection request can get through on an uncongested link, and 
then trigger a further 10+ Mbit/s of uplink data that then saturates 
the link. Note that if I unrestrict the Boinc uplink limit, I can 
overload (DOS) the s...@h link with just one machine!

The s...@h servers can easily add traffic management for their 
/outgoing/ data to avoid saturating their link (but do Boinc or the 
upload/download servers do that at the moment?). However, they cannot 
directly traffic-manage what gets delayed or dropped for the incoming 
traffic (uploads from clients) to avoid the logjam on the ISP side of 
the link.

Hence, you need to rate limit the data rate for each individual client 
rather than just hope that you never get more than five or so clients 
connecting simultaneously.

For example, if you have transfers in progress for 10 clients 
simultaneously, you must restrict their data rates to one tenth of the 
link speed. Otherwise, you suffer a 'disgraceful' degradation as data 
packets get dropped/lost and resent, multiple times over...
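
As a back-of-envelope sketch (a hypothetical helper, not an existing 
Boinc function):

// Trivial fair-share calculation, assuming the server (or its traffic
// shaper) knows how many transfers are currently in progress.
double per_client_rate_bps(double link_rate_bps, int active_transfers) {
    if (active_transfers <= 0) return link_rate_bps;
    // e.g. 10 simultaneous uploads on a 100 Mbit/s link -> 10 Mbit/s
    // each, instead of every client fighting for the full link and
    // collapsing into packet loss and retransmits.
    return link_rate_bps / active_transfers;
}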


Dropping data packets is always a bad 'last ditch' bodge mess. It's just 
that a TCP connection hides all the mess apart from the resultant data 
rate slowdown.

Regards,
Martin


> Martin wrote:
>> David Anderson wrote:
>>> If you have a build environment, check out the trunk and build.
>>> Otherwise we'll have to wait for Rom to either backport this to 6.6 
>>> or 6.8.
>>>
>>> -- David
>>>
>>> Lynn W. Taylor wrote:
>> [---]
>>>>>> So, when any retry timer runs out, instead of retrying that WU, 
>>>>>> retry the one with the earliest deadline -- the one at the highest 
>>>>>> risk.
>>
>> Beware the case where an upload fails because of some db or storage 
>> problem for a singular WU... One problem WU upload shouldn't cause all 
>> others to fail.
>>
>> I suggest use round-robin ordering when any upload fails (after all, 
>> no WUs are getting through in any case) and use earliest deadline 
>> first order only when any upload has succeeded.
>>
>>
>> A second idea is to have the Boinc *client* dynamically reduce its 
>> upload data rate in real time if it detects any lost data packets 
>> (detects TCP resends). The upload is abandoned for a backoff time if 
>> the upload rate reduces to too low a rate. This is a more sensitive 
>> attempt to avoid a DDOS on the project servers than just using an 
>> arbitrary backoff mechanism.
>>
>> Note that a saturated link has a much LOWER effective bandwidth than a 
>> maximally utilised link.
>>
>> Regards,
>> Martin


-- 
--------------------
Martin Lomas
m_boincdev ml1 co uk.ddSPAM.dd
--------------------