[freenet-dev] Beyond New Load Management: A proposal

2011-08-31 Thread Arne Babenhauserheide
Am Mittwoch, 31. August 2011, 13:25:35 schrieb Matthew Toseland:
> On Tuesday 30 Aug 2011 21:02:54 Arne Babenhauserheide wrote:
> > what this means: if a SSK has a mean success rate of 0.16, then using
> > 0.25 as value makes sure that 95% of the possible cases don't exhaust
> > the bandwidth. We then use only 64% of the bandwidth on average,
> > though. With 0.2, we'd get 68% of the possible distributions safe and
> > use 80% of bandwidth on average.
> Only if there are no clusters - a single requestor fetches a bunch of stuff
> that is all already there, rather than polling keys that usually aren't.
> IMHO there will be. For instance, when a new chat client is started up, a
> lot of what it does will be fetching existing messages rather than polling
> for new ones.

But these are in cache, so the routes will be very short. And SSKs with very 
high HTL (18, 17, 16) have good success rates, so this change will affect them 
much less than low-HTL SSKs (the unsuccessful ones). After HTL 15 or so, their 
success rate drops massively - which is quite easy to explain: the content 
mostly isn't there. 

Simply take the success-rate per HTL - as in the statistics window. 
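Roughly like this (a sketch only; the per-HTL rates below are made-up placeholders, the real ones would come from the node's stats):

# Hypothetical sketch: weight each SSK's bandwidth reservation by the
# observed success rate at its current HTL instead of one flat average.
# The table is illustrative, not measured data.
success_rate_by_htl = {18: 0.6, 17: 0.6, 16: 0.5, 15: 0.2, 14: 0.1}

def ssk_reservation(htl, data_to_transfer_on_success, default_rate=0.16):
    rate = success_rate_by_htl.get(htl, default_rate)
    return rate * data_to_transfer_on_success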

Best wishes, 
Arne



[freenet-dev] Beyond New Load Management: A proposal

2011-08-31 Thread Matthew Toseland
On Tuesday 30 Aug 2011 21:02:54 Arne Babenhauserheide wrote:
> Am Dienstag, 30. August 2011, 01:08:16 schrieb Arne Babenhauserheide:
> > 5) solution: count each SSK as only
> > average_SSK_success_rate * data_to_transfer_on_success.
> 
> Some more data: 
> 
> chances of having at least this many successful transfers for 40 SSKs with a 
> mean success rate of 16%: 
> 
> for i in {0..16}; do echo $i $(./spielfaehig.py 0.16 40 $i); done
> 
> 0 1.0
> 1 0.999064224991
> 2 0.99193451064
> 3 0.965452714478
> 4 0.901560126912
> 5 0.788987472629
> 6 0.634602118184
> 7 0.463062835467
> 8 0.304359825607
> 9 0.179664603573
> 10 0.0952149293922
> 11 0.0453494074947
> 12 0.0194452402752
> 13 0.00752109980912
> 14 0.0026291447461
> 15 0.000832100029072
> 16 0.00023879002726
> 
> what this means: if a SSK has a mean success rate of 0.16, then using 0.25 as 
> value makes sure that 95% of the possible cases don't exhaust the bandwidth. 
> We then use only 64% of the bandwidth on average, though. With 0.2, we'd get 
> 68% of the possible distributions safe and use 80% of bandwidth on average.

Only if there are no clusters - a single requestor fetches a bunch of stuff 
that is all already there, rather than polling keys that usually aren't. IMHO 
there will be. For instance, when a new chat client is started up, a lot of 
what it does will be fetching existing messages rather than polling for new 
ones.



Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-31 Thread Matthew Toseland
On Monday 29 Aug 2011 20:34:01 Ian Clarke wrote:
 On Mon, Aug 29, 2011 at 2:21 PM, Matthew Toseland t...@amphibian.dyndns.org
  wrote:
 
- What is the average load reported by responses this node forwarded, per
   remote node
  
   Ahhh, this one could be interesting - you could use it to penalise nodes
   which spam excessively.
 
  Actually, thinking about it, I don't think this will work. Requests are
  usually distributed evenly across the keyspace, and downstream nodes only
  know they are from this node - not which previous node they are from. So the
  pain will be shared across all the peers sending us requests, including our
  own local requests, and won't be identifiable as due to any single node. You
  have to do it by volume, not by reported load.
 
 But if a node is being abusive, won't the nodes it's requesting from tend to
 have a higher than average load?  

If its requests differ from the node's overall specialisation, and so are 
routed to a particular subset of peers, then those peers might show a higher 
than average load. But this is not necessarily true if it is a deliberate attack.

 Perhaps you are right though, but this is
 the kind of thought-process we need.

To prevent a node from using an excessive proportion of our capacity, we need 
to *measure the proportion of our capacity that the node uses*. This much is 
straightforward IMHO. And that's the basis of fair sharing between peers.
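As a sketch of what "measure the proportion" could look like (field names and the decay constant are assumptions, not the node's actual code):

# Hypothetical per-peer usage accounting with exponential decay, so that
# recent traffic dominates. share() gives the fraction of our capacity a
# peer has been using lately.
import time

class PeerUsage:
    def __init__(self, half_life_seconds=300.0):
        self.half_life = half_life_seconds
        self.usage = {}        # peer id -> decayed bytes consumed
        self.last_update = {}  # peer id -> time of last update

    def _decay(self, peer, now):
        dt = now - self.last_update.get(peer, now)
        self.usage[peer] = self.usage.get(peer, 0.0) * 0.5 ** (dt / self.half_life)
        self.last_update[peer] = now

    def record(self, peer, bytes_used):
        now = time.time()
        self._decay(peer, now)
        self.usage[peer] += bytes_used

    def share(self, peer):
        now = time.time()
        for p in list(self.usage):
            self._decay(p, now)
        total = sum(self.usage.values()) or 1.0
        return self.usage.get(peer, 0.0) / total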




[freenet-dev] Beyond New Load Management: A proposal

2011-08-30 Thread Arne Babenhauserheide
Am Dienstag, 30. August 2011, 01:08:16 schrieb Arne Babenhauserheide:
> 5) solution: count each SSK as only
>   average_SSK_success_rate * data_to_transfer_on_success.

Some more data: 

chances of having at least this many successful transfers for 40 SSKs with a 
mean success rate of 16%: 

for i in {0..16}; do echo $i $(./spielfaehig.py 0.16 40 $i); done

0 1.0
1 0.999064224991
2 0.99193451064
3 0.965452714478
4 0.901560126912
5 0.788987472629
6 0.634602118184
7 0.463062835467
8 0.304359825607
9 0.179664603573
10 0.0952149293922
11 0.0453494074947
12 0.0194452402752
13 0.00752109980912
14 0.0026291447461
15 0.000832100029072
16 0.00023879002726

what this means: if a SSK has a mean success rate of 0.16, then using 0.25 as 
value makes sure that 95% of the possible cases don't exhaust the bandwidth. 
We then use only 64% of the bandwidth on average, though. With 0.2, we'd get 
68% of the possible distributions safe and use 80% of bandwidth on average.

Note: this is just a binomial spread: 

from math import factorial
fac = factorial

def nük(n, k):
    # binomial coefficient "n über k" (n choose k)
    if k > n: return 0
    return fac(n) // (fac(k)*fac(n-k))

def binom(p, n, k):
    # probability of exactly k successes out of n trials with success probability p
    return nük(n, k) * p**k * (1-p)**(n-k)

def spielfähig(p, n, min_spieler):
    # probability of at least min_spieler successes out of n trials
    return sum([binom(p, n, k) for k in range(min_spieler, n+1)])
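For reference, the table above can be reproduced directly from these functions, and the 0.25 case works out like this (usage sketch):

# Reproduce the table: P(at least k successes out of 40) for p = 0.16.
for k in range(17):
    print(k, spielfähig(0.16, 40, k))

# Reserving bandwidth for 0.25 * 40 = 10 SSK successes:
# P(more than 10 succeed) = spielfähig(0.16, 40, 11) ≈ 0.045, so ~95% of
# outcomes stay within the reservation, while the expected usage is
# 0.16 * 40 = 6.4 successes, i.e. about 64% of what was reserved.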


→ USK@6~ZDYdvAgMoUfG6M5Kwi7SQqyS-
gTcyFeaNN1Pf3FvY,OSOT4OEeg4xyYnwcGECZUX6~lnmYrZsz05Km7G7bvOQ,AQACAAE/bab/9/Content-
D426DC7.html


Best wishes, 
Arne



[freenet-dev] Beyond New Load Management: A proposal

2011-08-30 Thread Arne Bab.
Von: "Matthew Toseland" ???
>But the other question is, can queueing ever be helpful? It can if it allows 
>us to route more accurately (which NLM clearly does), and/or to run enough 
>requests in parallel that the longer time taken for the request to reach its 
>destination is offset. Is this condition met?

Experience with the deployed NLM showed that even in the fully congested case 
it had success rates of 60% for HTL 18, 17 and 16, compared to less than 40% for 
OLM. This means that requests are sent over fewer hops on average, because they 
find the content fewer hops away from the requester.

A download of 1MiB which is sent over 2 hops needs 2 MiB in total network 
bandwidth.
If it is sent over only 1.5 hops on average, then it needs only 1.5 MiB total 
network bandwidth.
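As a back-of-the-envelope check (the 2 and 1.5 hop figures are the illustrative ones from above):

# Total upload consumed ≈ payload size × average number of hops the data travels.
payload_mib = 1.0
hops_olm, hops_nlm = 2.0, 1.5        # illustrative averages from the example above
print(payload_mib * hops_olm)        # 2.0 MiB of network upload
print(payload_mib * hops_nlm)        # 1.5 MiB of network upload
print(hops_olm / hops_nlm - 1)       # ≈ 0.33, i.e. roughly 30% more payload per unit of upload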

So essentially NLM can distribute 30% more content with the same network 
resources¹. And these numbers are actual observations. The only reason why this 
did not result in increased performance is that the nodes used less than 50% of 
their allocated bandwidth² - which is a problem with the bandwidth scheduler 
and not with queueing.

Best wishes,
Arne

¹: The relevant network resource is upload bandwidth.
²: Source: observations from me and two other freenet users.

PS: How exactly the bandwidth limiter is fixed is an implementation detail. I 
think you are actually the only person who can judge how to do this most 
efficiently.




[freenet-dev] Beyond New Load Management: A proposal

2011-08-30 Thread Arne Babenhauserheide
After discussing and going deeper into this, it became apparent that the 
problem is not overload of queues. I'll repeat the result I came to here (and 
not in chatlog form :) ): 

The problem is in the load limiter: 


1) we need to use all bandwidth, because latency depends on the number of  
finishing transfers.

2) SSKs fail 85% of the time.

3) the load limiter assumes that all SSKs succeed and reserves bandwidth for 
them.

4) this now bites us, because NLM has shorter transfer times and longer wait 
times. So the over-reservation of bandwidth for SSKs lasts longer. 

5) solution: count each SSK as only 
average_SSK_success_rate * data_to_transfer_on_success.


To (1): When we use less bandwidth, fewer CHKs succeed per second, so fewer 
slots get freed and the queueing time gets bigger.
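A minimal sketch of point (5), with made-up names for the limiter's bookkeeping (this shows the shape of the change, not actual node code):

# Reserve only the *expected* bandwidth for an SSK instead of its full
# success-case transfer size; CHKs keep the full reservation.
AVERAGE_SSK_SUCCESS_RATE = 0.15   # ~85% of SSKs fail, see point (2)

def reserved_bytes(request_type, data_to_transfer_on_success):
    if request_type == "SSK":
        return AVERAGE_SSK_SUCCESS_RATE * data_to_transfer_on_success
    return data_to_transfer_on_success

def can_accept(request_type, size, reserved_so_far, liability_limit):
    return reserved_so_far + reserved_bytes(request_type, size) <= liability_limit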

Best wishes, 
Arne



Am Montag, 29. August 2011, 22:27:47 schrieb Matthew Toseland:
> Okay, I don't understand most of that, I might be able to check the math if
> it was written properly, but it looks difficult. However, as far as I can
> see: - The most obvious way to increase bandwidth usage would be to
> increase the timeout time for output bandwidth liability (and at the same
> time increase the relevant block transfer timeouts). - This would increase
> the number of slots but it would also increase the number of requests
> seeking them; I don't see why it would help matters. - Running an excessive
> number of requests without adjusting the block transfer timeouts would
> result in some of the transfers timing out. - It would also, as you
> mention, make SSK requests (especially failed SSK requests) even slower. -
> I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. -
> As far as I can see, all the benefits of NLM re attackers are achieved by
> fair sharing. - A system which is more resistant to attacks but slower
> probably isn't all that interesting if the attacks in question are
> relatively expensive anyway. - Smaller block sizes would have a significant
> efficiency cost, and would probably make load management more difficult.
> 
> I apologise if you see this as rather a broadside after I encouraged you to
> analyse the problem, but it is not yet a convincing demonstration that
> queueing is a viable strategy and not the spawn of satan! :)
> On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> > Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke:
> > > Yes, small tweaks have worked so well for us for the last decade,
> > > leaving us pretty-much where we were in 2003.  No, we don't
> > > understand how the current system works, there is no point in
> > > trying to fix something when we don't even know what is broken.
> > 
> > I'd like to present a clue what is broken in NLM. Before I kill you with the
> > log, here's the result:
> > With NLM the latency of a request is a function of the raw bandwidth
> > (not so with OLM), and NLM used only half my bandwidth after it had
> > been deployed for 2 days (at the start much more).
> > 
> > τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue
> > time (time to find the node), nlm: new load management, olm: old load
> > management.
> > 
> > So first step: make sure all bandwidth gets used - maybe by allocating
> > more slots till we use all allowed bandwidth. Better having to throttle
> > a transfer than not using bandwidth.
> > 
> > *NLM should with the current network be slower than OLM by 23%. But in 18
> > months it should actually be faster by ~8% – given Moore's Law holds for
> > upload bandwidth – because the routes are shorter.*
> > 
> > The main advantage of NLM is, that it should be much more resilient
> > against attackers (DoS).
> > 
> > 
> >  SSK-time: σ, CHK-time: χ, success: Xs, fail: Xf.
> >  queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, w: success rate
> >  χs = τ(h) + q(h)
> >  χf = q(h)
> >  χ ~ w₁·χs + (1-w₁)·χf
> >  σs = τ(h) + q(h)
> >  σf = q(h)
> >  σ ~ w₂·σs + (1-w₂)·σf; w₂ ~ 15%
> >  num(σ) / num(χ) ~ 1
> >  ⇒ time ~ χ + σ
> >  q(h) depends on timeouts, as do w₁ and w₂
> >  time = w₁·χs + (1-w₁)·χf + w₂·σs + (1-w₂)·σf
> >  = w₁·(τ(h) + q(h)) + (1-w₁)·q(h) + w₂·(τ(h) + q(h)) + (1-w₂)·q(h)
> >  = t(h)·(w₁+w₂) + 2·q(h)·(2-w₁-w₂)
> >  = τ(h)·(w₁+w₂) + 2·q(h)·(2-w₁-w₂)
> >  in the congestion case q(h) ~ timeout
> >  timeout = o
> >  timeout: o
> >  w depends on the timeout *somehow*, but inversely
> >  o=0 ⇒ w=0
> >  assumption: o = ∞ ⇒ w₂ ~ 20%, w₁ ~ 100%
> >  assumption: o = ∞ ⇒ w₂ ~ 0.2, w₁ ~ 1
> >  correction: in the congestion case: q(h) ~ min(timeout, τ(h))
> >  timeout matters for q(h) only when timeout < τ(h)
> >  I try to: I still need a dependency of w on timeout
> >  → let's call it t(w)
> >  better: w(o) :)
> >  well, if there is a timeout, we have a fixed time, but we reduce the hops ...
> >  i 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-30 Thread Matthew Toseland
On Tuesday 30 Aug 2011 00:18:35 Matthew Toseland wrote:
> On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:
> > After discussing and going deeper into this, it became apparent that the 
> > problem is not overload of queues. I'll repeat the result I came to here 
> > (and 
> > not in chatlog form :) ): 
> > 
> > The problem is in the load limiter: 
> > 
> > 
> > 1) we need to use all bandwidth, because latency depends on the number of  
> > finishing transfers.
> > 
> > 2) SSKs fail 85% of the time.
> > 
> > 3) the load limiter assumes that all SSKs succeed and reserves bandwidth 
> > for 
> > them.
> > 
> > 4) this now bites us, because NLM has shorter transfer times and longer 
> > wait 
> > times. So the over-reservation of bandwidth for SSKs lasts longer. 
> > 
> > 5) solution: count each SSK as only 
> > average_SSK_success_rate * data_to_transfer_on_success.
> > 
> > 
> > To (1): When we use less bandwidth, fewer CHKs succeed per second, so 
> > fewer slots get freed and the queueing time gets bigger.
> 
> Another way to look at this:
> 
> Both me and Ian have been assuming that the resource is some number 
> representing a property of the node, for example, how many requests are 
> running. But the real resource is the external bandwidth usage, which is 
> rather different.
> 
> An alternative solution - if we are worried about bursts of requests 
> succeeding and therefore timing out - is to allocate a fixed chunk of 
> bandwidth for SSKs, and not let them interfere with CHKs. Then CHKs can have 
> long transfer times, and therefore long queueing times when they fail (but 
> this is only ~ 40% of the time with NLM's accurate routing), and SSKs can 
> have short transfer times, and therefore short queueing times. This may be 
> less disruptive and should also be relatively easy to implement.
> 
> But the other question is, can queueing ever be helpful? It can if it allows 
> us to route more accurately (which NLM clearly does), and/or to run enough 
> requests in parallel that the longer time taken for the request to reach its 
> destination is offset. Is this condition met?

Yes it is!

If a request takes P seconds without queueing, and P+Q seconds with queueing, 
then if we have the same number of requests running at once, we have a slowdown 
of (P+Q)/P. However, if we start (P+Q)/P times more requests, and the queueing 
time remains constant, we can compensate for this. 

Does it? 

There are two main components:
1) The queueing time. This is roughly proportional to (number of requests 
running) / (number of slots available). If we increase the capacity of the 
node, we multiply both the top and bottom by the same factor - so the queueing 
time is unchanged, unless the data transfer time is changed.
2) The data transfer time. In the above case with a slowdown, the data transfer 
time is the same, or a little shorter, because most of our bandwidth isn't 
being used. If we increase it by running more requests at once, it should reach 
where it was, and if we increase it further, transfers will start to take 
longer again.
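A toy calculation of the two points above (numbers purely illustrative):

# P = time per request without queueing, Q = added queueing time.
P, Q = 20.0, 40.0                   # seconds, illustrative only
slowdown = (P + Q) / P              # each request takes 3x as long
parallel = slowdown                 # so run 3x as many requests at once
print(parallel / slowdown)          # 1.0 -> same throughput, *if* Q stays constant

# Component (1): queueing time ~ running_requests / available_slots,
# so scaling both by the same factor k leaves it unchanged.
running, slots, k = 30, 10, 3
print(running / slots, (k * running) / (k * slots))   # 3.0 vs 3.0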

ArneBab has analysed this in more detail and shown that the largest component 
of the queueing time is proportional to the data transfer time, hence, 
separating SSKs (which have a very short transfer time) from CHKs (which have a 
much longer transfer time) would help significantly.

The key advantage of NLM, of course, is that it routes accurately, resulting in 
10%-30% improvements in success rates, and transferring data over significantly 
fewer hops (= more effective bandwidth).
> > 
> > Best wishes, 
> > Arne
> > 
> > 
> > 
> > Am Montag, 29. August 2011, 22:27:47 schrieb Matthew Toseland:
> > > Okay, I don't understand most of that, I might be able to check the math 
> > > if
> > > it was written properly, but it looks difficult. However, as far as I can
> > > see: - The most obvious way to increase bandwidth usage would be to
> > > increase the timeout time for output bandwidth liability (and at the same
> > > time increase the relevant block transfer timeouts). - This would increase
> > > the number of slots but it would also increase the number of requests
> > > seeking them; I don't see why it would help matters. - Running an 
> > > excessive
> > > number of requests without adjusting the block transfer timeouts would
> > > result in some of the transfers timing out. - It would also, as you
> > > mention, make SSK requests (especially failed SSK requests) even slower. -
> > > I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. 
> > > -
> > > As far as I can see, all the benefits of NLM re attackers are achieved by
> > > fair sharing. - A system which is more resistant to attacks but slower
> > > probably isn't all that interesting if the attacks in question are
> > > relatively expensive anyway. - Smaller block sizes would have a 
> > > significant
> > > efficiency cost, and would probably make load management more 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-30 Thread Matthew Toseland
On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:
> After discussing and going deeper into this, it became apparent that the 
> problem is not overload of queues. I'll repeat the result I came to here (and 
> not in chatlog form :) ): 
> 
> The problem is in the load limiter: 
> 
> 
> 1) we need to use all bandwidth, because latency depends on the number of  
>   finishing transfers.
> 
> 2) SSKs fail 85% of the time.
> 
> 3) the load limiter assumes that all SSKs succeed and reserves bandwidth for 
>   them.
> 
> 4) this now bites us, because NLM has shorter transfer times and longer wait 
>   times. So the over-reservation of bandwidth for SSKs lasts longer. 
> 
> 5) solution: count each SSK as only 
>   average_SSK_success_rate * data_to_transfer_on_success.
> 
> 
> To (1): When we use less bandwidth, fewer CHKs succeed per second, so 
> fewer slots get freed and the queueing time gets bigger.

Another way to look at this:

Both me and Ian have been assuming that the resource is some number 
representing a property of the node, for example, how many requests are 
running. But the real resource is the external bandwidth usage, which is rather 
different.

An alternative solution - if we are worried about bursts of requests succeeding 
and therefore timing out - is to allocate a fixed chunk of bandwidth for SSKs, 
and not let them interfere with CHKs. Then CHKs can have long transfer times, 
and therefore long queueing times when they fail (but this is only ~ 40% of the 
time with NLM's accurate routing), and SSKs can have short transfer times, and 
therefore short queueing times. This may be less disruptive and should also be 
relatively easy to implement.
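A rough sketch of that alternative, just to show the shape of it (the split fraction and the names are assumptions):

# Split the output bandwidth liability into two pools so SSKs and CHKs
# cannot starve each other.
SSK_FRACTION = 0.2   # assumed share of output bandwidth set aside for SSKs

class LiabilityPools:
    def __init__(self, total_bytes_per_period):
        self.limits = {
            "SSK": SSK_FRACTION * total_bytes_per_period,
            "CHK": (1 - SSK_FRACTION) * total_bytes_per_period,
        }
        self.reserved = {"SSK": 0.0, "CHK": 0.0}

    def try_reserve(self, kind, size):
        # accept only if this request's own pool has room; the other pool is untouched
        if self.reserved[kind] + size > self.limits[kind]:
            return False
        self.reserved[kind] += size
        return True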

But the other question is, can queueing ever be helpful? It can if it allows us 
to route more accurately (which NLM clearly does), and/or to run enough 
requests in parallel that the longer time taken for the request to reach its 
destination is offset. Is this condition met?
> 
> Best wishes, 
> Arne
> 
> 
> 
> Am Montag, 29. August 2011, 22:27:47 schrieb Matthew Toseland:
> > Okay, I don't understand most of that, I might be able to check the math if
> > it was written properly, but it looks difficult. However, as far as I can
> > see: - The most obvious way to increase bandwidth usage would be to
> > increase the timeout time for output bandwidth liability (and at the same
> > time increase the relevant block transfer timeouts). - This would increase
> > the number of slots but it would also increase the number of requests
> > seeking them; I don't see why it would help matters. - Running an excessive
> > number of requests without adjusting the block transfer timeouts would
> > result in some of the transfers timing out. - It would also, as you
> > mention, make SSK requests (especially failed SSK requests) even slower. -
> > I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. -
> > As far as I can see, all the benefits of NLM re attackers are achieved by
> > fair sharing. - A system which is more resistant to attacks but slower
> > probably isn't all that interesting if the attacks in question are
> > relatively expensive anyway. - Smaller block sizes would have a significant
> > efficiency cost, and would probably make load management more difficult.
> > 
> > I apologise if you see this as rather a broadside after I encouraged you to
> > analyse the problem, but it is not yet a convincing demonstration that
> > queueing is a viable strategy and not the spawn of satan! :)
> > On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> > > Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke:
> > > > Yes, small tweaks have worked so well for us for the last decade,
> > > > leaving us pretty-much where we were in 2003.  No, we don't
> > > > understand how the current system works, there is no point in
> > > > trying to fix something when we don't even know what is broken.
> > > 
> > > I'd like to present a clue what is broken in NLM. Before I kill you with
> > > the log, here's the result:
> > >   With NLM the latency of a request is a function of the raw bandwidth
> > >   (not so with OLM), and NLM used only half my bandwidth after it had
> > >   been deployed for 2 days (at the start much more).
> > > 
> > > τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue
> > > time (time to find the node), nlm: new load management, olm: old load
> > > management.
> > > 
> > > So first step: make sure all bandwidth gets used - maybe by allocating
> > > more slots till we use all allowed bandwidth. Better having to throttle
> > > a transfer than not using bandwidth.
> > > 
> > > *NLM should with the current network be slower than OLM by 23%. But in 18
> > > months it should actually be faster by ~8% – given Moore's Law holds for
> > > upload bandwidth – because the routes are shorter.*
> > > 
> > > The main advantage of NLM is, that it should be much more resilient
> > > 
> > > The main advantage of NLM is, that it should be much more resilient
> > 


[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 20:32:13 Ian Clarke wrote:
> On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland  amphibian.dyndns.org
> > wrote:
> 
> > On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > > Ok, thinking about it, here is a proposal, or  rather, the beginning of a
> > > proposal.  I'm assuming that we get rid of NLM, fair sharing, and
> > anything
> > > else intended to control load, and replace it with this.  We will
> > absolutely
> > > need to simulate this before we write a single line of code to deploy it.
> >
> > I have justified fair sharing in my other reply. However alternate
> > mechanisms might replace it.
> >
> > I agree we should simulate, although this might be significant work.
> >
> 
> I doubt it would be more work than trying it without simulation and failing
> for a decade, which was our previous approach.
> 
> 
> > > The core idea is that a node will include a floating point number in
> > > response to any kind of request showing how close that node is to being
> > > overloaded.  0.0 would mean it's doing nothing, 1.0 would mean that it is
> > > completely saturated and must reject requests.  Clearly the goal is to
> > avoid
> > > getting anywhere near to 1.0.
> > >
> > > A node tracks several things:
> >
> > All of these apply to both locally originated requests and remotely
> > originated requests? And are they for only our direct peers?
> >
> 
> Every peer along the reply path will take note of the load value, so it will
> be a mixture of the loads of close-by peers, and more distant peers.
>  Obviously, since we are more likely to receive messages from peers that are
> "closer", our average will automatically be biased towards closer peers.

So it is from the data source? Or every node along the way?
> 
> >- What is the average load reported by responses this node forwarded,
> > per
> > >remote node
> >
> > Ahhh, this one could be interesting - you could use it to penalise nodes
> > which spam excessively.
> >
> Exactly.
> 
> > > I think, given these metrics, we should be able to do the following:
> > >
> > >- Limit our overall rate of initiating local requests based on the
> > global
> > >average reported load
> >
> > We can already do this on the basis of AIMD's, based purely on behaviour of
> > local requests. However, taking more data into account might result in more
> > accurate results - although it might also dilute the feedback.
> 
> Right, AIMD tends to control the rate at which a specific event occurs, like
> a dropped IP packet with TCP, or a dropped request as we currently use it.
>  This is a bit like signaling that you have a headache by banging your head
> against the wall.  It is also a rather coarse way to deliver feedback.

Fair point.
> 
> A floating point load number conveys more information in a non-destructive
> way.
> 
> In any case, this part is "safe": it's not going to cause feedback problems,
> > on its own. Conversely, it doesn't deal with problems such as network
> > bottlenecks.
> 
> Well, it depends.  If we are tracking load on a per-outbound-connection
> basis then we can potentially take action to manage load per-connection in
> addition to overall.

Not sure I follow here. Routing by load is bad, unless it is severely limited.
> 
> > However, while AIMD's are based on the *global* average load (or rather the
> > load of all the nodes along the chain), it sounds like you are proposing to
> > only report the *local* load of our direct peers? How are you going to
> > compute the global average load?
> 
> No, it's not the local load of our direct peers.  It's the load reported in
> all reply messages that we forward, some of those will be from our direct
> peers, but most will be from remote peers.
> 
> > >- Limit our rate of local requests based on the average load of the
> > >connection to the peer it would need to be forwarded to
> >
> > Not sure I follow here. Are you proposing that we initiate more requests in
> > keyspace areas where there is lower load?
> 
> Well, kinda, except the opposite.  We might reduce requests in keyspaces
> where there is higher load.
> 
> 
> > >- Detect when remote peers are abusing the system by disregarding load
> > -
> > >as evidenced by a significantly higher average load in replies
> > forwarded to
> > >them
> >
> > Okay, that is useful - although I doubt it would be an unambiguous
> > detection, and what would we do about it? We would have to reject requests,
> > and if we're not sure, or if it could be a downstream troublemaker, couldn't
> > that result in misrouting and make things worse? On the other hand, any
> > malicious node would be most easily detected by the nodes directly connected
> > to it, so this could work much the same way as fair sharing - provided that
> > it can reject or terminate the requests.
> 
> No, we probably wouldn't start rejecting requests - more likely we would
> disconnect from that node.  Obviously we'd need to be fairly sure of abusive
> 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
Okay, I don't understand most of that, I might be able to check the math if it 
was written properly, but it looks difficult. However, as far as I can see:
- The most obvious way to increase bandwidth usage would be to increase the 
timeout time for output bandwidth liability (and at the same time increase the 
relevant block transfer timeouts).
- This would increase the number of slots but it would also increase the number 
of requests seeking them; I don't see why it would help matters.
- Running an excessive number of requests without adjusting the block transfer 
timeouts would result in some of the transfers timing out.
- It would also, as you mention, make SSK requests (especially failed SSK 
requests) even slower.
- I am quite confident that Moore's Law DOES NOT hold for upload bandwidth.
- As far as I can see, all the benefits of NLM re attackers are achieved by 
fair sharing.
- A system which is more resistant to attacks but slower probably isn't all 
that interesting if the attacks in question are relatively expensive anyway.
- Smaller block sizes would have a significant efficiency cost, and would 
probably make load management more difficult.

I apologise if you see this as rather a broadside after I encouraged you to 
analyse the problem, but it is not yet a convincing demonstration that queueing 
is a viable strategy and not the spawn of satan! :)

On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke:
> > Yes, small tweaks have worked so well for us for the last decade, leaving us
> > pretty-much where we were in 2003.  No, we don't understand how the current
> > system works, there is no point in trying to fix something when we don't
> > even know what is broken.
> 
> I'd like to present a clue what is broken in NLM. Before I kill you with the 
> log, here's the result: 
> 
>   With NLM the latency of a request is a function of the raw bandwidth 
>   (not so with OLM), and NLM used only half my bandwidth after it had 
>   been deployed for 2 days (at the start much more).
> 
> τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue time 
> (time to find the node), nlm: new load management, olm: old load management.
> 
> So first step: make sure all bandwidth gets used - maybe by allocating more 
> slots till we use all allowed bandwidth. Better having to throttle a transfer 
> than not using bandwidth.
> 
> *NLM should with the current network be slower than OLM by 23%. But in 18 
> months it should actually be faster by ~8% – given Moore's Law holds for 
> upload bandwidth – because the routes are shorter.*
> 
> The main advantage of NLM is, that it should be much more resilient against 
> attackers (DoS).
> 
> Now to the log - it's math and not cleaned up; you have been warned :)
> 
>  SSK-time: σ, CHK-time: χ, success: Xs, fail: Xf. 
>  queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, w: success rate
>  χs = τ(h) + q(h)
>  χf = q(h)
>  χ ~ w₁·χs + (1-w₁)·χf
>  σs = τ(h) + q(h)
>  σf = q(h)
>  σ ~ w₂·σs + (1-w₂)·σf; w₂ ~ 15%
>  num(σ) / num(χ) ~ 1
>  ⇒ time ~ χ + σ
>  q(h) depends on timeouts, as do w₁ and w₂
>  time = w₁·χs + (1-w₁)·χf + w₂·σs + (1-w₂)·σf
>  = w₁·(τ(h) + q(h)) + (1-w₁)·q(h) + w₂·(τ(h) + q(h)) + (1-w₂)·q(h)
>  = t(h)·(w₁+w₂) + 2·q(h)·(2-w₁-w₂)
>  = τ(h)·(w₁+w₂) + 2·q(h)·(2-w₁-w₂)
>  in the congestion case q(h) ~ timeout
>  timeout = o
>  timeout: o
>  w depends on the timeout *somehow*, but inversely
>  o=0 ⇒ w=0
>  assumption: o = ∞ ⇒ w₂ ~ 20%, w₁ ~ 100%
>  assumption: o = ∞ ⇒ w₂ ~ 0.2, w₁ ~ 1
>  correction: in the congestion case: q(h) ~ min(timeout, τ(h))
>  timeout matters for q(h) only when timeout < τ(h)
>  I try to: I still need a dependency of w on timeout
>  → let's call it t(w)
>  better: w(o) :)
>  well, if there is a timeout, we have a fixed time, but we reduce the hops ...
>  i thought w was success rate
>  ah! 
>  and the success rates were in the NLM stats
>  going mostly smoothly from 60% to 0%
>  for the HTL
>  right, success rate peaks at 18 or sometimes 16
>  what are w1 vs w2?
>  chk vs ssk i guess
>  yes
> -*- toad_ thinks considering both is probably overambitious at this stage…
>  should not be too bad: SSKs drop much more rapidly at decreasing hops
>  hops → HTL
>  χs is time for a successful chk; χf is time for a failed chk ... in which case h in the first instance is low, and in the second instance is h0
>  yes
>  okay, i don't follow this line: time = w₁·χs + (1-w₁)·χf + w₂·σs + (1-w₂)·σf
>  i thought w2 related to SSKs?
>  uh, yes?
>  time = w₁·χs + (1-w₁)·χf + w₂·σs + (1-w₂)·σf
>  you have to appreciate i'm only just getting back into maths and physics after 12 years ...
>  (retaking a-levels to get a degree)
>  no probs, I'm also no expert in this. I try to get a relation between the time and the timeout, so we can try to find a minimum
>  in any case, there are two different h's for the 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Arne Babenhauserheide
Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke:
> Yes, small tweaks have worked so well for us for the last decade, leaving us
> pretty-much where we were in 2003.  No, we don't understand how the current
> system works, there is no point in trying to fix something when we don't
> even know what is broken.

I'd like to present a clue what is broken in NLM. Before I kill you with the 
log, here's the result: 


With NLM the latency of a request is a function of the raw bandwidth 
(not so with OLM), and NLM used only half my bandwidth after it had 
been deployed for 2 days (at the start much more).


τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue time 
(time to find the node), nlm: new load management, olm: old load management.



So first step: make sure all bandwidth gets used - maybe by allocating more 
slots till we use all allowed bandwidth. Better having to throttle a transfer 
than not using bandwidth.


*NLM should with the current network be slower than OLM by 23%. But in 18 
months it should actually be faster by ~8% – given Moore's Law holds for upload 
bandwidth – because the routes are shorter.*


The main advantage of NLM is, that it should be much more resilient against 
attackers (DoS).


Now to the log - it's math and not cleaned up; you have been warned :)


 SSK-time: σ, CHK-time: χ, success: Xs, fail: Xf. 
 queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, w: success rate
 χs = τ(h) + q(h)
 χf = q(h)
 χ ~ w₁·χs + (1-w₁)·χf
 σs = τ(h) + q(h)
 σf = q(h)
 σ ~ w₂·σs + (1-w₂)·σf; w₂ ~ 15%
 num(σ) / num(χ) ~ 1
 ⇒ time ~ χ + σ
 q(h) depends on timeouts, as do w₁ and w₂
 time = w₁·χs + (1-w₁)·χf + w₂·σs + (1-w₂)·σf
 = w₁·(τ(h) + q(h)) + (1-w₁)·q(h) + w₂·(τ(h) + q(h)) + (1-w₂)·q(h)
 = t(h)·(w₁+w₂) + 2·q(h)·(2-w₁-w₂)
 = τ(h)·(w₁+w₂) + 2·q(h)·(2-w₁-w₂)
 in the congestion case q(h) ~ timeout
 timeout = o
 timeout: o
 w depends on the timeout *somehow*, but inversely
 o=0 ⇒ w=0
 assumption: o = ∞ ⇒ w₂ ~ 20%, w₁ ~ 100%
 assumption: o = ∞ ⇒ w₂ ~ 0.2, w₁ ~ 1
 correction: in the congestion case: q(h) ~ min(timeout, τ(h))
 timeout matters for q(h) only when timeout < τ(h)
 I try to: I still need a dependency of w on timeout
 → let's call it t(w)
 better: w(o) :)
 well, if there is a timeout, we have a fixed time, but we reduce the hops ...
 i thought w was success rate
 ah! 
 and the success rates were in the NLM stats
 going mostly smoothly from 60% to 0%
 for the HTL
 right, success rate peaks at 18 or sometimes 16
 what are w1 vs w2?
 chk vs ssk i guess
 yes
-*- toad_ thinks considering both is probably overambitious at this stage…
 should not be too bad: SSKs drop much more rapidly at decreasing hops
 hops → HTL
 χs is time for a successful chk; χf is time for a failed chk ... in which case h in the first instance is low, and in the second instance is h0
 yes
 okay, i don't follow this line: time = w₁·χs + (1-w₁)·χf + w₂·σs + (1-w₂)·σf
 i thought w2 related to SSKs?
 uh, yes?
 time = w₁·χs + (1-w₁)·χf + w₂·σs + (1-w₂)·σf
 you have to appreciate i'm only just getting back into maths and physics after 12 years ...
 (retaking a-levels to get a degree)
 no probs, I'm also no expert in this. I try to get a relation between the time and the timeout, so we can try to find a minimum
 in any case, there are two different h's for the two uses of q(h) - h0 and h_avg
 h_avg for success and h0 for failure
 hm, yes
 which makes this harder…
 it's wrong anyway… the q(h_avg) was missing
 h_avg is somewhere between 5 and 10 imho
 at least it is if everything is working well and the input load isn't all really popular stuff (in which case it's answered quickly and can be ignored)
 = τ(h)·(w₁+w₂) + q(h)·(2-w₁-w₂) + q(h_avg)·(w₁+w₂)
 would have been too easy :)
 okay so here q(h) means q(h0) i.e. h = h0, max hops?
 yep, and max hops sinks with falling timeout
 hmm?
 the max actual hops
 on the upside, q() is linear
<-- Torgal (~Torgal at 78.251.49.8) has left the network (Ping timeout: 276 seconds)
 hopefully
 yes: q(h) = h·o
 (in the congestion case)
 the problem i have is it looks like q(1) ~= time [ for a full request ], unless load is very low
 so τ ≪ q?
 τ much smaller than q?
 of course it's bounded by timeouts, but i'd expect a runaway feedback loop until it reaches heavy timeouts and effectively cuts the htl
 well, with OLM, success time for a CHK is 1m25s, unsuccessful is 19sec, so transfer time is at least 1 minute
 and less than 1m25; but with NLM, unsuccessful is 3 min+
 well, for SSKs in OLM the first 6 hops are the most successful, later ones only contribute 1% success, which piles up to ~ 12%
 okay...
 (only possible because 85% are unsuccessful)
 (otherwise this would be wrong: the contribution of later ones would be smaller)
 from the numbers q(h₀) ~ τ(h_avg)
 well, queueing time on any hop is the time it takes to get a slot to route to, which is roughly equal to the time it takes 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 20:09:58 Matthew Toseland wrote:
> On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > Ok, thinking about it, here is a proposal, or  rather, the beginning of a
> > proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
> > else intended to control load, and replace it with this.  We will absolutely
> > need to simulate this before we write a single line of code to deploy it.
> 
> I have justified fair sharing in my other reply. However alternate mechanisms 
> might replace it.
> 
> I agree we should simulate, although this might be significant work.
> > 
> > The core idea is that a node will include a floating point number in
> > response to any kind of request showing how close that node is to being
> > overloaded.  0.0 would mean it's doing nothing, 1.0 would mean that it is
> > completely saturated and must reject requests.  Clearly the goal is to avoid
> > getting anywhere near to 1.0.
> > 
> > A node tracks several things:
> 
> All of these apply to both locally originated requests and remotely 
> originated requests? And are they for only our direct peers?
> > 
> >- What is the overall average load reported by responses this node has
> >received
> >- What is the average load reported by responses this node has received,
> >per remote node
> >- What is the average load reported by responses this node forwarded, per
> >remote node
> 
> Ahhh, this one could be interesting - you could use it to penalise nodes 
> which spam excessively.

Actually, thinking about it, I don't think this will work. Requests are usually 
distributed evenly across the keyspace, and downstream nodes only know they are 
from this node - not which previous node they are from. So the pain will be 
shared across all the peers sending us requests, including our own local 
requests, and won't be identifiable as due to any single node. You have to do 
it by volume, not by reported load.



[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> Ok, thinking about it, here is a proposal, or  rather, the beginning of a
> proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
> else intended to control load, and replace it with this.  We will absolutely
> need to simulate this before we write a single line of code to deploy it.

I have justified fair sharing in my other reply. However alternate mechanisms 
might replace it.

I agree we should simulate, although this might be significant work.
> 
> The core idea is that a node will include a floating point number in
> response to any kind of request showing how close that node is to being
> overloaded.  0.0 would mean it's doing nothing, 1.0 would mean that it is
> completely saturated and must reject requests.  Clearly the goal is to avoid
> getting anywhere near to 1.0.
> 
> A node tracks several things:

All of these apply to both locally originated requests and remotely originated 
requests? And are they for only our direct peers?
> 
>- What is the overall average load reported by responses this node has
>received
>- What is the average load reported by responses this node has received,
>per remote node
>- What is the average load reported by responses this node forwarded, per
>remote node

Ahhh, this one could be interesting - you could use it to penalise nodes which 
spam excessively.
> 
> I think, given these metrics, we should be able to do the following:
> 
>- Limit our overall rate of initiating local requests based on the global
>average reported load

We can already do this on the basis of AIMD's, based purely on behaviour of 
local requests. However, taking more data into account might result in more 
accurate results - although it might also dilute the feedback.
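For reference, the shape of the AIMD control we already have (a schematic sketch with illustrative constants, not the node's actual values):

# Additive increase / multiplicative decrease on the local request rate:
# creep the allowed rate up while requests complete, cut it sharply on a
# rejection or drop.
class AIMDRateLimit:
    def __init__(self, rate=1.0, increase=0.1, decrease=0.5, floor=0.1):
        self.rate = rate          # allowed requests per second
        self.increase = increase  # additive step on success
        self.decrease = decrease  # multiplicative factor on rejection
        self.floor = floor

    def on_success(self):
        self.rate += self.increase

    def on_rejected(self):
        self.rate = max(self.floor, self.rate * self.decrease)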

In any case, this part is "safe": it's not going to cause feedback problems, on 
its own. Conversely, it doesn't deal with problems such as network bottlenecks.

However, while AIMD's are based on the *global* average load (or rather the 
load of all the nodes along the chain), it sounds like you are proposing to 
only report the *local* load of our direct peers? How are you going to compute 
the global average load?

>- Limit our rate of local requests based on the average load of the
>connection to the peer it would need to be forwarded to

Not sure I follow here. Are you proposing that we initiate more requests in 
keyspace areas where there is lower load?

>- Detect when remote peers are abusing the system by disregarding load -
>as evidenced by a significantly higher average load in replies forwarded to
>them

Okay, that is useful - although I doubt it would be an unambiguous detection, 
and what would we do about it? We would have to reject requests, and if we're 
not sure, or if it could be a downstream troublemaker, couldn't that result in 
misrouting and make things worse? On the other hand, any malicious node would 
be most easily detected by the nodes directly connected to it, so this could 
work much the same way as fair sharing - provided that it can reject or 
terminate the requests.

Arguably fair sharing achieves the same objective. It could be extended in 
various ways e.g. if we are relaying a lot of slow-down notices to a peer we 
could start to think about squeezing the peer's allocation.
> 
> Of course, lots of questions:
> 
>- How do we compute the averages?  A decaying running average of some
>kind?  What time period?

Yay more tunables. :(
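For what it's worth, the usual decaying average is only a couple of lines (a sketch; the smoothing constant is exactly the kind of tunable I mean):

# Exponentially decaying running average of reported load values.
# alpha near 1.0 = long memory, near 0.0 = reacts quickly.
class DecayingAverage:
    def __init__(self, alpha=0.95):
        self.alpha = alpha
        self.value = None

    def report(self, load):
        if self.value is None:
            self.value = load
        else:
            self.value = self.alpha * self.value + (1 - self.alpha) * load
        return self.value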

>- How do we translate load into a desired rate of requests?

And even more tunables. :(

>- What criteria indicate that a peer is abusing the system?  What is the
>remedy?

In the current system, don't accept their requests.

In ANY system, there will always be a borderline below which they can get away 
with it. We can either try to make this as small as possible or try not to hit 
legitimate traffic too hard.
> 
> This is basically control theory stuff, and we'll need robust answers to
> these questions before we proceed (ideally with a theoretical rather than
> experimental foundation).

IMHO small tweaks to AIMDs, combined with fair sharing, will achieve much the 
same without having to wait years for our theoreticians to wake up.



[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
On Mon, Aug 29, 2011 at 2:21 PM, Matthew Toseland  wrote:

>  > >- What is the average load reported by responses this node
> forwarded, per
> > >remote node
> >
> > Ahhh, this one could be interesting - you could use it to penalise nodes
> which spam excessively.
>
> Actually, thinking about it, I don't think this will work. Requests are
> usually distributed evenly across the keyspace, and downstream nodes only
> know they are from this node - not which previous node they are from. So the
> pain will be shared across all the peers sending us requests, including our
> own local requests, and won't be identifiable as due to any single node. You
> have to do it by volume, not by reported load.
>

But if a node is being abusive, won't the nodes it's requesting from tend to
have a higher than average load?  Perhaps you are right though, but this is
the kind of thought-process we need.

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: ian at freenetproject.org



[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland  wrote:

> On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > Ok, thinking about it, here is a proposal, or  rather, the beginning of a
> > proposal.  I'm assuming that we get rid of NLM, fair sharing, and
> anything
> > else intended to control load, and replace it with this.  We will
> absolutely
> > need to simulate this before we write a single line of code to deploy it.
>
> I have justified fair sharing in my other reply. However alternate
> mechanisms might replace it.
>
> I agree we should simulate, although this might be significant work.
>

I doubt it would be more work than trying it without simulation and failing
for a decade, which was our previous approach.


> > The core idea is that a node will include a floating point number in
> > response to any kind of request showing how close that node is to being
> > overloaded.  0.0 would mean it's doing nothing, 1.0 would mean that it is
> > completely saturated and must reject requests.  Clearly the goal is to
> avoid
> > getting anywhere near to 1.0.
> >
> > A node tracks several things:
>
> All of these apply to both locally originated requests and remotely
> originated requests? And are they for only our direct peers?
>

Every peer along the reply path will take note of the load value, so it will
be a mixture of the loads of close-by peers, and more distant peers.
 Obviously, since we are more likely to receive messages from peers that are
"closer", our average will automatically be biased towards closer peers.

>- What is the average load reported by responses this node forwarded,
> per
> >remote node
>
> Ahhh, this one could be interesting - you could use it to penalise nodes
> which spam excessively.
>

Exactly.


> > I think, given these metrics, we should be able to do the following:
> >
> >- Limit our overall rate of initiating local requests based on the
> global
> >average reported load
>
> We can already do this on the basis of AIMD's, based purely on behaviour of
> local requests. However, taking more data into account might result in more
> accurate results - although it might also dilute the feedback.
>

Right, AIMD tends to control the rate at which a specific event occurs, like
a dropped IP packet with TCP, or a dropped request as we currently use it.
 This is a bit like signaling that you have a headache by banging your head
against the wall.  It is also a rather coarse way to deliver feedback.

A floating point load number conveys more information in a non-destructive
way.

In any case, this part is "safe": it's not going to cause feedback problems,
> on its own. Conversely, it doesn't deal with problems such as network
> bottlenecks.
>

Well, it depends.  If we are tracking load on a per-outbound-connection
basis then we can potentially take action to manage load per-connection in
addition to overall.


> However, while AIMD's are based on the *global* average load (or rather the
> load of all the nodes along the chain), it sounds like you are proposing to
> only report the *local* load of our direct peers? How are you going to
> compute the global average load?
>

No, it's not the local load of our direct peers.  It's the load reported in
all reply messages that we forward, some of those will be from our direct
peers, but most will be from remote peers.


> >- Limit our rate of local requests based on the average load of the
> >connection to the peer it would need to be forwarded to
>
> Not sure I follow here. Are you proposing that we initiate more requests in
> keyspace areas where there is lower load?
>

Well, kinda, except the opposite.  We might reduce requests in keyspaces
where there is higher load.


> >- Detect when remote peers are abusing the system by disregarding load
> -
> >as evidenced by a significantly higher average load in replies
> forwarded to
> >them
>
> Okay, that is useful - although I doubt it would be an unambiguous
> detection, and what would we do about it? We would have to reject requests,
> and if we're not sure, or if it could be a downstream troublemaker, couldn't
> that result in misrouting and make things worse? On the other hand, any
> malicious node would be most easily detected by the nodes directly connected
> to it, so this could work much the same way as fair sharing - provided that
> it can reject or terminate the requests.
>

No, we probably wouldn't start rejecting requests - more likely we would
disconnect from that node.  Obviously we'd need to be fairly sure of abusive
behavior before doing that - but abuse should be rare.


> Arguably fair sharing achieves the same objective. It could be extended in
> various ways e.g. if we are relaying a lot of slow-down notices to a peer we
> could start to think about squeezing the peer's allocation.
>

Yes, except it doesn't work, and it's too complicated to reason about.


> > Of course, lots of questions:
> >
> >- How do we compute the averages?  A 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
Ok, thinking about it, here is a proposal, or  rather, the beginning of a
proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
else intended to control load, and replace it with this.  We will absolutely
need to simulate this before we write a single line of code to deploy it.

The core idea is that a node will include a floating point number in
response to any kind of request showing how close that node is to being
overloaded.  0.0 would mean it's doing nothing, 1.0 would mean that it is
completely saturated and must reject requests.  Clearly the goal is to avoid
getting anywhere near to 1.0.

A node tracks several things:

   - What is the overall average load reported by responses this node has
   received
   - What is the average load reported by responses this node has received,
   per remote node
   - What is the average load reported by responses this node forwarded, per
   remote node

I think, given these metrics, we should be able to do the following:

   - Limit our overall rate of initiating local requests based on the global
   average reported load
   - Limit our rate of local requests based on the average load of the
   connection to the peer it would need to be forwarded to
   - Detect when remote peers are abusing the system by disregarding load -
   as evidenced by a significantly higher average load in replies forwarded to
   them

Of course, lots of questions:

   - How do we compute the averages?  A decaying running average of some
   kind?  What time period?
   - How do we translate load into a desired rate of requests?
   - What criteria indicate that a peer is abusing the system?  What is the
   remedy?

This is basically control theory stuff, and we'll need robust answers to
these questions before we proceed (ideally with a theoretical rather than
experimental foundation).
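
A minimal sketch of what the tracked averages and the load-to-rate mapping
could look like (Python, purely illustrative: the decaying-average form, the
time constant and the linear back-off are assumptions, not part of the
proposal):

import math


class DecayingAverage:
    """Exponentially decaying running average over irregularly spaced samples."""

    def __init__(self, time_constant_seconds=60.0):
        self.tau = time_constant_seconds   # how quickly old load reports fade out
        self.value = None
        self.last_time = None

    def report(self, load, now):
        if self.value is None:
            self.value = load
        else:
            alpha = 1.0 - math.exp(-(now - self.last_time) / self.tau)
            self.value += alpha * (load - self.value)
        self.last_time = now
        return self.value


class LoadTracker:
    """Keeps the three averages from the proposal and derives a request rate."""

    def __init__(self, max_requests_per_minute=100.0):
        self.max_rate = max_requests_per_minute
        self.global_avg = DecayingAverage()
        self.received_per_peer = {}    # peer -> avg load in replies we received from it
        self.forwarded_per_peer = {}   # peer -> avg load in replies we forwarded to it

    def on_reply_received(self, peer, load, now):
        self.global_avg.report(load, now)
        self.received_per_peer.setdefault(peer, DecayingAverage()).report(load, now)

    def on_reply_forwarded(self, peer, load, now):
        self.forwarded_per_peer.setdefault(peer, DecayingAverage()).report(load, now)

    def local_request_rate(self):
        """Back off linearly as the global average load approaches 1.0."""
        load = self.global_avg.value or 0.0
        return self.max_rate * max(0.0, 1.0 - load)


if __name__ == "__main__":
    tracker = LoadTracker()
    for step, load in enumerate([0.2, 0.4, 0.7, 0.9]):
        tracker.on_reply_received("peer-A", load, now=float(step * 10))
        print(f"t={step * 10:3d}s  global avg={tracker.global_avg.value:.2f}  "
              f"allowed rate={tracker.local_request_rate():.0f}/min")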

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: ian at freenetproject.org

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
 Ok, thinking about it, here is a proposal, or  rather, the beginning of a
 proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
 else intended to control load, and replace it with this.  We will absolutely
 need to simulate this before we write a single line of code to deploy it.

I have justified fair sharing in my other reply. However alternate mechanisms 
might replace it.

I agree we should simulate, although this might be significant work.
 
 The core idea is that a node will include a floating point number in
 response to any kind of request showing how close that node is to being
 overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
 completely saturated and must reject requests.  Clearly the goal is to avoid
 getting anywhere near to 1.0.
 
 A node tracks several things:

All of these apply to both locally originated requests and remotely originated 
requests? And are they for only our direct peers?
 
- What is the overall average load reported by responses this node has
received
- What is the average load reported by responses this node has received,
per remote node
- What is the average load reported by responses this node forwarded, per
remote node

Ahhh, this one could be interesting - you could use it to penalise nodes which 
spam excessively.
 
 I think, given these metrics, we should be able to do the following:
 
- Limit our overall rate of initiating local requests based on the global
average reported load

We can already do this on the basis of AIMD's, based purely on behaviour of 
local requests. However, taking more data into account might result in more 
accurate results - although it might also dilute the feedback.

In any case, this part is safe: it's not going to cause feedback problems, on 
its own. Conversely, it doesn't deal with problems such as network bottlenecks.

However, while AIMD's are based on the *global* average load (or rather the 
load of all the nodes along the chain), it sounds like you are proposing to 
only report the *local* load of our direct peers? How are you going to compute 
the global average load?

- Limit our rate of local requests based on the average load of the
connection to the peer it would need to be forwarded to

Not sure I follow here. Are you proposing that we initiate more requests in 
keyspace areas where there is lower load?

- Detect when remote peers are abusing the system by disregarding load -
as evidenced by a significantly higher average load in replies forwarded to
them

Okay, that is useful - although I doubt it would be an unambiguous detection, 
and what would we do about it? We would have to reject requests, and if we're 
not sure, or if it could be a downstream troublemaker, couldn't that result in 
misrouting and make things worse? On the other hand, any malicious node would 
be most easily detected by the nodes directly connected to it, so this could 
work much the same way as fair sharing - provided that it can reject or 
terminate the requests.

Arguably fair sharing achieves the same objective. It could be extended in 
various ways e.g. if we are relaying a lot of slow-down notices to a peer we 
could start to think about squeezing the peer's allocation.
 
 Of course, lots of questions:
 
- How do we compute the averages?  A decaying running average of some
kind?  What time period?

Yay more tunables. :(

- How do we translate load into a desired rate of requests?

And even more tunables. :(

- What criteria indicate that a peer is abusing the system?  What is the
remedy?

In the current system, don't accept their requests.

In ANY system, there will always be a borderline below which they can get away 
with it. We can either try to make this as small as possible or try not to hit 
legitimate traffic too hard.
 
 This is basically control theory stuff, and we'll need robust answers to
 these questions before we proceed (ideally with a theoretical rather than
 experimental foundation).

IMHO small tweaks to AIMDs, combined with fair sharing, will achieve much the 
same without having to wait years for our theoreticians to wake up.



Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 20:09:58 Matthew Toseland wrote:
 On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
  Ok, thinking about it, here is a proposal, or  rather, the beginning of a
  proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
  else intended to control load, and replace it with this.  We will absolutely
  need to simulate this before we write a single line of code to deploy it.
 
 I have justified fair sharing in my other reply. However alternate mechanisms 
 might replace it.
 
 I agree we should simulate, although this might be significant work.
  
  The core idea is that a node will include a floating point number in
  response to any kind of request showing how close that node is to being
  overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
  completely saturated and must reject requests.  Clearly the goal is to avoid
  getting anywhere near to 1.0.
  
  A node tracks several things:
 
 All of these apply to both locally originated requests and remotely 
 originated requests? And are they for only our direct peers?
  
 - What is the overall average load reported by responses this node has
 received
 - What is the average load reported by responses this node has received,
 per remote node
 - What is the average load reported by responses this node forwarded, per
 remote node
 
 Ahhh, this one could be interesting - you could use it to penalise nodes 
 which spam excessively.

Actually, thinking about it, I don't think this will work. Requests are usually 
distributed evenly across the keyspace, and downstream nodes only know they are 
from this node - not which previous node they are from. So the pain will be 
shared across all the peers sending us requests, including our own local 
requests, and won't be identifiable as due to any single node. You have to do 
it by volume, not by reported load.
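
A hedged sketch of what accounting "by volume" could mean in practice: count
the requests each peer has sent us over a recent window and flag peers whose
share is far above an even split (the window and the threshold are arbitrary
assumptions, not anything that exists in the node today):

from collections import defaultdict, deque


class VolumeAccountant:
    """Per-peer request volume over a sliding time window."""

    def __init__(self, window_seconds=300.0, excess_factor=3.0):
        self.window = window_seconds
        self.excess_factor = excess_factor    # how far above fair share counts as abusive
        self.events = defaultdict(deque)      # peer -> timestamps of accepted requests

    def record_request(self, peer, now):
        self.events[peer].append(now)
        self._expire(peer, now)

    def _expire(self, peer, now):
        q = self.events[peer]
        while q and now - q[0] > self.window:
            q.popleft()

    def suspicious_peers(self, now):
        for peer in list(self.events):
            self._expire(peer, now)
        counts = {p: len(q) for p, q in self.events.items() if q}
        total = sum(counts.values())
        if total == 0:
            return []
        fair_share = total / len(counts)
        return [p for p, c in counts.items() if c > self.excess_factor * fair_share]


if __name__ == "__main__":
    acct = VolumeAccountant()
    for i in range(200):
        acct.record_request("noisy-peer", now=float(i))
    for p in ("a", "b", "c", "d", "e"):
        for i in range(10):
            acct.record_request(p, now=float(i))
    print(acct.suspicious_peers(now=250.0))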



Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland t...@amphibian.dyndns.org
 wrote:

 On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
  Ok, thinking about it, here is a proposal, or  rather, the beginning of a
  proposal.  I'm assuming that we get rid of NLM, fair sharing, and
 anything
  else intended to control load, and replace it with this.  We will
 absolutely
  need to simulate this before we write a single line of code to deploy it.

 I have justified fair sharing in my other reply. However alternate
 mechanisms might replace it.

 I agree we should simulate, although this might be significant work.


I doubt it would be more work than trying it without simulation and failing
for a decade, which was our previous approach.


  The core idea is that a node will include a floating point number in
  response to any kind of request showing how close that node is to being
  overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
  completely saturated and must reject requests.  Clearly the goal is to
 avoid
  getting anywhere near to 1.0.
 
  A node tracks several things:

 All of these apply to both locally originated requests and remotely
 originated requests? And are they for only our direct peers?


Every peer along the reply path will take note of the load value, so it will
be a mixture of the loads of close-by peers, and more distant peers.
 Obviously, since we are more likely to receive messages from peers that are
closer, our average will automatically be biased towards closer peers.

- What is the average load reported by responses this node forwarded,
 per
 remote node

 Ahhh, this one could be interesting - you could use it to penalise nodes
 which spam excessively.


Exactly.


  I think, given these metrics, we should be able to do the following:
 
 - Limit our overall rate of initiating local requests based on the
 global
 average reported load

 We can already do this on the basis of AIMD's, based purely on behaviour of
 local requests. However, taking more data into account might result in more
 accurate results - although it might also dilute the feedback.


Right, AIMD tends to control the rate at which a specific event occurs, like
a dropped IP packet with TCP, or a dropped request as we currently use it.
 This is a bit like signaling that you have a headache by banging your head
against the wall.  It is also a rather coarse way to deliver feedback.

A floating point load number conveys more information in a non-destructive
way.
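
As a toy illustration of the difference (arbitrary constants, not Freenet
code): AIMD only ever reacts to the binary accept/reject signal, whereas a
load scalar permits a proportional adjustment before anything is dropped:

def aimd_update(rate, request_rejected, increase=1.0, decrease_factor=0.5):
    """Additive increase while things work, multiplicative decrease on a rejection."""
    return rate * decrease_factor if request_rejected else rate + increase


def proportional_update(rate, reported_load, target_load=0.8, gain=0.2,
                        min_rate=1.0, max_rate=200.0):
    """Nudge the rate up or down in proportion to how far the reported load
    is from the level we are willing to push the network to."""
    rate *= 1.0 + gain * (target_load - reported_load)
    return max(min_rate, min(max_rate, rate))


if __name__ == "__main__":
    rate = 50.0
    for load in (0.30, 0.55, 0.70, 0.85, 0.95):
        rate = proportional_update(rate, load)
        print(f"reported load {load:.2f} -> local request rate {rate:5.1f}/min")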

In any case, this part is safe: it's not going to cause feedback problems,
 on its own. Conversely, it doesn't deal with problems such as network
 bottlenecks.


Well, it depends.  If we are tracking load on a per-outbound-connection
basis then we can potentially take action to manage load per-connection in
addition to overall.


 However, while AIMD's are based on the *global* average load (or rather the
 load of all the nodes along the chain), it sounds like you are proposing to
 only report the *local* load of our direct peers? How are you going to
 compute the global average load?


No, it's not the local load of our direct peers.  It's the load reported in
all reply messages that we forward; some of those will be from our direct
peers, but most will be from remote peers.


 - Limit our rate of local requests based on the average load of the
 connection to the peer it would need to be forwarded to

 Not sure I follow here. Are you proposing that we initiate more requests in
 keyspace areas where there is lower load?


Well, kinda, except the opposite.  We might reduce requests in keyspaces
where there is higher load.


 - Detect when remote peers are abusing the system by disregarding load
 -
 as evidenced by a significantly higher average load in replies
 forwarded to
 them

 Okay, that is useful - although I doubt it would be an unambiguous
 detection, and what would we do about it? We would have to reject requests,
 and if we're not sure, or if it could be a downstream troublemaker, couldn't
 that result in misrouting and make things worse? On the other hand, any
 malicious node would be most easily detected by the nodes directly connected
 to it, so this could work much the same way as fair sharing - provided that
 it can reject or terminate the requests.


No, we probably wouldn't start rejecting requests - more likely we would
disconnect from that node.  Obviously we'd need to be fairly sure of abusive
behavior before doing that - but abuse should be rare.


 Arguably fair sharing achieves the same objective. It could be extended in
 various ways e.g. if we are relaying a lot of slow-down notices to a peer we
 could start to think about squeezing the peer's allocation.


Yes, except it doesn't work, and it's too complicated to reason about.


  Of course, lots of questions:
 
 - How do we compute the averages?  A decaying running average of some
 kind?  What time period?

 Yay more tunables. 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
On Mon, Aug 29, 2011 at 2:21 PM, Matthew Toseland t...@amphibian.dyndns.org
 wrote:

   - What is the average load reported by responses this node
 forwarded, per
  remote node
 
  Ahhh, this one could be interesting - you could use it to penalise nodes
 which spam excessively.

 Actually, thinking about it, I don't think this will work. Requests are
 usually distributed evenly across the keyspace, and downstream nodes only
 know they are from this node - not which previous node they are from. So the
 pain will be shared across all the peers sending us requests, including our
 own local requests, and won't be identifiable as due to any single node. You
 have to do it by volume, not by reported load.


But if a node is being abusive, won't the nodes it's requesting from tend to
have a higher than average load?  Perhaps you are right though, but this is
the kind of thought-process we need.

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: i...@freenetproject.org

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Arne Babenhauserheide
On Monday, 29 August 2011, 14:32:13, Ian Clarke wrote:
 Yes, small tweaks have worked so well for us for the last decade, leaving us
 pretty-much where we were in 2003.  No, we don't understand how the current
 system works, there is no point in trying to fix something when we don't
 even know what is broken.

I’d like to present a clue about what is broken in NLM. Before I kill you with the
log, here’s the result:


With NLM the latency of a request is a function of the raw bandwidth
(not so with OLM), and NLM used only half my bandwidth after it had
been deployed for 2 days (at the start much more).


τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue time
(time to find the node), nlm: new load management, olm: old load management.



So first step: make sure all bandwidth gets used - maybe by allocating more
slots till we use all allowed bandwidth. Better having to throttle a transfer
than not using bandwidth.


*With the current network, NLM should be slower than OLM by 23%. But in 18
months it should actually be faster by ~8% — given Moore's Law holds for upload
bandwidth — because the routes are shorter.*


The main advantage of NLM is, that it should be much more resilient against
attackers (DoS).
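
A rough sketch of how the expected-time model derived in the log below can be
evaluated (symbols as defined there: w1/w2 are CHK/SSK success rates, tau the
transfer time, q the queueing time, o the timeout; every number used here is a
placeholder, not a measurement):

def expected_time(w1, w2, tau_success, timeout, h0=18, h_avg=7):
    """time ~ (w1 + w2) * (tau(h_avg) + q(h_avg)) + (2 - w1 - w2) * q(h0),
    with q(h) ~ h * o in the congested case (queueing dominated by the timeout)."""
    def q(hops):
        return hops * timeout
    successful = (w1 + w2) * (tau_success + q(h_avg))
    failed = (2 - w1 - w2) * q(h0)
    return successful + failed


if __name__ == "__main__":
    # One CHK plus one SSK (num(psi)/num(sigma) ~ 1), under three timeout settings.
    for timeout in (0.5, 1.0, 2.0):
        t = expected_time(w1=0.6, w2=0.15, tau_success=60.0, timeout=timeout)
        print(f"per-hop queue time {timeout:4.1f}s -> expected CHK+SSK time {t:6.1f}s")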


Now to the log - it’s math and not cleaned up; you have been warned :)


ArneBab SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf.
ArneBab queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀,
w: success rate
ArneBab ψs = τ(h) + q(h)
ArneBab ψf = q(h)
ArneBab ψ ~ w₁·ψs + (1-w₁)·ψf
ArneBab σs = τ(h) + q(h)
ArneBab σf = q(h)
ArneBab σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
ArneBab num(ψ) / num(σ) ~ 1
ArneBab → time ~ σ + ψ
ArneBab q(h) depends on timeouts, as do w₁ and w₂
ArneBab time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + (1-w₂)·ψf
ArneBab = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-
w₂)·q(h)
ArneBab = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
ArneBab = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
ArneBab in the congestion case q(h) ~ timeout
ArneBab timeout = o
ArneBab timeout: o
ArneBab w depends on the timeout *somehow*, but inversely
ArneBab o=0 → w=0
ArneBab assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
ArneBab assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
ArneBab correction: in the congestion case: q(h) ~ min(timeout, τ(h))
ArneBab timeout matters for q(h) only when timeout < τ(h)
ArneBab I try to: I still need a dependency of w on timeout
ArneBab … lets call it t(w)
ArneBab better: w(o) :)
toad_ well, if there is a timeout, we have a fixed time, but we reduce the
hops ...
toad_ i thought w was success rate
ArneBab ah!
ArneBab and the success rates were in the NLM stats
ArneBab going mostly smoothly from 60% to 0%
ArneBab for the HTL
toad_ right, success rate peaks at 18 or sometimes 16
toad_ what are w1 vs w2?
toad_ chk vs ssk i guess
ArneBab yes
-*- toad_ thinks considering both is probably overambitious at this stage?
ArneBab should not be too bad: SSKs drop much more rapidly at decreasing
hops
ArneBab hops→HTL
toad_ ψs is time for a successful chk; ψf is time for a failed chk ... in
which case h in the first instance is low, and in the second instance is h0
ArneBab yes
toad_ okay, i don't follow this line: time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs +
(1-w₂)·ψf
toad_ i thought w2 related to SSKs?
ArneBab uh, yes…
ArneBab time =  w₁·ψs + (1-w₁)·ψf +  w₂·σs + (1-w₂)·σf
toad_ you have to appreciate i'm only just getting back into maths and
physics after 12 years ...
toad_ (retaking a-levels to get a degree)
ArneBab no probs, I’m also no expert in this. I try to get a relation
between the time and the timeout, so we can try to find a minimum
toad_ in any case, there are two different h's for the two uses of q(h) - h0
and h_avg
toad_ h_avg for success and h0 for failure
ArneBab hm, yes
ArneBab which makes this harder…
ArneBab it’s wrong anyway… the q(h_avg) was missing
toad_ h_avg is somewhere between 5 and 10 imho
toad_ at least it is if everything is working well and the input load isn't
all really popular stuff (in which case it's answered quickly and can be
ignored)
ArneBab = τ(h) · (w₁+w₂) + q(h) · (2-w₁-w₂) + q(h_avg) · (w₁+w₂)
ArneBab would have been too easy :)
toad_ okay so here q(h) means q(h0) i.e. h = h0, max hops?
ArneBab jepp, and max hops sinks with falling timeout
toad_ hmm?
ArneBab the max actual hops
toad_ on the upside, q() is linear
-- Torgal (~Torgal@78.251.49.8) hat das Netzwerk verlassen (Ping timeout: 276
seconds)
toad_ hopefully
ArneBab yes: q(h) = h·o
ArneBab (in the congestion case)
toad_ the problem i have is it looks like q(1) ~= time [ for a full request
], unless load is very low
ArneBab so τ « q?
ArneBab τ much smaller than q?
toad_ of course it's bounded by timeouts, but i'd expect a runaway feedback
loop until it reaches heavy timeouts and effectively cuts the htl
toad_ well, with OLM, success time for a CHK is 1m25s, unsuccessful is
19sec, so transfer time is at least 1 minute
toad_ and less than 1m25; but with NLM, unsuccessful is 3 min+
ArneBab 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
Okay, I don't understand most of that; I might be able to check the math if it 
were written properly, but it looks difficult. However, as far as I can see:
- The most obvious way to increase bandwidth usage would be to increase the 
timeout time for output bandwidth liability (and at the same time increase the 
relevant block transfer timeouts).
- This would increase the number of slots but it would also increase the number 
of requests seeking them; I don't see why it would help matters.
- Running an excessive number of requests without adjusting the block transfer 
timeouts would result in some of the transfers timing out.
- It would also, as you mention, make SSK requests (especially failed SSK 
requests) even slower.
- I am quite confident that Moore's Law DOES NOT hold for upload bandwidth.
- As far as I can see, all the benefits of NLM re attackers are achieved by 
fair sharing.
- A system which is more resistant to attacks but slower probably isn't all 
that interesting if the attacks in question are relatively expensive anyway.
- Smaller block sizes would have a significant efficiency cost, and would 
probably make load management more difficult.

I apologise if you see this as rather a broadside after I encouraged you to 
analyse the problem, but it is not yet a convincing demonstration that queueing 
is a viable strategy and not the spawn of satan! :)

On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
 On Monday, 29 August 2011, 14:32:13, Ian Clarke wrote:
  Yes, small tweaks have worked so well for us for the last decade, leaving us
  pretty-much where we were in 2003.  No, we don't understand how the current
  system works, there is no point in trying to fix something when we don't
  even know what is broken.
 
 I’d like to present a clue what is broken in NLM. Before I kill you with the 
 log, here’s the result: 
 
   With NLM the latency of a request is a function of the raw bandwidth 
   (not so with OLM), and NLM used only half my bandwidth after it had 
   been deployed for 2 days (at the start much more).
 
 τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue time 
 (time to find the node), nlm: new load management, olm: old load management.
 
 So first step: make sure all bandwidth gets used - maybe by allocating more 
 slots till we use all allowed bandwidth. Better having to throttle a transfer 
 than not using bandwidth.
 
 *NLM should with the current network be slower than OLM by 23%. But in 18 
 months it should actually be faster by ~8% — given Moores Law holds for 
 upload 
 bandwidth — because the routes are shorter.*
 
 The main advantage of NLM is, that it should be much more resilient against 
 attackers (DoS).
 
 Now to the log - it’s math and not cleaned up; you have been warned :)
 
 ArneBab SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf. 
 ArneBab queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, 
 w: success rate
 ArneBab ψs = τ(h) + q(h)
 ArneBab ψf = q(h)
 ArneBab ψ ~ w₁·ψs + (1-w₁)·ψf
 ArneBab σs = τ(h) + q(h)
 ArneBab σf = q(h)
 ArneBab σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
 ArneBab num(ψ) / num(σ) ~ 1
 ArneBab → time ~ σ + ψ
 ArneBab q(h) depends on timeouts, as do w₁ and w₂
 ArneBab time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + (1-w₂)·ψf
 ArneBab = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-
 w₂)·q(h)
 ArneBab = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
 ArneBab = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
 ArneBab in the congestion case q(h) ~ timeout
 ArneBab timeout = o
 ArneBab timeout: o
 ArneBab w depends on the timeout *somehow*, but inversely
 ArneBab o=0 → w=0
 ArneBab assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
 ArneBab assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
 ArneBab correction: in the congestion case: q(h) ~ min(timeout, τ(h))
 ArneBab timeout matters for q(h) only when timeout < τ(h)
 ArneBab I try to: I still need a dependency of w on timeout
 ArneBab … lets call it t(w)
 ArneBab better: w(o) :)
 toad_ well, if there is a timeout, we have a fixed time, but we reduce the 
 hops ...
 toad_ i thought w was success rate
 ArneBab ah! 
 ArneBab and the success rates where in the NLM stats
 ArneBab going mostly smoothly from 60% to 0%
 ArneBab for the HTL
 toad_ right, success rate peaks at 18 or sometimes 16
 toad_ what are w1 vs w2?
 toad_ chk vs ssk i guess
 ArneBab yes
 -*- toad_ thinks considering both is probably overambitious at this stage?
 ArneBab should not be too bad: SSKs drop much more rapidly at decreasing 
 hops
 ArneBab hops→HTL
 toad_ ψs is time for a successful chk; ψf is time for a failed chk ... in 
 which case h in the first instance is low, and in the second instance is h0
 ArneBab yes
 toad_ okay, i don't follow this line: time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + 
 (1-w₂)·ψf
 toad_ i thought w2 related to SSKs?
 ArneBab uh, yes…
 ArneBab time =  w₁·ψs + (1-w₁)·ψf +  w₂·σs + (1-w₂)·σf
 toad_ you have to appreciate i'm only just getting back into maths and 
 physics after 12 years ...
 toad_ 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 20:32:13 Ian Clarke wrote:
 On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland t...@amphibian.dyndns.org
  wrote:
 
  On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
   Ok, thinking about it, here is a proposal, or  rather, the beginning of a
   proposal.  I'm assuming that we get rid of NLM, fair sharing, and
  anything
   else intended to control load, and replace it with this.  We will
  absolutely
   need to simulate this before we write a single line of code to deploy it.
 
  I have justified fair sharing in my other reply. However alternate
  mechanisms might replace it.
 
  I agree we should simulate, although this might be significant work.
 
 
 I doubt it would be more work than trying it without simulation and failing
 for a decade, which was our previous approach.
 
 
   The core idea is that a node will include a floating point number in
   response to any kind of request showing how close that node is to being
   overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
   completely saturated and must reject requests.  Clearly the goal is to
  avoid
   getting anywhere near to 1.0.
  
   A node tracks several things:
 
  All of these apply to both locally originated requests and remotely
  originated requests? And are they for only our direct peers?
 
 
 Every peer along the reply path will take note of the load value, so it will
 be a mixture of the loads of close-by peers, and more distant peers.
  Obviously, since we are more likely to receive messages from peers that are
 closer, our average will automatically be biased towards closer peers.

So it is from the data source? Or every node along the way?
 
 - What is the average load reported by responses this node forwarded,
  per
  remote node
 
  Ahhh, this one could be interesting - you could use it to penalise nodes
  which spam excessively.
 
 Exactly.
 
   I think, given these metrics, we should be able to do the following:
  
  - Limit our overall rate of initiating local requests based on the
  global
  average reported load
 
  We can already do this on the basis of AIMD's, based purely on behaviour of
  local requests. However, taking more data into account might result in more
  accurate results - although it might also dilute the feedback.
 
 Right, AIMD tends to control the rate at which a specific event occurs, like
 a dropped IP packet with TCP, or a dropped request as we currently use it.
  This is a bit like signaling that you have a headache by banging your head
 against the wall.  It is also a rather coarse way to deliver feedback.

Fair point.
 
 A floating point load number conveys more information in a non-destructive
 way.
 
 In any case, this part is safe: it's not going to cause feedback problems,
  on its own. Conversely, it doesn't deal with problems such as network
  bottlenecks.
 
 Well, it depends.  If we are tracking load on a per-outbound-connection
 basis then we can potentially take action to manage load per-connection in
 addition to overall.

Not sure I follow here. Routing by load is bad, unless it is severely limited.
 
  However, while AIMD's are based on the *global* average load (or rather the
  load of all the nodes along the chain), it sounds like you are proposing to
  only report the *local* load of our direct peers? How are you going to
  compute the global average load?
 
 No, its not the local load of our direct peers.  Its the load reported in
 all reply messages that we forward, some of those will be from our direct
 peers, but most will be from remote peers.
 
  - Limit our rate of local requests based on the average load of the
  connection to the peer it would need to be forwarded to
 
  Not sure I follow here. Are you proposing that we initiate more requests in
  keyspace areas where there is lower load?
 
 Well, kinda, except the opposite.  We might reduce requests in keyspaces
 where there is higher load.
 
 
  - Detect when remote peers are abusing the system by disregarding load
  -
  as evidenced by a significantly higher average load in replies
  forwarded to
  them
 
  Okay, that is useful - although I doubt it would be an unambiguous
  detection, and what would we do about it? We would have to reject requests,
  and if we're not sure, or if it could be a downstream troublemaker, couldn't
  that result in misrouting and make things worse? On the other hand, any
  malicious node would be most easily detected by the nodes directly connected
  to it, so this could work much the same way as fair sharing - provided that
  it can reject or terminate the requests.
 
 No, we probably wouldn't start rejecting requests - more likely we would
 disconnect from that node.  Obviously we'd need to be fairly sure of abusive
 behavior before doing that - but abuse should be rare.

Well then we'd need to be very sure, and this means you could get away with a 
lot before triggering it. Weird topologies for instance could cause 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Arne Babenhauserheide
After discussing and going deeper into this, it became apparent that the
problem is not overload of queues. I’ll repeat the result I came to here (and
not in chatlog form :) ):

The problem is in the load limiter:


1) we need to use all bandwidth, because latency depends on the number of
finishing transfers.

2) SSKs fail 85% of the time.

3) the load limiter assumes that all SSKs succeed and reserves bandwidth for
them.

4) this now bites us, because NLM has shorter transfer times and longer wait
times. So the over-reservation of bandwidth for SSKs lasts longer.

5) solution: count each SSK as only
average_SSK_success_rate * data_to_transfer_on_success.


To (1): When we have less bandwidth, fewer CHKs succeed per second, so
fewer slots get freed and the queueing time is bigger.
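
A hedged sketch of what (5) could look like inside the output bandwidth
liability check (class and method names are invented for illustration; the
block sizes and success rates are placeholders, not the node's real constants):

CHK_BYTES = 32 * 1024   # payload of a successful CHK (illustrative)
SSK_BYTES = 1024        # payload of a successful SSK (illustrative)


class OutputBandwidthLiability:
    """Toy version of the accept/reject decision in the load limiter."""

    def __init__(self, output_bytes_per_second, window_seconds=90):
        self.capacity = output_bytes_per_second * window_seconds
        self.reserved = 0.0

    def reservation_for(self, is_ssk, ssk_success_rate=0.15):
        if is_ssk:
            # Old behaviour: reserve SSK_BYTES for every accepted SSK.
            # Proposed behaviour (5): weight the reservation by the observed
            # success rate, since ~85% of SSKs never transfer any data.
            return ssk_success_rate * SSK_BYTES
        return CHK_BYTES

    def try_accept(self, is_ssk):
        cost = self.reservation_for(is_ssk)
        if self.reserved + cost > self.capacity:
            return False      # accepting would overcommit the liability window
        self.reserved += cost
        return True


if __name__ == "__main__":
    limiter = OutputBandwidthLiability(output_bytes_per_second=20 * 1024)
    accepted = sum(limiter.try_accept(is_ssk=True) for _ in range(20000))
    print(f"SSKs accepted before the limiter saturates: {accepted}")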

Best wishes,
Arne



On Monday, 29 August 2011, 22:27:47, Matthew Toseland wrote:
 Okay, I don't understand most of that, I might be able to check the math if
 it was written properly, but it looks difficult. However, as far as I can
 see: - The most obvious way to increase bandwidth usage would be to
 increase the timeout time for output bandwidth liability (and at the same
 time increase the relevant block transfer timeouts). - This would increase
 the number of slots but it would also increase the number of requests
 seeking them; I don't see why it would help matters. - Running an excessive
 number of requests without adjusting the block transfer timeouts would
 result in some of the transfers timing out. - It would also, as you
 mention, make SSK requests (especially failed SSK requests) even slower. -
 I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. -
 As far as I can see, all the benefits of NLM re attackers are achieved by
 fair sharing. - A system which is more resistant to attacks but slower
 probably isn't all that interesting if the attacks in question are
 relatively expensive anyway. - Smaller block sizes would have a significant
 efficiency cost, and would probably make load management more difficult.

 I apologise if you see this as rather a broadside after I encouraged you to
 analyse the problem, but it is not yet a convincing demonstration that
 queueing is a viable strategy and not the spawn of satan! :)
 On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
  On Monday, 29 August 2011, 14:32:13, Ian Clarke wrote:
   Yes, small tweaks have worked so well for us for the last decade,
   leaving us pretty-much where we were in 2003.  No, we don't
   understand how the current system works, there is no point in
   trying to fix something when we don't even know what is broken.
 
  I’d like to present a clue what is broken in NLM. Before I kill you with
  the
  log, here’s the result:
  With NLM the latency of a request is a function of the raw bandwidth
  (not so with OLM), and NLM used only half my bandwidth after it had
  been deployed for 2 days (at the start much more).
 
  τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue
  time (time to find the node), nlm: new load management, olm: old load
  management.
 
  So first step: make sure all bandwidth gets used - maybe by allocating
  more slots till we use all allowed bandwidth. Better having to throttle
  a transfer than not using bandwidth.
 
  *NLM should with the current network be slower than OLM by 23%. But in
  18
  months it should actually be faster by ~8% — given Moores Law holds for
  upload bandwidth — because the routes are shorter.*
 
  The main advantage of NLM is, that it should be much more resilient
  against attackers (DoS).
 
  Now to the log - it’s math and not cleaned up; you have been warned :)
 
  ArneBab SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf.
  ArneBab queue-time: q, transfer-time: τ, hops remaining: h, total
  hops: h₀, w: success rate
  ArneBab ψs = τ(h) + q(h)
  ArneBab ψf = q(h)
  ArneBab ψ ~ w₁·ψs + (1-w₁)·ψf
  ArneBab σs = τ(h) + q(h)
  ArneBab σf = q(h)
  ArneBab σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
  ArneBab num(ψ) / num(σ) ~ 1
  ArneBab → time ~ σ + ψ
  ArneBab q(h) depends on timeouts, as do w₁ and w₂
  ArneBab time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + (1-w₂)·ψf
  ArneBab = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-
  w₂)·q(h)
  ArneBab = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
  ArneBab = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
  ArneBab in the congestion case q(h) ~ timeout
  ArneBab timeout = o
  ArneBab timeout: o
  ArneBab w depends on the timeout *somehow*, but inversely
  ArneBab o=0 → w=0
  ArneBab assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
  ArneBab assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
  ArneBab correction: in the congestion case: q(h) ~ min(timeout, τ(h))
  ArneBab timeout matters for q(h) only when timeout < τ(h)
  ArneBab I try to: I still need a dependency of w on timeout
  ArneBab … lets call it t(w)
  ArneBab better: w(o) :)
  toad_ well, if there is a timeout, we have a fixed time, but we reduce
  

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:
 After discussing and going deeper into this, it became apparent that the 
 problem is not overload of queues. I’ll repeat the result I came to here (and 
 not in chatlog form :) ): 
 
 The problem is in the load limiter: 
 
 
 1) we need to use all bandwidth, because latency depends on the number of  
   finishing transfers.
 
 2) SSKs fail 85% of the time.
 
 3) the load limiter assumes that all SSKs succeed and reserves bandwidth for 
   them.
 
 4) this now bites us, because NLM has shorter transfer times and longer wait 
   times. So the over-reservation of bandwidth for SSKs lasts longer. 
 
 5) solution: count each SSK as only 
   average_SSK_success_rate * data_to_transfer_on_success.
 
 
 To (1): When we have less bandwidth, fewer CHKs succeed per second, so 
 fewer slots get freed and the queueing time is bigger.

Another way to look at this:

Both Ian and I have been assuming that the resource is some number 
representing a property of the node, for example, how many requests are 
running. But the real resource is the external bandwidth usage, which is rather 
different.

An alternative solution - if we are worried about bursts of requests succeeding 
and therefore timing out - is to allocate a fixed chunk of bandwidth for SSKs, 
and not let them interfere with CHKs. Then CHKs can have long transfer times, 
and therefore long queueing times when they fail (but this is only ~ 40% of the 
time with NLM's accurate routing), and SSKs can have short transfer times, and 
therefore short queueing times. This may be less disruptive and should also be 
relatively easy to implement.
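
A minimal sketch of that alternative, assuming a simple static split of the
liability budget between the two request types (the 90/10 split and the block
sizes are arbitrary placeholders):

class SplitBandwidthLimiter:
    """Separate output-bandwidth pools for CHKs and SSKs, so a burst of cheap,
    mostly-failing SSKs cannot hold up CHK transfers and vice versa."""

    def __init__(self, output_bytes_per_second, window_seconds=90, ssk_fraction=0.1):
        total = output_bytes_per_second * window_seconds
        self.capacity = {"CHK": total * (1.0 - ssk_fraction), "SSK": total * ssk_fraction}
        self.reserved = {"CHK": 0.0, "SSK": 0.0}
        self.block_bytes = {"CHK": 32 * 1024, "SSK": 1024}

    def try_accept(self, request_type):
        cost = self.block_bytes[request_type]
        if self.reserved[request_type] + cost > self.capacity[request_type]:
            return False
        self.reserved[request_type] += cost
        return True

    def transfer_done(self, request_type):
        self.reserved[request_type] -= self.block_bytes[request_type]


if __name__ == "__main__":
    limiter = SplitBandwidthLimiter(output_bytes_per_second=20 * 1024)
    print(limiter.try_accept("CHK"), limiter.try_accept("SSK"))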

But the other question is, can queueing ever be helpful? It can if it allows us 
to route more accurately (which NLM clearly does), and/or to run enough 
requests in parallel that the longer time taken for the request to reach its 
destination is offset. Is this condition met?
 
 Best wishes, 
 Arne
 
 
 
  On Monday, 29 August 2011, 22:27:47, Matthew Toseland wrote:
  Okay, I don't understand most of that, I might be able to check the math if
  it was written properly, but it looks difficult. However, as far as I can
  see: - The most obvious way to increase bandwidth usage would be to
  increase the timeout time for output bandwidth liability (and at the same
  time increase the relevant block transfer timeouts). - This would increase
  the number of slots but it would also increase the number of requests
  seeking them; I don't see why it would help matters. - Running an excessive
  number of requests without adjusting the block transfer timeouts would
  result in some of the transfers timing out. - It would also, as you
  mention, make SSK requests (especially failed SSK requests) even slower. -
  I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. -
  As far as I can see, all the benefits of NLM re attackers are achieved by
  fair sharing. - A system which is more resistant to attacks but slower
  probably isn't all that interesting if the attacks in question are
  relatively expensive anyway. - Smaller block sizes would have a significant
  efficiency cost, and would probably make load management more difficult.
  
  I apologise if you see this as rather a broadside after I encouraged you to
  analyse the problem, but it is not yet a convincing demonstration that
  queueing is a viable strategy and not the spawn of satan! :)
  On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
    On Monday, 29 August 2011, 14:32:13, Ian Clarke wrote:
Yes, small tweaks have worked so well for us for the last decade,
leaving us pretty-much where we were in 2003.  No, we don't
understand how the current system works, there is no point in
trying to fix something when we don't even know what is broken.
   
   I’d like to present a clue what is broken in NLM. Before I kill you with
   the 
   log, here’s the result:
 With NLM the latency of a request is a function of the raw bandwidth
 (not so with OLM), and NLM used only half my bandwidth after it had
 been deployed for 2 days (at the start much more).
   
   τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue
   time (time to find the node), nlm: new load management, olm: old load
   management.
   
   So first step: make sure all bandwidth gets used - maybe by allocating
   more slots till we use all allowed bandwidth. Better having to throttle
   a transfer than not using bandwidth.
   
   *NLM should with the current network be slower than OLM by 23%. But in
   18
   months it should actually be faster by ~8% — given Moores Law holds for
   upload bandwidth — because the routes are shorter.*
   
   The main advantage of NLM is, that it should be much more resilient
   against attackers (DoS).
   
   Now to the log - it’s math and not cleaned up; you have been warned :)
   
   ArneBab SSK-time: σ, CHK-time: ψ, success: Xs, 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Tuesday 30 Aug 2011 00:18:35 Matthew Toseland wrote:
 On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:
  After discussing and going deeper into this, it became apparent that the 
  problem is not overload of queues. I’ll repeat the result I came to here 
  (and 
  not in chatlog form :) ): 
  
  The problem is in the load limiter: 
  
  
  1) we need to use all bandwidth, because latency depends on the number of  
  finishing transfers.
  
  2) SSKs fail 85% of the time.
  
  3) the load limiter assumes that all SSKs succeed and reserves bandwidth 
  for 
  them.
  
  4) this now bites us, because NLM has shorter transfer times and longer 
  wait 
  times. So the over-reservation of bandwidth for SSKs lasts longer. 
  
  5) solution: count each SSK as only 
  average_SSK_success_rate * data_to_transfer_on_success.
  
  
  To (1): When we have less bandwidth, fewer CHKs succeed per second, so 
  fewer slots get freed and the queueing time is bigger.
 
 Another way to look at this:
 
 Both me and Ian have been assuming that the resource is some number 
 representing a property of the node, for example, how many requests are 
 running. But the real resource is the external bandwidth usage, which is 
 rather different.
 
 An alternative solution - if we are worried about bursts of requests 
 succeeding and therefore timing out - is to allocate a fixed chunk of 
 bandwidth for SSKs, and not let them interfere with CHKs. Then CHKs can have 
 long transfer times, and therefore long queueing times when they fail (but 
 this is only ~ 40% of the time with NLM's accurate routing), and SSKs can 
 have short transfer times, and therefore short queueing times. This may be 
 less disruptive and should also be relatively easy to implement.
 
 But the other question is, can queueing ever be helpful? It can if it allows 
 us to route more accurately (which NLM clearly does), and/or to run enough 
 requests in parallel that the longer time taken for the request to reach its 
 destination is offset. Is this condition met?

Yes it is!

If a request takes P seconds without queueing, and P+Q seconds with queueing, 
then if we have the same number of requests running at once, we have a slowdown 
of (P+Q)/P. However, if we start (P+Q)/P times more requests, and the queueing 
time remains constant, we can compensate for this. 

Does it? 

There are two main components:
1) The queueing time. This is roughly proportional to (number of requests 
running) / (number of slots available). If we increase the capacity of the 
node, we multiply both the top and bottom by the same factor - so the queueing 
time is unchanged, unless the data transfer time is changed.
2) The data transfer time. In the above case with a slowdown, the data transfer 
time is the same, or a little shorter, because most of our bandwidth isn't 
being used. If we increase it by running more requests at once, it should reach 
where it was, and if we increase it further, transfers will start to take 
longer again.

ArneBab has analysed this in more detail and shown that the largest component 
of the queueing time is proportional to the data transfer time, hence, 
separating SSKs (which have a very short transfer time) from CHKs (which have a 
much longer transfer time) would help significantly.
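
A small worked example of that argument, with invented numbers (P and Q here
are placeholders, not measured times):

# If queueing adds Q seconds to a P-second request, each request takes
# (P+Q)/P times longer, so running (P+Q)/P times as many requests in parallel
# keeps throughput constant - provided the queueing time itself does not grow,
# which holds if requests running and slots available scale together.
P = 60.0   # transfer (service) time of a request, seconds
Q = 30.0   # queueing time, seconds

slowdown = (P + Q) / P            # 1.5x longer per request
parallelism_needed = slowdown     # run 1.5x as many requests at once

print(f"per-request slowdown: {slowdown:.2f}x; "
      f"compensating parallelism: {parallelism_needed:.2f}x")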

The key advantage of NLM, of course, is that it routes accurately, resulting in 
10%-30% improvements in success rates, and transferring data over significantly 
fewer hops (= more effective bandwidth).
  
  Best wishes, 
  Arne
  
  
  
   On Monday, 29 August 2011, 22:27:47, Matthew Toseland wrote:
   Okay, I don't understand most of that, I might be able to check the math 
   if
   it was written properly, but it looks difficult. However, as far as I can
   see: - The most obvious way to increase bandwidth usage would be to
   increase the timeout time for output bandwidth liability (and at the same
   time increase the relevant block transfer timeouts). - This would increase
   the number of slots but it would also increase the number of requests
   seeking them; I don't see why it would help matters. - Running an 
   excessive
   number of requests without adjusting the block transfer timeouts would
   result in some of the transfers timing out. - It would also, as you
   mention, make SSK requests (especially failed SSK requests) even slower. -
   I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. 
   -
   As far as I can see, all the benefits of NLM re attackers are achieved by
   fair sharing. - A system which is more resistant to attacks but slower
   probably isn't all that interesting if the attacks in question are
   relatively expensive anyway. - Smaller block sizes would have a 
   significant
   efficiency cost, and would probably make load management more difficult.
   
   I apologise if you see this as rather a broadside after I encouraged you 
   to
   analyse the problem, but it is not yet a 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Arne Bab.
From: Matthew Toseland t...@amphibian.dyndns.org
But the other question is, can queueing ever be helpful? It can if it allows 
us to route more accurately (which NLM clearly does), and/or to run enough 
requests in parallel that the longer time taken for the request to reach its 
destination is offset. Is this condition met?

Experience with the deployed NLM showed that even in the fully congested case 
it had success rates of 60% for HTL 18,17 and 16, compared to less than 40% for 
OLM. This means that the requests are sent over fewer hops on average, because they 
find the content fewer hops away from the requester.

A download of 1MiB which is sent over 2 hops needs 2 MiB in total network 
bandwidth.
If it is sent over only 1.5 hops on average, then it needs only 1.5 MiB total 
network bandwidth.

So essentially NLM can distribute 30% more content with the same network 
resources¹. And these numbers are actual observations. The only reason why this 
did not result in increased performance is that the nodes used less than 50% of 
their allocated bandwidth² - which is a problem with the bandwidth scheduler 
and not with queueing.

Best wishes,
Arne

¹: The relevant network resource is upload bandwidth.
²: Source: observations from me and two other freenet users.

PS: How exactly the bandwidth limiter is fixed is an implementation detail. I 
think you are actually the only person who can judge how to do this most 
efficiently.


