Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-31 Thread Arne Babenhauserheide
On Wednesday 31 Aug 2011 13:25:35 Matthew Toseland wrote:
> On Tuesday 30 Aug 2011 21:02:54 Arne Babenhauserheide wrote:
> > What this means: if an SSK has a mean success rate of 0.16, then using
> > 0.25 as its value makes sure that 95% of the possible cases don’t exhaust
> > the bandwidth. We then use only 64% of the reserved bandwidth on average,
> > though. With 0.2, we’d have 68% of the possible distributions safe and use
> > 80% of the bandwidth on average.
> Only if there are no clusters - a single requestor fetches a bunch of stuff
> that is all already there, rather than polling keys that usually aren't.
> IMHO there will be. For instance, when a new chat client is started up, a
> lot of what it does will be fetching existing messages rather than polling
> for new ones.

But these are in cache, so the routes will be very short. And SSKs with very
high HTL (18, 17, 16) have good success rates, so this change will affect them
much less than low-HTL SSKs (the unsuccessful ones). After HTL 15 or so, their
success rate drops massively - which is quite easy to explain: the content
mostly isn’t there.

Simply take the success-rate per HTL - as in the statistics window.
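
A minimal sketch of that per-HTL weighting (all names and numbers here are
illustrative assumptions, not taken from the node):

# Charge an SSK against the bandwidth reserve in proportion to the observed
# success rate at its current HTL, instead of assuming it will succeed.
SSK_TRANSFER_SIZE = 1024  # bytes an SSK transfers on success (assumed)

# Hypothetical per-HTL success rates, as read off the statistics window.
success_rate_by_htl = {18: 0.6, 17: 0.6, 16: 0.55, 15: 0.3}

def reserved_bytes(htl):
    rate = success_rate_by_htl.get(htl, 0.1)  # low default for low HTLs
    return rate * SSK_TRANSFER_SIZE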

Best wishes,
Arne



Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-31 Thread Matthew Toseland
On Monday 29 Aug 2011 20:34:01 Ian Clarke wrote:
> On Mon, Aug 29, 2011 at 2:21 PM, Matthew Toseland wrote:
> 
> > > >- What is the average load reported by responses this node forwarded,
> > > >per remote node
> > >
> > > Ahhh, this one could be interesting - you could use it to penalise nodes
> > > which spam excessively.
> >
> > Actually, thinking about it, I don't think this will work. Requests are
> > usually distributed evenly across the keyspace, and downstream nodes only
> > know they are from this node - not which previous node they are from. So the
> > pain will be shared across all the peers sending us requests, including our
> > own local requests, and won't be identifiable as due to any single node. You
> > have to do it by volume, not by reported load.
> 
> But if a node is being abusive, won't the nodes it's requesting from tend to
> have a higher than average load?

If its requests differ from the specialisation of the node in general, and so
are routed to a particular subset of peers, then those peers' loads might be
noticeably different. But this is not necessarily true if it is a deliberate
attack.

> Perhaps you are right though, but this is
> the kind of thought-process we need.

To prevent a node from using an excessive proportion of our capacity, we need 
to *measure the proportion of our capacity that the node uses*. This much is 
straightforward IMHO. And that's the basis of fair sharing between peers.
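
For concreteness, a minimal sketch of that measurement (illustrative only, not
the actual fair-sharing code):

from collections import defaultdict

class CapacityShare:
    """Track the fraction of our capacity each peer is using (sketch)."""

    def __init__(self):
        self.used = defaultdict(float)  # peer id -> bytes spent on its requests

    def account(self, peer, nbytes):
        self.used[peer] += nbytes

    def share(self, peer):
        total = sum(self.used.values())
        return self.used[peer] / total if total else 0.0

    def over_fair_share(self, peer, n_peers, slack=2.0):
        # Flag a peer consuming far more than an equal split would allow.
        return self.share(peer) > slack / n_peers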



Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-31 Thread Matthew Toseland
On Tuesday 30 Aug 2011 21:02:54 Arne Babenhauserheide wrote:
> On Tuesday 30 Aug 2011 01:08:16 Arne Babenhauserheide wrote:
> > 5) solution: count each SSK as only
> > average_SSK_success_rate * data_to_transfer_on_success.
> 
> Some more data: 
> 
> chances of having at least this many successful transfers for 40 SSKs with a 
> mean success rate of 16%: 
> 
> for i in {0..16}; do echo $i $(./spielfaehig.py 0.16 40 $i); done
> 
> 0 1.0
> 1 0.999064224991
> 2 0.99193451064
> 3 0.965452714478
> 4 0.901560126912
> 5 0.788987472629
> 6 0.634602118184
> 7 0.463062835467
> 8 0.304359825607
> 9 0.179664603573
> 10 0.0952149293922
> 11 0.0453494074947
> 12 0.0194452402752
> 13 0.00752109980912
> 14 0.0026291447461
> 15 0.000832100029072
> 16 0.00023879002726
> 
> What this means: if an SSK has a mean success rate of 0.16, then using 0.25
> as its value makes sure that 95% of the possible cases don’t exhaust the
> bandwidth. We then use only 64% of the reserved bandwidth on average, though.
> With 0.2, we’d have 68% of the possible distributions safe and use 80% of the
> bandwidth on average.

Only if there are no clusters - a single requestor fetches a bunch of stuff 
that is all already there, rather than polling keys that usually aren't. IMHO 
there will be. For instance, when a new chat client is started up, a lot of 
what it does will be fetching existing messages rather than polling for new 
ones.



Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-30 Thread Arne Babenhauserheide
On Tuesday 30 Aug 2011 01:08:16 Arne Babenhauserheide wrote:
> 5) solution: count each SSK as only
>   average_SSK_success_rate * data_to_transfer_on_success.

Some more data:

chances of having at least this many successful transfers for 40 SSKs with a
mean success rate of 16%:

for i in {0..16}; do echo $i $(./spielfaehig.py 0.16 40 $i); done

0 1.0
1 0.999064224991
2 0.99193451064
3 0.965452714478
4 0.901560126912
5 0.788987472629
6 0.634602118184
7 0.463062835467
8 0.304359825607
9 0.179664603573
10 0.0952149293922
11 0.0453494074947
12 0.0194452402752
13 0.00752109980912
14 0.0026291447461
15 0.000832100029072
16 0.00023879002726

What this means: if an SSK has a mean success rate of 0.16, then using 0.25 as
its value makes sure that 95% of the possible cases don’t exhaust the
bandwidth. We then use only 64% of the reserved bandwidth on average, though.
With 0.2, we’d have 68% of the possible distributions safe and use 80% of the
bandwidth on average.

Note: this is just the binomial distribution:

from math import factorial
fac = factorial

def nük(n, k):
    """Binomial coefficient: n choose k."""
    if k > n: return 0
    return fac(n) // (fac(k) * fac(n - k))

def binom(p, n, k):
    """Probability of exactly k successes out of n, each with success rate p."""
    return nük(n, k) * p**k * (1 - p)**(n - k)

def spielfähig(p, n, min_spieler):
    """Probability of at least min_spieler successes out of n."""
    return sum(binom(p, n, k) for k in range(min_spieler, n + 1))
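
For illustration, the trade-off above can be recomputed with these functions (a
sketch; the exact "safe" percentage shifts a few points depending on whether
reaching the reserve exactly already counts as exhausting it):

# n = 40 concurrent SSKs with mean success rate p = 0.16, as above.
# Reserving bandwidth for k transfers is exhausted once more than k succeed;
# the reserve's mean utilisation is p*n/k.
p, n = 0.16, 40
for k in (8, 10):  # i.e. 0.2*40 and 0.25*40
    safe = 1 - spielfähig(p, n, k + 1)
    print(k, round(safe, 3), round(p * n / k, 2))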


→ USK@6~ZDYdvAgMoUfG6M5Kwi7SQqyS-gTcyFeaNN1Pf3FvY,OSOT4OEeg4xyYnwcGECZUX6~lnmYrZsz05Km7G7bvOQ,AQACAAE/bab/9/Content-D426DC7.html


Best wishes,
Arne


Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Arne Bab.
Von: "Matthew Toseland"    
>But the other question is, can queueing ever be helpful? It can if it allows 
>us to route more accurately (which NLM clearly does), and/or to run enough 
>requests in parallel that the longer time taken for the request to reach its 
>destination is offset. Is this condition met?

Experience with the deployed NLM showed that even in the fully congested case
it had success rates of 60% for HTL 18, 17 and 16, compared to less than 40%
for OLM. This means that requests are sent over fewer hops on average, because
they find the content fewer hops away from the requester.

A download of 1 MiB that is sent over 2 hops consumes 2 MiB of total network
bandwidth.
If it is sent over only 1.5 hops on average, it needs only 1.5 MiB of total
network bandwidth.
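
The same arithmetic in code (the total-capacity figure is an assumption for
illustration):

# Total network cost of a transfer = size sent x hops it travels over.
def network_cost_mib(size_mib, hops):
    return size_mib * hops

capacity_mib = 300.0  # total upload the network can spend (assumed figure)
for hops in (2.0, 1.5):
    downloads = capacity_mib / network_cost_mib(1.0, hops)
    print(hops, downloads)  # 2.0 hops -> 150 downloads, 1.5 -> 200: ~33% more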

So essentially NLM can distribute about 30% more content with the same network
resources¹ (2 MiB / 1.5 MiB ≈ 1.33). And these numbers are actual observations.
The only reason this did not result in increased performance is that the nodes
used less than 50% of their allocated bandwidth² - which is a problem with the
bandwidth scheduler and not with queueing.

Best wishes,
Arne

¹: The relevant network resource is upload bandwidth.
²: Source: observations from me and two other freenet users.

PS: How exactly the bandwidth limiter is fixed is an implementation detail. I 
think you are actually the only person who can judge how to do this most 
efficiently.


Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Tuesday 30 Aug 2011 00:18:35 Matthew Toseland wrote:
> On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:
> > After discussing and going deeper into this, it became apparent that the 
> > problem is not overload of queues. I’ll repeat the result I came to here 
> > (and 
> > not in chatlog form :) ): 
> > 
> > The problem is in the load limiter: 
> > 
> > 
> > 1) we need to use all bandwidth, because latency depends on the number of  
> > finishing transfers.
> > 
> > 2) SSKs fail 85% of the time.
> > 
> > 3) the load limiter assumes that all SSKs succeed and reserves bandwidth 
> > for 
> > them.
> > 
> > 4) this now bites us, because NLM has shorter transfer times and longer 
> > wait 
> > times. So the over-reservation of bandwidth for SSKs lasts longer. 
> > 
> > 5) solution: count each SSK as only 
> > average_SSK_success_rate * data_to_transfer_on_success.
> > 
> > 
> > To (1): When we have less bandwidth, then fewer CHKs succeed per second, so
> > fewer slots get freed and the queueing time is bigger.
> 
> Another way to look at this:
> 
> Both me and Ian have been assuming that the resource is some number 
> representing a property of the node, for example, how many requests are 
> running. But the real resource is the external bandwidth usage, which is 
> rather different.
> 
> An alternative solution - if we are worried about bursts of requests 
> succeeding and therefore timing out - is to allocate a fixed chunk of 
> bandwidth for SSKs, and not let them interfere with CHKs. Then CHKs can have 
> long transfer times, and therefore long queueing times when they fail (but 
> this is only ~ 40% of the time with NLM's accurate routing), and SSKs can 
> have short transfer times, and therefore short queueing times. This may be 
> less disruptive and should also be relatively easy to implement.
> 
> But the other question is, can queueing ever be helpful? It can if it allows 
> us to route more accurately (which NLM clearly does), and/or to run enough 
> requests in parallel that the longer time taken for the request to reach its 
> destination is offset. Is this condition met?

Yes it is!

If a request takes P seconds without queueing, and P+Q seconds with queueing, 
then if we have the same number of requests running at once, we have a slowdown 
of (P+Q)/P. However, if we start (P+Q)/P times more requests, and the queueing 
time remains constant, we can compensate for this. 

Does it? 

There are two main components:
1) The queueing time. This is roughly proportional to (number of requests 
running) / (number of slots available). If we increase the capacity of the 
node, we multiply both the top and bottom by the same factor - so the queueing 
time is unchanged, unless the data transfer time is changed.
2) The data transfer time. In the above case with a slowdown, the data transfer 
time is the same, or a little shorter, because most of our bandwidth isn't 
being used. If we increase it by running more requests at once, it should reach 
where it was, and if we increase it further, transfers will start to take 
longer again.
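
A toy model of that argument (all figures assumed; it only restates the two
components above in code):

# Queueing time ~ (requests running / slots available) * transfer time,
# throughput ~ requests running / total request time.
def queue_time(running, slots, transfer_time):
    return (running / slots) * transfer_time

P, slots, running = 10.0, 100.0, 100.0   # seconds, slots, concurrent requests
Q = queue_time(running, slots, P)
base_throughput = running / (P + Q)

scale = (P + Q) / P                       # start this many times more requests
running2, slots2 = running * scale, slots * scale  # capacity scales with them
Q2 = queue_time(running2, slots2, P)      # unchanged: top and bottom scale
print(Q == Q2, (running2 / (P + Q2)) / base_throughput)  # True 2.0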

ArneBab has analysed this in more detail and shown that the largest component 
of the queueing time is proportional to the data transfer time, hence, 
separating SSKs (which have a very short transfer time) from CHKs (which have a 
much longer transfer time) would help significantly.

The key advantage of NLM, of course, is that it routes accurately, resulting in 
10%-30% improvements in success rates, and transferring data over significantly 
fewer hops (= more effective bandwidth).
> > 
> > Best wishes, 
> > Arne
> > 
> > 
> > 
> > On Monday 29 Aug 2011 22:27:47 Matthew Toseland wrote:
> > > Okay, I don't understand most of that, I might be able to check the math 
> > > if
> > > it was written properly, but it looks difficult. However, as far as I can
> > > see: - The most obvious way to increase bandwidth usage would be to
> > > increase the timeout time for output bandwidth liability (and at the same
> > > time increase the relevant block transfer timeouts). - This would increase
> > > the number of slots but it would also increase the number of requests
> > > seeking them; I don't see why it would help matters. - Running an 
> > > excessive
> > > number of requests without adjusting the block transfer timeouts would
> > > result in some of the transfers timing out. - It would also, as you
> > > mention, make SSK requests (especially failed SSK requests) even slower. -
> > > I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. 
> > > -
> > > As far as I can see, all the benefits of NLM re attackers are achieved by
> > > fair sharing. - A system which is more resistant to attacks but slower
> > > probably isn't all that interesting if the attacks in question are
> > > relatively expensive anyway. - Smaller block sizes would have a significant
> > > efficiency cost, and would probably make load management more difficult.

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:
> After discussing and going deeper into this, it became apparent that the 
> problem is not overload of queues. I’ll repeat the result I came to here (and 
> not in chatlog form :) ): 
> 
> The problem is in the load limiter: 
> 
> 
> 1) we need to use all bandwidth, because latency depends on the number of  
>   finishing transfers.
> 
> 2) SSKs fail 85% of the time.
> 
> 3) the load limiter assumes that all SSKs succeed and reserves bandwidth for 
>   them.
> 
> 4) this now bites us, because NLM has shorter transfer times and longer wait 
>   times. So the over-reservation of bandwidth for SSKs lasts longer. 
> 
> 5) solution: count each SSK as only 
>   average_SSK_success_rate * data_to_transfer_on_success.
> 
> 
> To (1): When we have less bandwidth, then fewer CHKs succeed per second, so
> fewer slots get freed and the queueing time is bigger.

Another way to look at this:

Both me and Ian have been assuming that the resource is some number 
representing a property of the node, for example, how many requests are 
running. But the real resource is the external bandwidth usage, which is rather 
different.

An alternative solution - if we are worried about bursts of requests succeeding 
and therefore timing out - is to allocate a fixed chunk of bandwidth for SSKs, 
and not let them interfere with CHKs. Then CHKs can have long transfer times, 
and therefore long queueing times when they fail (but this is only ~ 40% of the 
time with NLM's accurate routing), and SSKs can have short transfer times, and 
therefore short queueing times. This may be less disruptive and should also be 
relatively easy to implement.
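
A sketch of that split (the fractions are assumptions, not tuned values):

# Give SSKs their own slice of the output bandwidth so their short transfers
# never queue behind long CHK transfers.
TOTAL_OUTPUT_BPS = 50_000   # node output bandwidth limit (assumed)
SSK_FRACTION = 0.2          # slice reserved for SSKs (assumed)

pools = {"ssk": TOTAL_OUTPUT_BPS * SSK_FRACTION,
         "chk": TOTAL_OUTPUT_BPS * (1 - SSK_FRACTION)}
reserved = {"ssk": 0.0, "chk": 0.0}

def try_accept(kind, cost_bps):
    """Accept a request only if its own pool has room; pools don't interfere."""
    if reserved[kind] + cost_bps > pools[kind]:
        return False
    reserved[kind] += cost_bps
    return True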

But the other question is, can queueing ever be helpful? It can if it allows us 
to route more accurately (which NLM clearly does), and/or to run enough 
requests in parallel that the longer time taken for the request to reach its 
destination is offset. Is this condition met?
> 
> Best wishes, 
> Arne
> 
> 
> 
> On Monday 29 Aug 2011 22:27:47 Matthew Toseland wrote:
> > Okay, I don't understand most of that, I might be able to check the math if
> > it was written properly, but it looks difficult. However, as far as I can
> > see: - The most obvious way to increase bandwidth usage would be to
> > increase the timeout time for output bandwidth liability (and at the same
> > time increase the relevant block transfer timeouts). - This would increase
> > the number of slots but it would also increase the number of requests
> > seeking them; I don't see why it would help matters. - Running an excessive
> > number of requests without adjusting the block transfer timeouts would
> > result in some of the transfers timing out. - It would also, as you
> > mention, make SSK requests (especially failed SSK requests) even slower. -
> > I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. -
> > As far as I can see, all the benefits of NLM re attackers are achieved by
> > fair sharing. - A system which is more resistant to attacks but slower
> > probably isn't all that interesting if the attacks in question are
> > relatively expensive anyway. - Smaller block sizes would have a significant
> > efficiency cost, and would probably make load management more difficult.
> > 
> > I apologise if you see this as rather a broadside after I encouraged you to
> > analyse the problem, but it is not yet a convincing demonstration that
> > queueing is a viable strategy and not the spawn of satan! :)
> > On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> > On Monday 29 Aug 2011 14:32:13 Ian Clarke wrote:
> > > > Yes, small tweaks have worked so well for us for the last decade,
> > > > leaving us pretty-much where we were in 2003.  No, we don't
> > > > understand how the current system works, there is no point in
> > > > trying to fix something when we don't even know what is broken.
> > > 
> > > I’d like to present a clue about what is broken in NLM. Before I kill you
> > > with the log, here’s the result:
> > >   With NLM the latency of a request is a function of the raw bandwidth
> > >   (not so with OLM), and NLM used only half my bandwidth after it had
> > >   been deployed for 2 days (at the start much more).
> > > 
> > > τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue
> > > time (time to find the node), nlm: new load management, olm: old load
> > > management.
> > > 
> > > So first step: make sure all bandwidth gets used - maybe by allocating
> > > more slots till we use all allowed bandwidth. Better having to throttle
> > > a transfer than not using bandwidth.
> > > 
> > > *With the current network, NLM should be slower than OLM by 23%. But in
> > > 18 months it should actually be faster by ~8% — given Moore’s Law holds
> > > for upload bandwidth — because the routes are shorter.*
> > > 
> > > The main advantage of NLM is that it should be much more resilient
> > > against attackers (DoS).

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Arne Babenhauserheide
After discussing and going deeper into this, it became apparent that the
problem is not overload of queues. I’ll repeat the result I came to here (and
not in chatlog form :) ):

The problem is in the load limiter:


1) we need to use all bandwidth, because latency depends on the number of
finishing transfers.

2) SSKs fail 85% of the time.

3) the load limiter assumes that all SSKs succeed and reserves bandwidth for
them.

4) this now bites us, because NLM has shorter transfer times and longer wait
times. So the over-reservation of bandwidth for SSKs lasts longer.

5) solution: count each SSK as only
average_SSK_success_rate * data_to_transfer_on_success.


To (1): When we have less bandwidth, then fewer CHKs succeed per second, so
fewer slots get freed and the queueing time is bigger.
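
A sketch of (5) in code (the sizes are assumptions; only the SSK side changes,
CHKs keep worst-case accounting):

SSK_SUCCESS_RATE = 0.15       # ~85% of SSKs fail, per (2)
SSK_DATA_ON_SUCCESS = 1024    # bytes an SSK transfers on success (assumed)
CHK_DATA_ON_SUCCESS = 32768   # CHK block size (assumed)

def liability_bytes(is_ssk):
    """Bandwidth the load limiter reserves for one accepted request."""
    if is_ssk:
        # Expected cost instead of worst case: success rate * data on success.
        return SSK_SUCCESS_RATE * SSK_DATA_ON_SUCCESS
    return CHK_DATA_ON_SUCCESS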

Best wishes,
Arne



On Monday 29 Aug 2011 22:27:47 Matthew Toseland wrote:
> Okay, I don't understand most of that, I might be able to check the math if
> it was written properly, but it looks difficult. However, as far as I can
> see: - The most obvious way to increase bandwidth usage would be to
> increase the timeout time for output bandwidth liability (and at the same
> time increase the relevant block transfer timeouts). - This would increase
> the number of slots but it would also increase the number of requests
> seeking them; I don't see why it would help matters. - Running an excessive
> number of requests without adjusting the block transfer timeouts would
> result in some of the transfers timing out. - It would also, as you
> mention, make SSK requests (especially failed SSK requests) even slower. -
> I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. -
> As far as I can see, all the benefits of NLM re attackers are achieved by
> fair sharing. - A system which is more resistant to attacks but slower
> probably isn't all that interesting if the attacks in question are
> relatively expensive anyway. - Smaller block sizes would have a significant
> efficiency cost, and would probably make load management more difficult.
>
> I apologise if you see this as rather a broadside after I encouraged you to
> analyse the problem, but it is not yet a convincing demonstration that
> queueing is a viable strategy and not the spawn of satan! :)
> On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> > On Monday 29 Aug 2011 14:32:13 Ian Clarke wrote:
> > > Yes, small tweaks have worked so well for us for the last decade,
> > > leaving us pretty-much where we were in 2003.  No, we don't
> > > understand how the current system works, there is no point in
> > > trying to fix something when we don't even know what is broken.
> >
> > I’d like to present a clue about what is broken in NLM. Before I kill you
> > with the log, here’s the result:
> > With NLM the latency of a request is a function of the raw bandwidth
> > (not so with OLM), and NLM used only half my bandwidth after it had
> > been deployed for 2 days (at the start much more).
> >
> > τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue
> > time (time to find the node), nlm: new load management, olm: old load
> > management.
> >
> > So first step: make sure all bandwidth gets used - maybe by allocating
> > more slots till we use all allowed bandwidth. Better having to throttle
> > a transfer than not using bandwidth.
> >
> > *With the current network, NLM should be slower than OLM by 23%. But in 18
> > months it should actually be faster by ~8% — given Moore’s Law holds for
> > upload bandwidth — because the routes are shorter.*
> >
> > The main advantage of NLM is that it should be much more resilient
> > against attackers (DoS).
> >
> > Now to the log - it’s math and not cleaned up; you have been warned :)
> >
> >  SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf.
> >  queue-time: q, transfer-time: τ, hops remaining: h, total
> > hops: h₀, w: success rate
> >  ψs = τ(h) + q(h)
> >  ψf = q(h)
> >  ψ ~ w₁·ψs + (1-w₁)·ψf
> >  σs = τ(h) + q(h)
> >  σf = q(h)
> >  σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
> >  num(ψ) / num(σ) ~ 1
> >  → time ~ σ + ψ
> >  q(h) depends on timeouts, as do w₁ and w₂
> >  time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + (1-w₂)·ψf
> >  = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-
> > w₂)·q(h)
> >  = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
> >  = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
> >  in the congestion case q(h) ~ timeout
> >  timeout = o
> >  timeout: o
> >  w depends on the timeout *somehow*, but inversely
> >  o=0 → w=0
> >  assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
> >  assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
> >  correction: in the congestion case: q(h) ~ min(timeout, τ(h))
> >  timeout matters for q(h) only when timeout < τ(h)
> >  I try to: I still need a dependency of w on timeout
> >  … lets call it t(w)
> >  better: w(o) :)
> >  well, if there is a timeout, we have a fixed time, but we reduce
> > the hops ...
> >  i thought w was success rate

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 20:32:13 Ian Clarke wrote:
> On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland wrote:
> 
> > On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > > Ok, thinking about it, here is a proposal, or rather, the beginning of a
> > > proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
> > > else intended to control load, and replace it with this.  We will
> > > absolutely need to simulate this before we write a single line of code to
> > > deploy it.
> >
> > I have justified fair sharing in my other reply. However alternate
> > mechanisms might replace it.
> >
> > I agree we should simulate, although this might be significant work.
> >
> 
> I doubt it would be more work than trying it without simulation and failing
> for a decade, which was our previous approach.
> 
> 
> > > The core idea is that a node will include a floating point number in
> > > response to any kind of request showing how close that node is to being
> > > overloaded.  0.0 would mean it's doing nothing, 1.0 would mean that it is
> > > completely saturated and must reject requests.  Clearly the goal is to
> > > avoid getting anywhere near to 1.0.
> > >
> > > A node tracks several things:
> >
> > All of these apply to both locally originated requests and remotely
> > originated requests? And are they for only our direct peers?
> >
> 
> Every peer along the reply path will take note of the load value, so it will
> be a mixture of the loads of close-by peers, and more distant peers.
>  Obviously, since we are more likely to receive messages from peers that are
> "closer", our average will automatically be biased towards closer peers.

So it is from the data source? Or every node along the way?
> 
> > >- What is the average load reported by responses this node forwarded,
> > >per remote node
> >
> > Ahhh, this one could be interesting - you could use it to penalise nodes
> > which spam excessively.
> >
> Exactly.
> 
> > > I think, given these metrics, we should be able to do the following:
> > >
> > >- Limit our overall rate of initiating local requests based on the
> > >global average reported load
> >
> > We can already do this on the basis of AIMD's, based purely on behaviour of
> > local requests. However, taking more data into account might result in more
> > accurate results - although it might also dilute the feedback.
> 
> Right, AIMD tends to control the rate at which a specific event occurs, like
> a dropped IP packet with TCP, or a dropped request as we currently use it.
>  This is a bit like signaling that you have a headache by banging your head
> against the wall.  It is also a rather coarse way to deliver feedback.

Fair point.
> 
> A floating point load number conveys more information in a non-destructive
> way.
> 
> > In any case, this part is "safe": it's not going to cause feedback problems,
> > on its own. Conversely, it doesn't deal with problems such as network
> > bottlenecks.
> 
> Well, it depends.  If we are tracking load on a per-outbound-connection
> basis then we can potentially take action to manage load per-connection in
> addition to overall.

Not sure I follow here. Routing by load is bad, unless it is severely limited.
> 
> > However, while AIMD's are based on the *global* average load (or rather the
> > load of all the nodes along the chain), it sounds like you are proposing to
> > only report the *local* load of our direct peers? How are you going to
> > compute the global average load?
> 
> No, its not the local load of our direct peers.  Its the load reported in
> all reply messages that we forward, some of those will be from our direct
> peers, but most will be from remote peers.
> 
> > >- Limit our rate of local requests based on the average load of the
> > >connection to the peer it would need to be forwarded to
> >
> > Not sure I follow here. Are you proposing that we initiate more requests in
> > keyspace areas where there is lower load?
> 
> Well, kinda, except the opposite.  We might reduce requests in keyspaces
> where there is higher load.
> 
> 
> > >- Detect when remote peers are abusing the system by disregarding load -
> > >as evidenced by a significantly higher average load in replies forwarded
> > >to them
> >
> > Okay, that is useful - although I doubt it would be an unambiguous
> > detection, and what would we do about it? We would have to reject requests,
> > and if we're not sure, or if it could be a downstream troublemaker, couldn't
> > that result in misrouting and make things worse? On the other hand, any
> > malicious node would be most easily detected by the nodes directly connected
> > to it, so this could work much the same way as fair sharing - provided that
> > it can reject or terminate the requests.
> 
> No, we probably wouldn't start rejecting requests - more likely we would
> disconnect from that node.  Obviously we'd need to be fairly sure of abusive
> behavior before doing that - but abuse should be rare.

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
Okay, I don't understand most of that, I might be able to check the math if it 
was written properly, but it looks difficult. However, as far as I can see:
- The most obvious way to increase bandwidth usage would be to increase the 
timeout time for output bandwidth liability (and at the same time increase the 
relevant block transfer timeouts).
- This would increase the number of slots but it would also increase the number 
of requests seeking them; I don't see why it would help matters.
- Running an excessive number of requests without adjusting the block transfer 
timeouts would result in some of the transfers timing out.
- It would also, as you mention, make SSK requests (especially failed SSK 
requests) even slower.
- I am quite confident that Moore's Law DOES NOT hold for upload bandwidth.
- As far as I can see, all the benefits of NLM re attackers are achieved by 
fair sharing.
- A system which is more resistant to attacks but slower probably isn't all 
that interesting if the attacks in question are relatively expensive anyway.
- Smaller block sizes would have a significant efficiency cost, and would 
probably make load management more difficult.

I apologise if you see this as rather a broadside after I encouraged you to 
analyse the problem, but it is not yet a convincing demonstration that queueing 
is a viable strategy and not the spawn of satan! :)

On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> On Monday 29 Aug 2011 14:32:13 Ian Clarke wrote:
> > Yes, small tweaks have worked so well for us for the last decade, leaving us
> > pretty-much where we were in 2003.  No, we don't understand how the current
> > system works, there is no point in trying to fix something when we don't
> > even know what is broken.
> 
> I’d like to present a clue about what is broken in NLM. Before I kill you
> with the log, here’s the result:
> 
>   With NLM the latency of a request is a function of the raw bandwidth 
>   (not so with OLM), and NLM used only half my bandwidth after it had 
>   been deployed for 2 days (at the start much more).
> 
> τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue time 
> (time to find the node), nlm: new load management, olm: old load management.
> 
> So first step: make sure all bandwidth gets used - maybe by allocating more 
> slots till we use all allowed bandwidth. Better having to throttle a transfer 
> than not using bandwidth.
> 
> *With the current network, NLM should be slower than OLM by 23%. But in 18
> months it should actually be faster by ~8% — given Moore’s Law holds for
> upload bandwidth — because the routes are shorter.*
> 
> The main advantage of NLM is that it should be much more resilient against
> attackers (DoS).
> 
> Now to the log - it’s math and not cleaned up; you have been warned :)
> 
>  SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf. 
>  queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, 
> w: success rate
>  ψs = τ(h) + q(h)
>  ψf = q(h)
>  ψ ~ w₁·ψs + (1-w₁)·ψf
>  σs = τ(h) + q(h)
>  σf = q(h)
>  σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
>  num(ψ) / num(σ) ~ 1
>  → time ~ σ + ψ
>  q(h) depends on timeouts, as do w₁ and w₂
>  time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + (1-w₂)·ψf
>  = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-
> w₂)·q(h)
>  = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
>  = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
>  in the congestion case q(h) ~ timeout
>  timeout = o
>  timeout: o
>  w depends on the timeout *somehow*, but inversely
>  o=0 → w=0
>  assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
>  assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
>  correction: in the congestion case: q(h) ~ min(timeout, τ(h))
>  timeout matters for q(h) only when timeout < τ(h)
>  I try to: I still need a dependency of w on timeout
>  … lets call it t(w)
>  better: w(o) :)
>  well, if there is a timeout, we have a fixed time, but we reduce the 
> hops ...
>  i thought w was success rate
>  ah! 
>  and the success rates were in the NLM stats
>  going mostly smoothly from 60% to 0%
>  for the HTL
>  right, success rate peaks at 18 or sometimes 16
>  what are w1 vs w2?
>  chk vs ssk i guess
>  yes
> -*- toad_ thinks considering both is probably overambitious at this stage?
>  should not be too bad: SSKs drop much more rapidly at decreasing 
> hops
>  hops→HTL
>  ψs is time for a successful chk; ψf is time for a failed chk ... in 
> which case h in the first instance is low, and in the second instance is h0
>  yes
>  okay, i don't follow this line: time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + 
> (1-w₂)·ψf
>  i thought w2 related to SSKs?
>  uh, yes…
>  time =  w₁·ψs + (1-w₁)·ψf +  w₂·σs + (1-w₂)·σf
>  you have to appreciate i'm only just getting back into maths and 
> physics after 12 years ...
>  (retaking a-levels to get a degree)
>  no probs, I’m also no expert in this. I try to get a relation 
> between the time and the timeout, so we can try to find a minimum
>  in any case, there are two different h's for the two

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Arne Babenhauserheide
On Monday 29 Aug 2011 14:32:13 Ian Clarke wrote:
> Yes, small tweaks have worked so well for us for the last decade, leaving us
> pretty-much where we were in 2003.  No, we don't understand how the current
> system works, there is no point in trying to fix something when we don't
> even know what is broken.

I’d like to present a clue about what is broken in NLM. Before I kill you with
the log, here’s the result:


With NLM the latency of a request is a function of the raw bandwidth
(not so with OLM), and NLM used only half my bandwidth after it had
been deployed for 2 days (at the start much more).


τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue time
(time to find the node), nlm: new load management, olm: old load management.



So first step: make sure all bandwidth gets used - maybe by allocating more
slots till we use all allowed bandwidth. Better having to throttle a transfer
than not using bandwidth.


*With the current network, NLM should be slower than OLM by 23%. But in 18
months it should actually be faster by ~8% — given Moore’s Law holds for
upload bandwidth — because the routes are shorter.*


The main advantage of NLM is that it should be much more resilient against
attackers (DoS).


Now to the log - it’s math and not cleaned up; you have been warned :)


 SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf.
 queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀,
w: success rate
 ψs = τ(h) + q(h)
 ψf = q(h)
 ψ ~ w₁·ψs + (1-w₁)·ψf
 σs = τ(h) + q(h)
 σf = q(h)
 σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
 num(ψ) / num(σ) ~ 1
 → time ~ σ + ψ
 q(h) depends on timeouts, as do w₁ and w₂
 time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + (1-w₂)·ψf
 = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-
w₂)·q(h)
 = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
 = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
 in the congestion case q(h) ~ timeout
 timeout = o
 timeout: o
 w depends on the timeout *somehow*, but inversely
 o=0 → w=0
 assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
 assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
 correction: in the congestion case: q(h) ~ min(timeout, τ(h))
 timeout matters for q(h) only when timeout < τ(h)
 I try to: I still need a dependency of w on timeout
 … lets call it t(w)
 better: w(o) :)
 well, if there is a timeout, we have a fixed time, but we reduce the
hops ...
 i thought w was success rate
 ah!
 and the success rates were in the NLM stats
 going mostly smoothly from 60% to 0%
 for the HTL
 right, success rate peaks at 18 or sometimes 16
 what are w1 vs w2?
 chk vs ssk i guess
 yes
-*- toad_ thinks considering both is probably overambitious at this stage?
 should not be too bad: SSKs drop much more rapidly at decreasing
hops
 hops→HTL
 ψs is time for a successful chk; ψf is time for a failed chk ... in
which case h in the first instance is low, and in the second instance is h0
 yes
 okay, i don't follow this line: time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs +
(1-w₂)·ψf
 i thought w2 related to SSKs?
 uh, yes…
 time =  w₁·ψs + (1-w₁)·ψf +  w₂·σs + (1-w₂)·σf
 you have to appreciate i'm only just getting back into maths and
physics after 12 years ...
 (retaking a-levels to get a degree)
 no probs, I’m also no expert in this. I try to get a relation
between the time and the timeout, so we can try to find a minimum
 in any case, there are two different h's for the two uses of q(h) - h0
and h_avg
 h_avg for success and h0 for failure
 hm, yes
 which makes this harder…
 it’s wrong anyway… the q(h_avg) was missing
 h_avg is somewhere between 5 and 10 imho
 at least it is if everything is working well and the input load isn't
all really popular stuff (in which case it's answered quickly and can be
ignored)
 = τ(h) · (w₁+w₂) + q(h) · (2-w₁-w₂) + q(h_avg) · (w₁+w₂)
 would have been too easy :)
 okay so here q(h) means q(h0) i.e. h = h0, max hops?
 jepp, and max hops sinks with falling timeout
 hmm?
 the max actual hops
 on the upside, q() is linear
 hopefully
 yes: q(h) = h·o
 (in the congestion case)
 the problem i have is it looks like q(1) ~= time [ for a full request
], unless load is very low
 so τ « q?
 τ much smaller than q?
 of course it's bounded by timeouts, but i'd expect a runaway feedback
loop until it reaches heavy timeouts and effectively cuts the htl
 well, with OLM, success time for a CHK is 1m25s, unsuccessful is
19sec, so transfer time is at least 1 minute
 and less than 1m25; but with NLM, unsuccessful is 3 min+
 well, for SSKs in OLM the first 6 hops are the most successful, later
ones only contribute 1% success, which piles up to ~ 12%
 okay...
 (only possible because 85% are unsuccessful)
 (otherwise this would be wrong: the contribution of later ones would
be smaller)
 from the numbers q(h₀) ~ τ(h_avg)
 well, queueing time on any hop is the time it takes to get a slot to
route to, which is roughly equal to the time it takes for a request to
complete divided b

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
On Mon, Aug 29, 2011 at 2:21 PM, Matthew Toseland wrote:

> > >- What is the average load reported by responses this node forwarded,
> > >per remote node
> >
> > Ahhh, this one could be interesting - you could use it to penalise nodes
> > which spam excessively.
>
> Actually, thinking about it, I don't think this will work. Requests are
> usually distributed evenly across the keyspace, and downstream nodes only
> know they are from this node - not which previous node they are from. So the
> pain will be shared across all the peers sending us requests, including our
> own local requests, and won't be identifiable as due to any single node. You
> have to do it by volume, not by reported load.
>

But if a node is being abusive, won't the nodes it's requesting from tend to
have a higher than average load?  Perhaps you are right though, but this is
the kind of thought-process we need.

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: i...@freenetproject.org

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland wrote:

> On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > Ok, thinking about it, here is a proposal, or rather, the beginning of a
> > proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
> > else intended to control load, and replace it with this.  We will
> > absolutely need to simulate this before we write a single line of code to
> > deploy it.
>
> I have justified fair sharing in my other reply. However alternate
> mechanisms might replace it.
>
> I agree we should simulate, although this might be significant work.
>

I doubt it would be more work than trying it without simulation and failing
for a decade, which was our previous approach.


> > The core idea is that a node will include a floating point number in
> > response to any kind of request showing how close that node is to being
> > overloaded.  0.0 would mean it's doing nothing, 1.0 would mean that it is
> > completely saturated and must reject requests.  Clearly the goal is to
> > avoid getting anywhere near to 1.0.
> >
> > A node tracks several things:
>
> All of these apply to both locally originated requests and remotely
> originated requests? And are they for only our direct peers?
>

Every peer along the reply path will take note of the load value, so it will
be a mixture of the loads of close-by peers, and more distant peers.
 Obviously, since we are more likely to receive messages from peers that are
"closer", our average will automatically be biased towards closer peers.

> >- What is the average load reported by responses this node forwarded,
> >per remote node
>
> Ahhh, this one could be interesting - you could use it to penalise nodes
> which spam excessively.
>

Exactly.


> > I think, given these metrics, we should be able to do the following:
> >
> >- Limit our overall rate of initiating local requests based on the
> >global average reported load
>
> We can already do this on the basis of AIMD's, based purely on behaviour of
> local requests. However, taking more data into account might result in more
> accurate results - although it might also dilute the feedback.
>

Right, AIMD tends to control the rate at which a specific event occurs, like
a dropped IP packet with TCP, or a dropped request as we currently use it.
 This is a bit like signaling that you have a headache by banging your head
against the wall.  It is also a rather coarse way to deliver feedback.
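
For reference, the AIMD behaviour being contrasted, in sketch form (parameters
assumed):

class AIMD:
    """Additive increase, multiplicative decrease on discrete 'pain' events."""

    def __init__(self, rate=1.0, increase=0.1, decrease=0.5):
        self.rate = rate          # requests per second we allow ourselves
        self.increase = increase  # added while things go well
        self.decrease = decrease  # multiplier on a drop/rejection

    def on_success(self):
        self.rate += self.increase

    def on_drop(self):            # feedback only arrives as a dropped request
        self.rate *= self.decrease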

A floating point load number conveys more information in a non-destructive
way.

> In any case, this part is "safe": it's not going to cause feedback problems,
> on its own. Conversely, it doesn't deal with problems such as network
> bottlenecks.
>

Well, it depends.  If we are tracking load on a per-outbound-connection
basis then we can potentially take action to manage load per-connection in
addition to overall.


> However, while AIMD's are based on the *global* average load (or rather the
> load of all the nodes along the chain), it sounds like you are proposing to
> only report the *local* load of our direct peers? How are you going to
> compute the global average load?
>

No, it's not the local load of our direct peers.  It's the load reported in
all reply messages that we forward; some of those will be from our direct
peers, but most will be from remote peers.


> >- Limit our rate of local requests based on the average load of the
> >connection to the peer it would need to be forwarded to
>
> Not sure I follow here. Are you proposing that we initiate more requests in
> keyspace areas where there is lower load?
>

Well, kinda, except the opposite.  We might reduce requests in keyspaces
where there is higher load.


> >- Detect when remote peers are abusing the system by disregarding load -
> >as evidenced by a significantly higher average load in replies forwarded
> >to them
>
> Okay, that is useful - although I doubt it would be an unambiguous
> detection, and what would we do about it? We would have to reject requests,
> and if we're not sure, or if it could be a downstream troublemaker, couldn't
> that result in misrouting and make things worse? On the other hand, any
> malicious node would be most easily detected by the nodes directly connected
> to it, so this could work much the same way as fair sharing - provided that
> it can reject or terminate the requests.
>

No, we probably wouldn't start rejecting requests - more likely we would
disconnect from that node.  Obviously we'd need to be fairly sure of abusive
behavior before doing that - but abuse should be rare.


> Arguably fair sharing achieves the same objective. It could be extended in
> various ways e.g. if we are relaying a lot of slow-down notices to a peer we
> could start to think about squeezing the peer's allocation.
>

Yes, except it doesn't work, and it's too complicated to reason about.


> > Of course, lots of questions:
> >
> >- How do we compute the averages?  A decaying running average of some
> >kind?  What time period?

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 20:09:58 Matthew Toseland wrote:
> On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > Ok, thinking about it, here is a proposal, or rather, the beginning of a
> > proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
> > else intended to control load, and replace it with this.  We will absolutely
> > need to simulate this before we write a single line of code to deploy it.
> 
> I have justified fair sharing in my other reply. However alternate mechanisms 
> might replace it.
> 
> I agree we should simulate, although this might be significant work.
> > 
> > The core idea is that a node will include a floating point number in
> > response to any kind of request showing how close that node is to being
> > overloaded.  0.0 would mean it's doing nothing, 1.0 would mean that it is
> > completely saturated and must reject requests.  Clearly the goal is to avoid
> > getting anywhere near to 1.0.
> > 
> > A node tracks several things:
> 
> All of these apply to both locally originated requests and remotely 
> originated requests? And are they for only our direct peers?
> > 
> >- What is the overall average load reported by responses this node has
> >received
> >- What is the average load reported by responses this node has received,
> >per remote node
> >- What is the average load reported by responses this node forwarded, per
> >remote node
> 
> Ahhh, this one could be interesting - you could use it to penalise nodes 
> which spam excessively.

Actually, thinking about it, I don't think this will work. Requests are usually 
distributed evenly across the keyspace, and downstream nodes only know they are 
from this node - not which previous node they are from. So the pain will be 
shared across all the peers sending us requests, including our own local 
requests, and won't be identifiable as due to any single node. You have to do 
it by volume, not by reported load.



Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> Ok, thinking about it, here is a proposal, or rather, the beginning of a
> proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
> else intended to control load, and replace it with this.  We will absolutely
> need to simulate this before we write a single line of code to deploy it.

I have justified fair sharing in my other reply. However alternate mechanisms 
might replace it.

I agree we should simulate, although this might be significant work.
> 
> The core idea is that a node will include a floating point number in
> response to any kind of request showing how close that node is to being
> overloaded.  0.0 would mean it's doing nothing, 1.0 would mean that it is
> completely saturated and must reject requests.  Clearly the goal is to avoid
> getting anywhere near to 1.0.
> 
> A node tracks several things:

All of these apply to both locally originated requests and remotely originated 
requests? And are they for only our direct peers?
> 
>- What is the overall average load reported by responses this node has
>received
>- What is the average load reported by responses this node has received,
>per remote node
>- What is the average load reported by responses this node forwarded, per
>remote node

Ahhh, this one could be interesting - you could use it to penalise nodes which 
spam excessively.
> 
> I think, given these metrics, we should be able to do the following:
> 
>- Limit our overall rate of initiating local requests based on the global
>average reported load

We can already do this on the basis of AIMD's, based purely on behaviour of 
local requests. However, taking more data into account might result in more 
accurate results - although it might also dilute the feedback.

In any case, this part is "safe": it's not going to cause feedback problems, on 
its own. Conversely, it doesn't deal with problems such as network bottlenecks.

However, while AIMD's are based on the *global* average load (or rather the 
load of all the nodes along the chain), it sounds like you are proposing to 
only report the *local* load of our direct peers? How are you going to compute 
the global average load?

>- Limit our rate of local requests based on the average load of the
>connection to the peer it would need to be forwarded to

Not sure I follow here. Are you proposing that we initiate more requests in 
keyspace areas where there is lower load?

>- Detect when remote peers are abusing the system by disregarding load -
>as evidenced by a significantly higher average load in replies forwarded to
>them

Okay, that is useful - although I doubt it would be an unambiguous detection, 
and what would we do about it? We would have to reject requests, and if we're 
not sure, or if it could be a downstream troublemaker, couldn't that result in 
misrouting and make things worse? On the other hand, any malicious node would 
be most easily detected by the nodes directly connected to it, so this could 
work much the same way as fair sharing - provided that it can reject or 
terminate the requests.

Arguably fair sharing achieves the same objective. It could be extended in 
various ways e.g. if we are relaying a lot of slow-down notices to a peer we 
could start to think about squeezing the peer's allocation.
> 
> Of course, lots of questions:
> 
>- How do we compute the averages?  A decaying running average of some
>kind?  What time period?

Yay more tunables. :(
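
For what it's worth, the standard candidate is an exponentially decaying
average, which at least has only one tunable, the half-life (a sketch):

import math

class DecayingAverage:
    def __init__(self, half_life_s):
        self.decay = math.log(2) / half_life_s
        self.value = 0.0
        self.last = None          # timestamp of the last report

    def report(self, x, now_s):
        if self.last is None:
            self.value = x
        else:
            w = math.exp(-self.decay * (now_s - self.last))  # weight of history
            self.value = w * self.value + (1 - w) * x
        self.last = now_s
        return self.value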

>- How do we translate load into a desired rate of requests?

And even more tunables. :(

>- What criteria indicate that a peer is abusing the system?  What is the
>remedy?

In the current system, don't accept their requests.

In ANY system, there will always be a borderline below which they can get away 
with it. We can either try to make this as small as possible or try not to hit 
legitimate traffic too hard.
> 
> This is basically control theory stuff, and we'll need robust answers to
> these questions before we proceed (ideally with a theoretical rather than
> experimental foundation).

IMHO small tweaks to AIMDs, combined with fair sharing, will achieve much the 
same without having to wait years for our theoreticians to wake up.

