[freenet-dev] Beyond New Load Management: Thread Usage

2011-09-06 Thread re...@regin.de
Hi Toad!

First: thx for your work and don't give up the NLM!

First impressions were positive overall when I started activating NLM on
this node around build 138x. My in/out+HTL success rates went up noticeably.
Backoffs stayed low. Only thread usage rose a bit, +20-30%.

Several days and versions later, thread usage repeatedly climbed beyond
my limit some time after each node restart. I kept raising the limit: to
500, 750, 1000... Then - when NLM was deployed network-wide with 139x -
my thread usage went completely insane (>1250) and stayed there!

I stopped local activity like FT, Sone and filesharing and restarted the node,
but this didn't matter much - thread usage was climbing fast again
_after some time_. Every time the thread limit was reached my node got
slow, downloads were stalling, backoffs and RejectedOverloads were
increasing fast, RequestSender >50%, and clients got disconnected.

So maybe this happened to many people, especially those using the default
thread setting. Then most nodes were running on the edge of their local
thread queue and continuously dropping threads. The whole network's traffic
could have suffered from "randomly" dropped/never-started transfers, which
in turn raised network-wide thread usage itself...

I did not look into the code - does this make sense? Maybe the thread
queue/handling just needs some prioritisation/optimisation?


In the meantime I have reactivated NLM here with 1401: my node seems to run
better than ever - and has for 3 days now. Threads <450, HTL >55%,
outbound constantly at about 75% of the set limit, low backoffs & Co.


cya
Regin








-- next part --
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: OpenPGP digital signature
URL: 



[freenet-dev] Beyond New Load Management: A proposal

2011-08-31 Thread Arne Babenhauserheide
On Wednesday, 31 August 2011, 13:25:35, Matthew Toseland wrote:
> On Tuesday 30 Aug 2011 21:02:54 Arne Babenhauserheide wrote:
> > what this means: if a SSK has a mean success rate of 0.16, then using
> > 0.25 as value makes sure that 95% of the possible cases don't exhaust
> > the bandwidth. We then use only 64% of the bandwidth on average,
> > though. With 0.2, we'd get 68% of the possible distributions safe and
> > use 80% of bandwidth on average.
> Only if there are no clusters - a single requestor fetches a bunch of stuff
> that is all already there, rather than polling keys that usually aren't.
> IMHO there will be. For instance, when a new chat client is started up, a
> lot of what it does will be fetching existing messages rather than polling
> for new ones.

But these are in cache, so the routes will be very short. And SSKs with very 
high HTL (18, 17, 16) have good success rates, so this code won't affect them 
much less than low-HTL SSKs (the unsuccessful ones). After HTL 15 or so, their 
success rate drops massively - which is quite easy to explain: The content 
mostly isn't there. 

Simply take the success-rate per HTL - as in the statistics window. 

Best wishes, 
Arne
-- next part --
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 316 bytes
Desc: This is a digitally signed message part.
URL: 



[freenet-dev] Beyond New Load Management

2011-08-31 Thread Matthew Toseland
On Tuesday 30 Aug 2011 18:27:03 Ian Clarke wrote:
> On Tue, Aug 30, 2011 at 11:49 AM, Robert Hailey
> wrote:
> 
> > On 2011/08/29 (Aug), at 12:58 PM, Ian Clarke wrote:
> >
> > The problem is that we come up with solutions that are too complicated to
> > analyze or fix when they don't work
> >
> > The cause is complexity, which just grows and grows as we try to fix
> > problems we don't understand by layering on more complexity.
> >
> >
> > And what is the cause of that? Is the problem really one of *behavior*?
> > emotions? workflow? organization?
> 
> A combination, but at its core I think its a failure to recognize that most
> working systems are built on principles that are simple enough to understand
> that they are either "obviously" correct, or can easily be proven to be
> correct.  For example, AIMD, which is TCP's load management system, is
> simple enough that you can describe the basic idea in a sentence.  It is
> also fairly obvious that it will work, or at least its easy to simulate it
> and convince yourself that it will work.

TCP as a whole is not simple.
> 
> In contrast, Freenet's current load management system is a combination of
> mechanisms that are perhaps individually of comparable complexity to AIMD
> (in fact, AIMD is part of it although used in quite a different context to
> how it is used in TCP), but together they are far more complex, and interact
> in ways that are hard to predict.
> 
> For example, in a conversation with Matthew a few days ago, he realized that
> the "fair sharing" mechanism which tries to allocate resources fairly among
> "downstream" nodes, was probably generating a bunch of rejected messages
> when it transitions from one mode of operation to another, with all kinds of
> negative consequences.  Who knows how many other unintended interactions
> there are?

TCP has never had major architectural bugs.

Except for all the ones it has had. For instance, it took many years to figure 
out that increasing the window size when you're not using it fully is a bad 
idea.
> 
> > Matthew has presented some very real problems which he is trying to work
> > around (with much frustration, I'm sure). I think he needs more leverage
> 
> My criticism is that Matthew's proposals for "fixing" the problem follow
> exactly the same pattern of all the previous "fixes" that turned out to have
> either no effect, or a negative effect.  It starts out with a bunch of
> hand-wavey hypotheses about what the problem might be, which is followed by
> a bunch of fixes for what might be the problem.  There is no evidence that
> these problems are real, I think a big part of the reason is that the
> existing system is too complicated to debug.

There is ample evidence that these problems are real. For instance, there is 
the classic result that triple inserting is far, far, far more effective in 
terms of data persistence than single inserting - FOR THE SAME KEY! Given that 
inserts are always routed the same way, the only possible explanation is that 
the insert is being rejected by many of the nodes on the first attempt, and 
inserting it 3 times means either it is routed better, or it is accepted by the 
right nodes - either way, spurious rejections are the problem.
> 
> I know it sounds like I'm beating up on Matthew here, and that isn't my
> intention, he is following a precedent set by me and others that have long
> since left the project.  Having (I hope) learned some tough lessons, and
> recognized the folly in our past approach, I'm now hoping its not too late
> to rescue the project from an ignominious death.  I think we're talking
> about weeks of work here, not months, and frankly I don't think we've got a
> choice if the project is to survive.  Freenet is no use to anyone if it
> doesn't work, regardless of how cool our goals are, or how clever our
> algorithms and heuristics.

I think we are talking about months. We'll be arguing about it for weeks, it'll 
take weeks to write good simulations and a lot more weeks to tweak them until 
they actually work, and then it'll take months to implement a new system. For 
instance, NLM was originally intended to be relatively simple. But it turned 
out that not only was it more complex in practice than in theory, it also 
touched on a lot of other areas (mostly related to predicting load on our 
peers). The same might be true of any new mechanism, although admittedly many 
of the bugs are gone now. I have explained how NLM could be tweaked to be 
workable elsewhere. I agree that AIMD's aren't ideal but anything else would be 
much more work, so can be postponed for now. And ArneBab has a basic 
mathematical model which explains why NLM had such problems, and why it should 
be fixable.
> 
> > If you read my suggestion below, we can discuss how it would allow:
> >
> > With an investment of developer time, we could separate the current freenet
> > code into three interfaced sections (link-layer, routing-layer,
> > 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-31 Thread Matthew Toseland
On Tuesday 30 Aug 2011 21:02:54 Arne Babenhauserheide wrote:
> On Tuesday, 30 August 2011, 01:08:16, Arne Babenhauserheide wrote:
> > 5) solution: count each SSK as only
> > average_SSK_success_rate * data_to_transfer_on_success.
> 
> Some more data: 
> 
> chances of having at least this many successful transfers for 40 SSKs with a 
> mean success rate of 16%: 
> 
> for i in {0..16}; do echo $i $(./spielfaehig.py 0.16 40 $i); done
> 
> 0 1.0
> 1 0.999064224991
> 2 0.99193451064
> 3 0.965452714478
> 4 0.901560126912
> 5 0.788987472629
> 6 0.634602118184
> 7 0.463062835467
> 8 0.304359825607
> 9 0.179664603573
> 10 0.0952149293922
> 11 0.0453494074947
> 12 0.0194452402752
> 13 0.00752109980912
> 14 0.0026291447461
> 15 0.000832100029072
> 16 0.00023879002726
> 
> what this means: if a SSK has a mean success rate of 0.16, then using 0.25 as 
> value makes sure that 95% of the possible cases don't exhaust the bandwidth. 
> We then use only 64% of the bandwidth on average, though. With 0.2, we'd get 
> 68% of the possible distributions safe and use 80% of bandwidth on average.

Only if there are no clusters - a single requestor fetches a bunch of stuff 
that is all already there, rather than polling keys that usually aren't. IMHO 
there will be. For instance, when a new chat client is started up, a lot of 
what it does will be fetching existing messages rather than polling for new 
ones.
-- next part --
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: 



Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-31 Thread Matthew Toseland
On Tuesday 30 Aug 2011 21:02:54 Arne Babenhauserheide wrote:
 On Tuesday, 30 August 2011, 01:08:16, Arne Babenhauserheide wrote:
  5) solution: count each SSK as only
  average_SSK_success_rate * data_to_transfer_on_success.
 
 Some more data: 
 
 chances of having at least this many successful transfers for 40 SSKs with a 
 mean success rate of 16%: 
 
 for i in {0..16}; do echo $i $(./spielfaehig.py 0.16 40 $i); done
 
 0 1.0
 1 0.999064224991
 2 0.99193451064
 3 0.965452714478
 4 0.901560126912
 5 0.788987472629
 6 0.634602118184
 7 0.463062835467
 8 0.304359825607
 9 0.179664603573
 10 0.0952149293922
 11 0.0453494074947
 12 0.0194452402752
 13 0.00752109980912
 14 0.0026291447461
 15 0.000832100029072
 16 0.00023879002726
 
 what this means: if a SSK has a mean success rate of 0.16, then using 0.25 as 
 value makes sure that 95% of the possible cases don’t exhaust the bandwidth. 
 We then use only 64% of the bandwidth on average, though. With 0.2, we’d get 
 68% of the possible distributions safe and use 80% of bandwidth on average.

Only if there are no clusters - a single requestor fetches a bunch of stuff 
that is all already there, rather than polling keys that usually aren't. IMHO 
there will be. For instance, when a new chat client is started up, a lot of 
what it does will be fetching existing messages rather than polling for new 
ones.


signature.asc
Description: This is a digitally signed message part.
___
Devl mailing list
Devl@freenetproject.org
http://freenetproject.org/cgi-bin/mailman/listinfo/devl

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-31 Thread Matthew Toseland
On Monday 29 Aug 2011 20:34:01 Ian Clarke wrote:
 On Mon, Aug 29, 2011 at 2:21 PM, Matthew Toseland t...@amphibian.dyndns.org
  wrote:
 
- What is the average load reported by responses this node
  forwarded, per
   remote node
  
   Ahhh, this one could be interesting - you could use it to penalise nodes
  which spam excessively.
 
  Actually, thinking about it, I don't think this will work. Requests are
  usually distributed evenly across the keyspace, and downstream nodes only
  know they are from this node - not which previous node they are from. So the
  pain will be shared across all the peers sending us requests, including our
  own local requests, and won't be identifiable as due to any single node. You
  have to do it by volume, not by reported load.
 
 But if a node is being abusive won't the nodes its requesting from tend to
 have a higher than average load?  

If its requests differ from the specialisation of the node in general, and so are 
routed to a particular subset, then they might be different. But this is not 
necessarily true if it is a deliberate attack.

 Perhaps you are right though, but this is
 the kind of thought-process we need.

To prevent a node from using an excessive proportion of our capacity, we need 
to *measure the proportion of our capacity that the node uses*. This much is 
straightforward IMHO. And that's the basis of fair sharing between peers.
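
To make that concrete, here is a rough Python sketch (purely illustrative, not
Freenet's actual accounting; class and field names are made up) of measuring the
proportion of our capacity each peer uses, with a decaying per-peer byte counter:

import time
from collections import defaultdict

class PeerUsageTracker:
    # track a decayed count of bytes spent on each peer's requests
    def __init__(self, halflife_seconds=300.0):
        self.halflife = halflife_seconds
        self.usage = defaultdict(float)   # peer id -> decayed byte count
        self.last_update = time.time()

    def _decay(self):
        now = time.time()
        factor = 0.5 ** ((now - self.last_update) / self.halflife)
        for peer in self.usage:
            self.usage[peer] *= factor
        self.last_update = now

    def record(self, peer, bytes_used):
        # call whenever we accept or complete a request from this peer
        self._decay()
        self.usage[peer] += bytes_used

    def proportion(self, peer):
        # fraction of our recent capacity consumed by this peer
        self._decay()
        total = sum(self.usage.values())
        return self.usage[peer] / total if total > 0 else 0.0

Fair sharing could then throttle or reject a peer whose proportion stays far
above 1/(number of peers).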


signature.asc
Description: This is a digitally signed message part.
___
Devl mailing list
Devl@freenetproject.org
http://freenetproject.org/cgi-bin/mailman/listinfo/devl

Re: [freenet-dev] Beyond New Load Management

2011-08-31 Thread Matthew Toseland
On Tuesday 30 Aug 2011 18:27:03 Ian Clarke wrote:
 On Tue, Aug 30, 2011 at 11:49 AM, Robert Hailey
 rob...@freenetproject.orgwrote:
 
  On 2011/08/29 (Aug), at 12:58 PM, Ian Clarke wrote:
 
  The problem is that we come up with solutions that are too complicated to
  analyze or fix when they don't work
 
  The cause is complexity, which just grows and grows as we try to fix
  problems we don't understand by layering on more complexity.
 
 
  And what is the cause of that? Is the problem really one of *behavior*?
  emotions? workflow? organization?
 
 A combination, but at its core I think its a failure to recognize that most
 working systems are built on principles that are simple enough to understand
 that they are either obviously correct, or can easily be proven to be
 correct.  For example, AIMD, which is TCP's load management system, is
 simple enough that you can describe the basic idea in a sentence.  It is
 also fairly obvious that it will work, or at least its easy to simulate it
 and convince yourself that it will work.

TCP as a whole is not simple.
 
 In contrast, Freenet's current load management system is a combination of
 mechanisms that are perhaps individually of comparable complexity to AIMD
 (in fact, AIMD is part of it although used in quite a different context to
 how it is used in TCP), but together they are far more complex, and interact
 in ways that are hard to predict.
 
 For example, in a conversation with Matthew a few days ago, he realized that
 the fair sharing mechanism which tries to allocate resources fairly among
 downstream nodes, was probably generating a bunch of rejected messages
 when it transitions from one mode of operation to another, with all kinds of
 negative consequences.  Who knows how many other unintended interactions
 there are?

TCP has never had major architectural bugs.

Except for all the ones it has had. For instance, it took many years to figure 
out that increasing the window size when you're not using it fully is a bad 
idea.
 
  Matthew has presented some very real problems which he is trying to work
  around (with much frustration, I'm sure). I think he needs more leverage
 
 My criticism is that Matthew's proposals for fixing the problem follow
 exactly the same pattern of all the previous fixes that turned out to have
 either no effect, or a negative effect.  It starts out with a bunch of
 hand-wavey hypotheses about what the problem might be, which is followed by
 a bunch of fixes for what might be the problem.  There is no evidence that
 these problems are real, I think a big part of the reason is that the
 existing system is too complicated to debug.

There is ample evidence that these problems are real. For instance, there is 
the classic result that triple inserting is far, far, far more effective in 
terms of data persistence than single inserting - FOR THE SAME KEY! Given that 
inserts are always routed the same way, the only possible explanation is that 
the insert is being rejected by many of the nodes on the first attempt, and 
inserting it 3 times means either it is routed better, or it is accepted by the 
right nodes - either way, spurious rejections are the problem.
 
 I know it sounds like I'm beating up on Matthew here, and that isn't my
 intention, he is following a precedent set by me and others that have long
 since left the project.  Having (I hope) learned some tough lessons, and
 recognized the folly in our past approach, I'm now hoping its not too late
 to rescue the project from an ignominious death.  I think we're talking
 about weeks of work here, not months, and frankly I don't think we've got a
 choice if the project is to survive.  Freenet is no use to anyone if it
 doesn't work, regardless of how cool our goals are, or how clever our
 algorithms and heuristics.

I think we are talking about months. We'll be arguing about it for weeks, it'll 
take weeks to write good simulations and a lot more weeks to tweak them until 
they actually work, and then it'll take months to implement a new system. For 
instance, NLM was originally intended to be relatively simple. But it turned 
out that not only was it more complex in practice than in theory, it also 
touched on a lot of other areas (mostly related to predicting load on our 
peers). The same might be true of any new mechanism, although admittedly many 
of the bugs are gone now. I have explained how NLM could be tweaked to be 
workable elsewhere. I agree that AIMD's aren't ideal but anything else would be 
much more work, so can be postponed for now. And ArneBab has a basic 
mathematical model which explains why NLM had such problems, and why it should 
be fixable.
 
  If you read my suggestion below, we can discuss how it would allow:
 
  With an investment of developer time, we could separate the current freenet
  code into three interfaced sections (link-layer, routing-layer,
  user/client-interface-layer).
 
  If we then were to modify the outer layers to accept 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-31 Thread Arne Babenhauserheide
On Wednesday, 31 August 2011, 13:25:35, Matthew Toseland wrote:
 On Tuesday 30 Aug 2011 21:02:54 Arne Babenhauserheide wrote:
  what this means: if a SSK has a mean success rate of 0.16, then using
  0.25 as value makes sure that 95% of the possible cases don’t exhaust
  the bandwidth. We then use only 64% of the bandwidth on average,
  though. With 0.2, we’d get 68% of the possible distributions safe and
  use 80% of bandwidth on average.
 Only if there are no clusters - a single requestor fetches a bunch of stuff
 that is all already there, rather than polling keys that usually aren't.
 IMHO there will be. For instance, when a new chat client is started up, a
 lot of what it does will be fetching existing messages rather than polling
 for new ones.

But these are in cache, so the routes will be very short. And SSKs with very
high HTL (18, 17, 16) have good success rates, so this code won’t affect them
much less than low-HTL SSKs (the unsuccessful ones). After HTL 15 or so, their
success rate drops massively - which is quite easy to explain: The content
mostly isn’t there.

Simply take the success-rate per HTL - as in the statistics window.

Best wishes,
Arne


signature.asc
Description: This is a digitally signed message part.
___
Devl mailing list
Devl@freenetproject.org
http://freenetproject.org/cgi-bin/mailman/listinfo/devl

[freenet-dev] Beyond New Load Management: A proposal

2011-08-30 Thread Arne Babenhauserheide
On Tuesday, 30 August 2011, 01:08:16, Arne Babenhauserheide wrote:
> 5) solution: count each SSK as only
>   average_SSK_success_rate * data_to_transfer_on_success.

Some more data: 

chances of having at least this many successful transfers for 40 SSKs with a 
mean success rate of 16%: 

for i in {0..16}; do echo $i $(./spielfaehig.py 0.16 40 $i); done

0 1.0
1 0.999064224991
2 0.99193451064
3 0.965452714478
4 0.901560126912
5 0.788987472629
6 0.634602118184
7 0.463062835467
8 0.304359825607
9 0.179664603573
10 0.0952149293922
11 0.0453494074947
12 0.0194452402752
13 0.00752109980912
14 0.0026291447461
15 0.000832100029072
16 0.00023879002726

what this means: if a SSK has a mean success rate of 0.16, then using 0.25 as 
value makes sure that 95% of the possible cases don't exhaust the bandwidth. 
We then use only 64% of the bandwidth on average, though. With 0.2, we'd get 
68% of the possible distributions safe and use 80% of bandwidth on average.

Note: this is just a binomial spread: 

from math import factorial
fac = factorial

def nük(n, k):
    # binomial coefficient: n choose k
    if k > n: return 0
    return fac(n) / (fac(k)*fac(n-k))

def binom(p, n, k):
    # probability of exactly k successes out of n trials with success rate p
    return nük(n, k) * p**k * (1-p)**(n-k)

def spielfähig(p, n, min_spieler):
    # probability of at least min_spieler successes out of n trials
    return sum([binom(p, n, k) for k in range(min_spieler, n+1)])
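
For reference, two rows of the table above can be reproduced with these
functions (values as in the table, rounded):

print(spielfähig(0.16, 40, 10))   # ≈ 0.0952 - chance of at least 10 successes
print(spielfähig(0.16, 40, 11))   # ≈ 0.0453 - chance of at least 11 successes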


→ USK@6~ZDYdvAgMoUfG6M5Kwi7SQqyS-
gTcyFeaNN1Pf3FvY,OSOT4OEeg4xyYnwcGECZUX6~lnmYrZsz05Km7G7bvOQ,AQACAAE/bab/9/Content-
D426DC7.html


Best wishes, 
Arne
-- next part --
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 316 bytes
Desc: This is a digitally signed message part.
URL: 



[freenet-dev] Beyond New Load Management

2011-08-30 Thread Ian Clarke
On Tue, Aug 30, 2011 at 11:49 AM, Robert Hailey
wrote:

> On 2011/08/29 (Aug), at 12:58 PM, Ian Clarke wrote:
>
> The problem is that we come up with solutions that are too complicated to
> analyze or fix when they don't work
>
> The cause is complexity, which just grows and grows as we try to fix
> problems we don't understand by layering on more complexity.
>
>
> And what is the cause of that? Is the problem really one of *behavior*?
> emotions? workflow? organization?
>

A combination, but at its core I think it's a failure to recognize that most
working systems are built on principles that are simple enough to understand
that they are either "obviously" correct, or can easily be proven to be
correct.  For example, AIMD, which is TCP's load management system, is
simple enough that you can describe the basic idea in a sentence.  It is
also fairly obvious that it will work, or at least it's easy to simulate it
and convince yourself that it will work.
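
For readers who have not seen it spelled out, a minimal sketch of the AIMD rule
being referred to (illustrative Python, not TCP's or Freenet's actual code):

class AIMDWindow:
    def __init__(self, increase=1.0, decrease=0.5, minimum=1.0):
        self.window = minimum        # how many requests/packets may be in flight
        self.increase = increase
        self.decrease = decrease
        self.minimum = minimum

    def on_success(self):
        # additive increase: grow the window by a constant
        self.window += self.increase

    def on_congestion(self):
        # multiplicative decrease: cut the window by a factor
        self.window = max(self.minimum, self.window * self.decrease)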

In contrast, Freenet's current load management system is a combination of
mechanisms that are perhaps individually of comparable complexity to AIMD
(in fact, AIMD is part of it although used in quite a different context to
how it is used in TCP), but together they are far more complex, and interact
in ways that are hard to predict.

For example, in a conversation with Matthew a few days ago, he realized that
the "fair sharing" mechanism which tries to allocate resources fairly among
"downstream" nodes, was probably generating a bunch of rejected messages
when it transitions from one mode of operation to another, with all kinds of
negative consequences.  Who knows how many other unintended interactions
there are?


> Matthew has presented some very real problems which he is trying to work
> around (with much frustration, I'm sure). I think he needs more leverage
>

My criticism is that Matthew's proposals for "fixing" the problem follow
exactly the same pattern of all the previous "fixes" that turned out to have
either no effect, or a negative effect.  It starts out with a bunch of
hand-wavey hypotheses about what the problem might be, which is followed by
a bunch of fixes for what might be the problem.  There is no evidence that
these problems are real; I think a big part of the reason is that the
existing system is too complicated to debug.

I know it sounds like I'm beating up on Matthew here, and that isn't my
intention, he is following a precedent set by me and others that have long
since left the project.  Having (I hope) learned some tough lessons, and
recognized the folly in our past approach, I'm now hoping it's not too late
to rescue the project from an ignominious death.  I think we're talking
about weeks of work here, not months, and frankly I don't think we've got a
choice if the project is to survive.  Freenet is no use to anyone if it
doesn't work, regardless of how cool our goals are, or how clever our
algorithms and heuristics.


> If you read my suggestion below, we can discuss how it would allow:
>
> With an investment of developer time, we could separate the current freenet
> code into three interfaced sections (link-layer, routing-layer,
> user/client-interface-layer).
>
> If we then were to modify the outer layers to accept two routing-layers
> (e.g. client requests round-robin between the two but thereafter stay in
> that network) we could have "two networks in one" a stable-net (for the
> nay-sayers, a disaster/fallback, and as a control for measurement), and a
> development-net where experimentation could take place.
>
> Drawing the interface lines on theory (rather than present code-state)
> would be critical [e.g. load-balancing should be in the middle layer, imo].
> The goal being, reliable communication with near-guaranteed/methodical
> improvement
>
While I'm sure there is much room for improvement in the way the code is
architected, and the separation of concerns - I don't think refactoring is
the answer.  We need to refactor the fundamental way that we solve problems
like load balancing.

Ian.


-- 
Ian Clarke
Founder, The Freenet Project
Email: ian at freenetproject.org
-- next part --
An HTML attachment was scrubbed...
URL: 



[freenet-dev] Beyond New Load Management

2011-08-30 Thread Robert Hailey

On 2011/08/30 (Aug), at 3:17 AM, Thomas Bruderer wrote:

> Thank you Ian! Good message! I am 100% behind your whole post! The  
> routing must go back to a simple system!

Even tearing up the current systems for a fifo queue is destructive  
without organization and a means of comparison.

In the science of software, there are generally two kinds of commits:

* Features - divergent, potentially disruptive, tend to contain bugs
* Bugfixes - convergent, generally repairing feature disruption,  
rarely unmask other bugs

On 2011/08/29 (Aug), at 12:58 PM, Ian Clarke wrote:

> The problem is that we come up with solutions that are too  
> complicated to analyze or fix when they don't work
> The cause is complexity, which just grows and grows as we try to fix  
> problems we don't understand by layering on more complexity.

And what is the cause of that? Is the problem really one of behavior?  
emotions? workflow? organization?

Matthew has presented some very real problems which he is trying to  
work around (with much frustration, I'm sure). I think he needs more  
leverage

If you read my suggestion below, we can discuss how it would allow:

* a full-network-sized network to try out new features & bugfixes
* nearly-identical traffic on both networks
* zero traffic redundancy (network waste)
* easy 1:1 comparison of various performances of two implementations  
(for measuring network effectiveness)
* full network uptime (at worst a totally-broken / 0% unstable-side  
would yield a 50% reduction in effectiveness)

It is simply a matter of *organization* and *multi-versioning* (which  
IMO, are both solved problems).
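
As a hypothetical sketch of the round-robin split described in the quoted
proposal below (names invented for illustration):

import itertools

class SplitRouter:
    def __init__(self, stable_layer, development_layer):
        # alternate new client requests between the two routing layers
        self._layers = itertools.cycle([stable_layer, development_layer])

    def start_request(self, request):
        layer = next(self._layers)      # round-robin choice
        request.routing_layer = layer   # the request then stays on this layer
        return layer.route(request)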

This is also tied into the subject of 'mandatory' node update  
deadline, as the utility of a split network would diminish if its 
stable-side succumbs to the same 'chase' issues.

--
Robert Hailey


On 2011/08/26 (Aug), at 10:18 AM, Robert Hailey wrote:

>
> On 2011/08/25 (Aug), at 2:15 PM, Matthew Toseland wrote:
>
>> And we never, ever, ever, have enough data to evaluate a single  
>> build, even on the simplest metrics (see the push-pull tests). I  
>> could write a plugin to get more data, but digger3 promises to do  
>> it eventually and anyway I don't have time given the remaining  
>> funding and unlikeliness of getting more. And it's always been this  
>> way!
>>
>> Our whole business model forces me to just do things and not  
>> evaluate them!
>
> I think we had an idea for empirical stepwise advancement earlier.
>
> With an investment of developer time, we could separate the current  
> freenet code into three interfaced sections (link-layer, routing- 
> layer, user/client-interface-layer).
>
> If we then were to modify the outer layers to accept two routing- 
> layers (e.g. client requests round-robin between the two but  
> thereafter stay in that network) we could have "two networks in one": 
> a stable-net (for the nay-sayers, a disaster/fallback, and as a  
> control for measurement), and a development-net where  
> experimentation could take place.
>
> Drawing the interface lines on theory (rather than present code- 
> state) would be critical [e.g. load-balancing should be in the  
> middle layer, imo]. The goal being, reliable communication with near- 
> guaranteed/methodical improvement.
>
> --
> Robert Hailey
>
> ___
> Devl mailing list
> Devl at freenetproject.org
> http://freenetproject.org/cgi-bin/mailman/listinfo/devl

-- next part --
An HTML attachment was scrubbed...
URL: 



[freenet-dev] Beyond New Load Management

2011-08-30 Thread Thomas Bruderer

> The entire approach of coming up with hypotheses about what is wrong, 
> building a solution based on these hypotheses (without actually 
> confirming that the hypotheses are accurate) and deploying it is deja 
> vu, we've been doing it for a decade, and we still haven't got load 
> management right.  We're just layering more complexity onto a system 
> that we already don't understand, based on guesses as to what the 
> problems were with the previous iteration that we can't test because 
> the system is too complicated with too many interactions for anyone to 
> get their heads around it.
>

Thank you Ian! Good message! I am 100% behind your whole post! The 
routing must go back to a simple system!



[freenet-dev] Beyond New Load Management: A proposal

2011-08-30 Thread Arne Bab.
Von: "Matthew Toseland" ???
>But the other question is, can queueing ever be helpful? It can if it allows 
>us to route more accurately (which NLM clearly does), and/or to run enough 
>requests in parallel that the longer time taken for the request to reach its 
>destination is offset. Is this condition met?

Experience with the deployed NLM showed that even in the fully congested case 
it had success rates of 60% for HTL 18, 17 and 16, compared to less than 40% for 
OLM. This means that the requests are sent over fewer hops on average, because 
they find the content fewer hops away from the requester.

A download of 1MiB which is sent over 2 hops needs 2 MiB in total network 
bandwidth.
If it is sent over only 1.5 hops on average, then it needs only 1.5 MiB total 
network bandwidth.
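
A quick check of that arithmetic (total upload is the payload size times the
average number of hops the data is transferred over):

payload_mib = 1.0
for hops in (2.0, 1.5):
    print(hops, "hops ->", payload_mib * hops, "MiB total network bandwidth")
print("relative capacity:", 2.0 / 1.5)   # ≈ 1.33, i.e. roughly 30% more content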

So essentially NLM can distribute 30% more content with the same network 
resources¹. And these numbers are actual observations. The only reason why this 
did not result in increased performance is that the nodes used less than 50% of 
their allocated bandwidth² - which is a problem with the bandwidth scheduler 
and not with queueing.

Best wishes,
Arne

¹: The relevant network resource is upload bandwidth.
²: Source: observations from me and two other freenet users.

PS: How exactly the bandwidth limiter is fixed is an implementation detail. I 
think you are actually the only person who can judge how to do this most 
efficiently.

-- next part --
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 1743 bytes
Desc: S/MIME Cryptographic Signature
URL: 



[freenet-dev] Beyond New Load Management: A proposal

2011-08-30 Thread Arne Babenhauserheide
After discussing and going deeper into this, it became apparent that the 
problem is not overload of queues. I'll repeat the result I came to here (and 
not in chatlog form :) ): 

The problem is in the load limiter: 


1) we need to use all bandwidth, because latency depends on the number of  
finishing transfers.

2) SSKs fail 85% of the time.

3) the load limiter assumes that all SSKs succeed and reserves bandwidth for 
them.

4) this now bites us, because NLM has shorter transfer times and longer wait 
times. So the over-reservation of bandwidth for SSKs lasts longer. 

5) solution: count each SSK as only 
average_SSK_success_rate * data_to_transfer_on_success.


To (1): When we have less bandwidth, then fewer CHKs succeed per second, so 
fewer slots get freed and the queueing time is bigger.
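
A rough sketch of what (5) could look like in the load limiter's accounting
(illustrative Python, not the actual code; the sizes and rate are assumptions):

SSK_PAYLOAD_BYTES = 1024          # an SSK block carries roughly 1 KiB
CHK_PAYLOAD_BYTES = 32 * 1024     # a CHK block is 32 KiB
AVERAGE_SSK_SUCCESS_RATE = 0.15   # observed: ~85% of SSK requests fail

def reserved_bytes(is_ssk):
    # reserve only the expected transfer for SSKs, not the worst case
    if is_ssk:
        return AVERAGE_SSK_SUCCESS_RATE * SSK_PAYLOAD_BYTES
    return CHK_PAYLOAD_BYTES

def can_accept(is_ssk, reserved_so_far, liability_limit_bytes):
    return reserved_so_far + reserved_bytes(is_ssk) <= liability_limit_bytes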

Best wishes, 
Arne



On Monday, 29 August 2011, 22:27:47, Matthew Toseland wrote:
> Okay, I don't understand most of that, I might be able to check the math if
> it was written properly, but it looks difficult. However, as far as I can
> see: - The most obvious way to increase bandwidth usage would be to
> increase the timeout time for output bandwidth liability (and at the same
> time increase the relevant block transfer timeouts). - This would increase
> the number of slots but it would also increase the number of requests
> seeking them; I don't see why it would help matters. - Running an excessive
> number of requests without adjusting the block transfer timeouts would
> result in some of the transfers timing out. - It would also, as you
> mention, make SSK requests (especially failed SSK requests) even slower. -
> I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. -
> As far as I can see, all the benefits of NLM re attackers are achieved by
> fair sharing. - A system which is more resistant to attacks but slower
> probably isn't all that interesting if the attacks in question are
> relatively expensive anyway. - Smaller block sizes would have a significant
> efficiency cost, and would probably make load management more difficult.
> 
> I apologise if you see this as rather a broadside after I encouraged you to
> analyse the problem, but it is not yet a convincing demonstration that
> queueing is a viable strategy and not the spawn of satan! :)
> On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> > On Monday, 29 August 2011, 14:32:13, Ian Clarke wrote:
> > > Yes, small tweaks have worked so well for us for the last decade,
> > > leaving us pretty-much where we were in 2003.  No, we don't
> > > understand how the current system works, there is no point in
> > > trying to fix something when we don't even know what is broken.
> > 
> > I'd like to present a clue what is broken in NLM. Before I kill you with the
> > log, here's the result:
> > With NLM the latency of a request is a function of the raw bandwidth
> > (not so with OLM), and NLM used only half my bandwidth after it had
> > been deployed for 2 days (at the start much more).
> > 
> > τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue
> > time (time to find the node), nlm: new load management, olm: old load
> > management.
> > 
> > So first step: make sure all bandwidth gets used - maybe by allocating
> > more slots till we use all allowed bandwidth. Better having to throttle
> > a transfer than not using bandwidth.
> > 
> > *NLM should, with the current network, be slower than OLM by 23%. But in
> > 18 months it should actually be faster by ~8% - given Moore's Law holds for
> > upload bandwidth - because the routes are shorter.*
> > 
> > The main advantage of NLM is, that it should be much more resilient
> > against attackers (DoS).
> > 
> > Now to the log - it's math and not cleaned up; you have been warned :)
> > 
> >  SSK-time: σ, CHK-time: χ, success: Xs, fail: Xf.
> >  queue-time: q, transfer-time: τ, hops remaining: h, total
> > hops: h₀, w: success rate
> >  σs = τ(h) + q(h)
> >  σf = q(h)
> >  σ ~ wσ·σs + (1-wσ)·σf
> >  χs = τ(h) + q(h)
> >  χf = q(h)
> >  χ ~ wχ·χs + (1-wχ)·χf; wσ ~ 15%
> >  num(σ) / num(χ) ~ 1
> >  ⇒ time ~ σ + χ
> >  q(h) depends on timeouts, as do wσ and wχ
> >  time = wσ·σs + (1-wσ)·σf + wχ·χs + (1-wχ)·χf
> >  = wσ · (τ(h) + q(h)) + (1-wσ)·q(h) + wχ · (τ(h) + q(h)) + (1-
> > wχ)·q(h)
> >  = t(h) · (wσ+wχ) + 2·q(h) · (2-wσ-wχ)
> >  = τ(h) · (wσ+wχ) + 2·q(h) · (2-wσ-wχ)
> >  in the congestion case q(h) ~ timeout
> >  timeout = o
> >  timeout: o
> >  w depends on the timeout *somehow*, but inversely
> >  o=0 ⇒ w=0
> >  assumption: o = ∞ ⇒ wσ ~ 20%, wχ ~ 100%
> >  assumption: o = ∞ ⇒ wσ ~ 0.2, wχ ~ 1
> >  correction: in the congestion case: q(h) ~ min(timeout, τ(h))
> >  timeout matters for q(h) only when timeout < τ(h)
> >  I try to: I still need a dependency of w on timeout
> >  → let's call it t(w)
> >  better: w(o) :)
> >  well, if there is a timeout, we have a fixed time, but we reduce
> > the hops ...
> >  i 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-30 Thread Matthew Toseland
On Tuesday 30 Aug 2011 00:18:35 Matthew Toseland wrote:
> On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:
> > After discussing and going deeper into this, it became apparent that the 
> > problem is not overload of queues. I'll repeat the result I came to here 
> > (and 
> > not in chatlog form :) ): 
> > 
> > The problem is in the load limiter: 
> > 
> > 
> > 1) we need to use all bandwidth, because latency depends on the number of  
> > finishing transfers.
> > 
> > 2) SSKs fail 85% of the time.
> > 
> > 3) the load limiter assumes that all SSKs succeed and reserves bandwidth 
> > for 
> > them.
> > 
> > 4) this now bites us, because NLM has shorter transfer times and longer 
> > wait 
> > times. So the over-reservation of bandwidth for SSKs lasts longer. 
> > 
> > 5) solution: count each SSK as only 
> > average_SSK_success_rate * data_to_transfer_on_success.
> > 
> > 
> > To (1): When we have less bandwidth, then fewer CHKs succeed per second, so 
> > fewer slots get freed and the queueing time is bigger.
> 
> Another way to look at this:
> 
> Both me and Ian have been assuming that the resource is some number 
> representing a property of the node, for example, how many requests are 
> running. But the real resource is the external bandwidth usage, which is 
> rather different.
> 
> An alternative solution - if we are worried about bursts of requests 
> succeeding and therefore timing out - is to allocate a fixed chunk of 
> bandwidth for SSKs, and not let them interfere with CHKs. Then CHKs can have 
> long transfer times, and therefore long queueing times when they fail (but 
> this is only ~ 40% of the time with NLM's accurate routing), and SSKs can 
> have short transfer times, and therefore short queueing times. This may be 
> less disruptive and should also be relatively easy to implement.
> 
> But the other question is, can queueing ever be helpful? It can if it allows 
> us to route more accurately (which NLM clearly does), and/or to run enough 
> requests in parallel that the longer time taken for the request to reach its 
> destination is offset. Is this condition met?

Yes it is!

If a request takes P seconds without queueing, and P+Q seconds with queueing, 
then if we have the same number of requests running at once, we have a slowdown 
of (P+Q)/P. However, if we start (P+Q)/P times more requests, and the queueing 
time remains constant, we can compensate for this. 

Does it? 

There are two main components:
1) The queueing time. This is roughly proportional to (number of requests 
running) / (number of slots available). If we increase the capacity of the 
node, we multiply both the top and bottom by the same factor - so the queueing 
time is unchanged, unless the data transfer time is changed.
2) The data transfer time. In the above case with a slowdown, the data transfer 
time is the same, or a little shorter, because most of our bandwidth isn't 
being used. If we increase it by running more requests at once, it should reach 
where it was, and if we increase it further, transfers will start to take 
longer again.

ArneBab has analysed this in more detail and shown that the largest component 
of the queueing time is proportional to the data transfer time, hence, 
separating SSKs (which have a very short transfer time) from CHKs (which have a 
much longer transfer time) would help significantly.
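
A toy calculation of the compensation argument above (the numbers are made up):

P, Q = 10.0, 5.0                   # transfer time and queueing time, seconds
parallel_plain = 20                # requests in flight without queueing
slowdown = (P + Q) / P             # 1.5x longer per request
parallel_queued = parallel_plain * slowdown

print(parallel_plain / P)          # 2.0 requests/second without queueing
print(parallel_queued / (P + Q))   # 2.0 requests/second with queueing
# throughput is preserved as long as Q does not grow with the number of requests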

The key advantage of NLM, of course, is that it routes accurately, resulting in 
10%-30% improvements in success rates, and transferring data over significantly 
fewer hops (= more effective bandwidth).
> > 
> > Best wishes, 
> > Arne
> > 
> > 
> > 
> > On Monday, 29 August 2011, 22:27:47, Matthew Toseland wrote:
> > > Okay, I don't understand most of that, I might be able to check the math 
> > > if
> > > it was written properly, but it looks difficult. However, as far as I can
> > > see: - The most obvious way to increase bandwidth usage would be to
> > > increase the timeout time for output bandwidth liability (and at the same
> > > time increase the relevant block transfer timeouts). - This would increase
> > > the number of slots but it would also increase the number of requests
> > > seeking them; I don't see why it would help matters. - Running an 
> > > excessive
> > > number of requests without adjusting the block transfer timeouts would
> > > result in some of the transfers timing out. - It would also, as you
> > > mention, make SSK requests (especially failed SSK requests) even slower. -
> > > I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. 
> > > -
> > > As far as I can see, all the benefits of NLM re attackers are achieved by
> > > fair sharing. - A system which is more resistant to attacks but slower
> > > probably isn't all that interesting if the attacks in question are
> > > relatively expensive anyway. - Smaller block sizes would have a 
> > > significant
> > > efficiency cost, and would probably make load management more 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-30 Thread Matthew Toseland
On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:
> After discussing and going deeper into this, it became apparent that the 
> problem is not overload of queues. I'll repeat the result I came to here (and 
> not in chatlog form :) ): 
> 
> The problem is in the load limiter: 
> 
> 
> 1) we need to use all bandwidth, because latency depends on the number of  
>   finishing transfers.
> 
> 2) SSKs fail 85% of the time.
> 
> 3) the load limiter assumes that all SSKs succeed and reserves bandwidth for 
>   them.
> 
> 4) this now bites us, because NLM has shorter transfer times and longer wait 
>   times. So the over-reservation of bandwidth for SSKs lasts longer. 
> 
> 5) solution: count each SSK as only 
>   average_SSK_success_rate * data_to_transfer_on_success.
> 
> 
> To (1): When we have less bandwidth, then fewer CHKs succeed per second, so 
> fewer slots get freed and the queueing time is bigger.

Another way to look at this:

Both me and Ian have been assuming that the resource is some number 
representing a property of the node, for example, how many requests are 
running. But the real resource is the external bandwidth usage, which is rather 
different.

An alternative solution - if we are worried about bursts of requests succeeding 
and therefore timing out - is to allocate a fixed chunk of bandwidth for SSKs, 
and not let them interfere with CHKs. Then CHKs can have long transfer times, 
and therefore long queueing times when they fail (but this is only ~ 40% of the 
time with NLM's accurate routing), and SSKs can have short transfer times, and 
therefore short queueing times. This may be less disruptive and should also be 
relatively easy to implement.
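
A minimal sketch of that alternative (illustrative only; the 20% split is an
invented tuning knob, not a number from this thread):

def split_liability(total_limit_bytes, ssk_fraction=0.2):
    # give SSKs their own slice so short SSK transfers never wait behind CHKs
    ssk_limit = total_limit_bytes * ssk_fraction
    return ssk_limit, total_limit_bytes - ssk_limit

def can_accept(is_ssk, reserved_ssk, reserved_chk, ssk_limit, chk_limit, request_bytes):
    if is_ssk:
        return reserved_ssk + request_bytes <= ssk_limit
    return reserved_chk + request_bytes <= chk_limit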

But the other question is, can queueing ever be helpful? It can if it allows us 
to route more accurately (which NLM clearly does), and/or to run enough 
requests in parallel that the longer time taken for the request to reach its 
destination is offset. Is this condition met?
> 
> Best wishes, 
> Arne
> 
> 
> 
> On Monday, 29 August 2011, 22:27:47, Matthew Toseland wrote:
> > Okay, I don't understand most of that, I might be able to check the math if
> > it was written properly, but it looks difficult. However, as far as I can
> > see: - The most obvious way to increase bandwidth usage would be to
> > increase the timeout time for output bandwidth liability (and at the same
> > time increase the relevant block transfer timeouts). - This would increase
> > the number of slots but it would also increase the number of requests
> > seeking them; I don't see why it would help matters. - Running an excessive
> > number of requests without adjusting the block transfer timeouts would
> > result in some of the transfers timing out. - It would also, as you
> > mention, make SSK requests (especially failed SSK requests) even slower. -
> > I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. -
> > As far as I can see, all the benefits of NLM re attackers are achieved by
> > fair sharing. - A system which is more resistant to attacks but slower
> > probably isn't all that interesting if the attacks in question are
> > relatively expensive anyway. - Smaller block sizes would have a significant
> > efficiency cost, and would probably make load management more difficult.
> > 
> > I apologise if you see this as rather a broadside after I encouraged you to
> > analyse the problem, but it is not yet a convincing demonstration that
> > queueing is a viable strategy and not the spawn of satan! :)
> > On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> > > On Monday, 29 August 2011, 14:32:13, Ian Clarke wrote:
> > > > Yes, small tweaks have worked so well for us for the last decade,
> > > > leaving us pretty-much where we were in 2003.  No, we don't
> > > > understand how the current system works, there is no point in
> > > > trying to fix something when we don't even know what is broken.
> > > 
> > > I'd like to present a clue what is broken in NLM. Before I kill you with the
> > > log, here's the result:
> > >   With NLM the latency of a request is a function of the raw bandwidth
> > >   (not so with OLM), and NLM used only half my bandwidth after it had
> > >   been deployed for 2 days (at the start much more).
> > > 
> > > τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue
> > > time (time to find the node), nlm: new load management, olm: old load
> > > management.
> > > 
> > > So first step: make sure all bandwidth gets used - maybe by allocating
> > > more slots till we use all allowed bandwidth. Better having to throttle
> > > a transfer than not using bandwidth.
> > > 
> > > *NLM should, with the current network, be slower than OLM by 23%. But in
> > > 18 months it should actually be faster by ~8% - given Moore's Law holds for
> > > upload bandwidth - because the routes are shorter.*
> > > 
> > > The main advantage of NLM is, that it should be much more resilient
> > 

Re: [freenet-dev] Beyond New Load Management

2011-08-30 Thread Thomas Bruderer


The entire approach of coming up with hypotheses about what is wrong, 
building a solution based on these hypotheses (without actually 
confirming that the hypotheses are accurate) and deploying it is deja 
vu, we've been doing it for a decade, and we still haven't got load 
management right.  We're just layering more complexity onto a system 
that we already don't understand, based on guesses as to what the 
problems were with the previous iteration that we can't test because 
the system is too complicated with too many interactions for anyone to 
get their heads around it.




Thank you Ian! Good message! I am 100% behind your whole post! The 
routing must go back to a simple system!

___
Devl mailing list
Devl@freenetproject.org
http://freenetproject.org/cgi-bin/mailman/listinfo/devl


Re: [freenet-dev] Beyond New Load Management

2011-08-30 Thread Robert Hailey


On 2011/08/30 (Aug), at 3:17 AM, Thomas Bruderer wrote:

Thank you Ian! Good message! I am 100% behind your whole post! The  
routing must go back to a simple system!


Even tearing up the current systems for a fifo queue is destructive  
without organization and a means of comparison.


In the science of software, there are generally two kinds of commits:

* Features - divergent, potentially disruptive, tend to contain bugs
* Bugfixes - convergent, generally repairing feature disruption,  
rarely unmask other bugs


On 2011/08/29 (Aug), at 12:58 PM, Ian Clarke wrote:

The problem is that we come up with solutions that are too  
complicated to analyze or fix when they don't work
The cause is complexity, which just grows and grows as we try to fix  
problems we don't understand by layering on more complexity.


And what is the cause of that? Is the problem really one of behavior?  
emotions? workflow? organization?


Matthew has presented some very real problems which he is trying to  
work around (with much frustration, I'm sure). I think he needs more  
leverage


If you read my suggestion below, we can discuss how it would allow:

* a full-network-sized network to try out new features & bugfixes
* nearly-identical traffic on both networks
* zero traffic redundancy (network waste)
* easy 1:1 comparison of various performances of two implementations  
(for measuring network effectiveness)
* full network uptime (at worst a totally-broken / 0% unstable-side  
would yield a 50% reduction in effectiveness)


It is simply a matter of *organization* and *multi-versioning* (which  
IMO, are both solved problems).


This is also tied into the subject of 'mandatory' node update  
deadline, as the utility of a split network would diminish if its 
stable-side succumbs to the same 'chase' issues.


--
Robert Hailey


On 2011/08/26 (Aug), at 10:18 AM, Robert Hailey wrote:



On 2011/08/25 (Aug), at 2:15 PM, Matthew Toseland wrote:

And we never, ever, ever, have enough data to evaluate a single  
build, even on the simplest metrics (see the push-pull tests). I  
could write a plugin to get more data, but digger3 promises to do  
it eventually and anyway I don't have time given the remaining  
funding and unlikeliness of getting more. And it's always been this  
way!


Our whole business model forces me to just do things and not  
evaluate them!


I think we had an idea for empirical stepwise advancement earlier.

With an investment of developer time, we could separate the current  
freenet code into three interfaced sections (link-layer, routing- 
layer, user/client-interface-layer).


If we then were to modify the outer layers to accept two routing- 
layers (e.g. client requests round-robin between the two but  
thereafter stay in that network) we could have two networks in one: 
a stable-net (for the nay-sayers, a disaster/fallback, and as a  
control for measurement), and a development-net where  
experimentation could take place.


Drawing the interface lines on theory (rather than present code- 
state) would be critical [e.g. load-balancing should be in the  
middle layer, imo]. The goal being, reliable communication with near- 
guaranteed/methodical improvement.


--
Robert Hailey

___
Devl mailing list
Devl@freenetproject.org
http://freenetproject.org/cgi-bin/mailman/listinfo/devl


___
Devl mailing list
Devl@freenetproject.org
http://freenetproject.org/cgi-bin/mailman/listinfo/devl

Re: [freenet-dev] Beyond New Load Management

2011-08-30 Thread Ian Clarke
On Tue, Aug 30, 2011 at 11:49 AM, Robert Hailey
rob...@freenetproject.orgwrote:

 On 2011/08/29 (Aug), at 12:58 PM, Ian Clarke wrote:

 The problem is that we come up with solutions that are too complicated to
 analyze or fix when they don't work

 The cause is complexity, which just grows and grows as we try to fix
 problems we don't understand by layering on more complexity.


 And what is the cause of that? Is the problem really one of *behavior*?
 emotions? workflow? organization?


A combination, but at its core I think it's a failure to recognize that most
working systems are built on principles that are simple enough to understand
that they are either obviously correct, or can easily be proven to be
correct.  For example, AIMD, which is TCP's load management system, is
simple enough that you can describe the basic idea in a sentence.  It is
also fairly obvious that it will work, or at least it's easy to simulate it
and convince yourself that it will work.

In contrast, Freenet's current load management system is a combination of
mechanisms that are perhaps individually of comparable complexity to AIMD
(in fact, AIMD is part of it although used in quite a different context to
how it is used in TCP), but together they are far more complex, and interact
in ways that are hard to predict.

For example, in a conversation with Matthew a few days ago, he realized that
the fair sharing mechanism which tries to allocate resources fairly among
downstream nodes, was probably generating a bunch of rejected messages
when it transitions from one mode of operation to another, with all kinds of
negative consequences.  Who knows how many other unintended interactions
there are?


 Matthew has presented some very real problems which he is trying to work
 around (with much frustration, I'm sure). I think he needs more leverage


My criticism is that Matthew's proposals for fixing the problem follow
exactly the same pattern of all the previous fixes that turned out to have
either no effect, or a negative effect.  It starts out with a bunch of
hand-wavey hypotheses about what the problem might be, which is followed by
a bunch of fixes for what might be the problem.  There is no evidence that
these problems are real; I think a big part of the reason is that the
existing system is too complicated to debug.

I know it sounds like I'm beating up on Matthew here, and that isn't my
intention, he is following a precedent set by me and others that have long
since left the project.  Having (I hope) learned some tough lessons, and
recognized the folly in our past approach, I'm now hoping it's not too late
to rescue the project from an ignominious death.  I think we're talking
about weeks of work here, not months, and frankly I don't think we've got a
choice if the project is to survive.  Freenet is no use to anyone if it
doesn't work, regardless of how cool our goals are, or how clever our
algorithms and heuristics.


 If you read my suggestion below, we can discuss how it would allow:

 With an investment of developer time, we could separate the current freenet
 code into three interfaced sections (link-layer, routing-layer,
 user/client-interface-layer).

 If we then were to modify the outer layers to accept two routing-layers
 (e.g. client requests round-robin between the two but thereafter stay in
 that network) we could have two networks in one: a stable-net (for the
 nay-sayers, a disaster/fallback, and as a control for measurement), and a
 development-net where experimentation could take place.

 Drawing the interface lines on theory (rather than present code-state)
 would be critical [e.g. load-balancing should be in the middle layer, imo].
 The goal being, reliable communication with near-guaranteed/methodical
 improvement

 While I'm sure there is much room for improvement in the way the code is
architected, and the separation of concerns - I don't think refactoring is
the answer.  We need to refactor the fundamental way that we solve problems
like load balancing.

Ian.


-- 
Ian Clarke
Founder, The Freenet Project
Email: i...@freenetproject.org
___
Devl mailing list
Devl@freenetproject.org
http://freenetproject.org/cgi-bin/mailman/listinfo/devl

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-30 Thread Arne Babenhauserheide
On Tuesday, 30 August 2011, 01:08:16, Arne Babenhauserheide wrote:
 5) solution: count each SSK as only
   average_SSK_success_rate * data_to_transfer_on_success.

Some more data:

chances of having at least this many successful transfers for 40 SSKs with a
mean success rate of 16%:

for i in {0..16}; do echo $i $(./spielfaehig.py 0.16 40 $i); done

0 1.0
1 0.999064224991
2 0.99193451064
3 0.965452714478
4 0.901560126912
5 0.788987472629
6 0.634602118184
7 0.463062835467
8 0.304359825607
9 0.179664603573
10 0.0952149293922
11 0.0453494074947
12 0.0194452402752
13 0.00752109980912
14 0.0026291447461
15 0.000832100029072
16 0.00023879002726

what this means: if a SSK has a mean success rate of 0.16, then using 0.25 as
value makes sure that 95% of the possible cases don’t exhaust the bandwidth.
We then use only 64% of the bandwidth on average, though. With 0.2, we’d get
68% of the possible distributions safe and use 80% of bandwidth on average.

Note: this is just a binomial distribution; spielfähig computes the probability of at
least min_spieler successes:

from math import factorial
fac = factorial

def nük(n, k):
    # binomial coefficient: number of ways to choose k out of n
    if k > n: return 0
    return fac(n) / (fac(k)*fac(n-k))

def binom(p, n, k):
    # probability of exactly k successes out of n, each succeeding with probability p
    return nük(n, k) * p**k * (1-p)**(n-k)

def spielfähig(p, n, min_spieler):
    # probability of at least min_spieler successes out of n
    return sum([binom(p, n, k) for k in range(min_spieler, n+1)])
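
For reference, the table above can be reproduced directly from these functions (the
spielfaehig.py command-line wrapper itself isn't shown, so this is just the
equivalent loop):

# Same numbers as the table: P(at least i successes) for 40 SSKs at 16% success rate.
for i in range(17):
    print(i, spielfähig(0.16, 40, i))

# The figures quoted above come from this tail: counting each SSK as 0.25 budgets
# for 10 of the 40 transfers, counting it as 0.2 budgets for 8, and the rows around
# 10-11 and 8-9 give the corresponding risk of exhausting the bandwidth.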


→ USK@6~ZDYdvAgMoUfG6M5Kwi7SQqyS-
gTcyFeaNN1Pf3FvY,OSOT4OEeg4xyYnwcGECZUX6~lnmYrZsz05Km7G7bvOQ,AQACAAE/bab/9/Content-
D426DC7.html


Best wishes,
Arne


[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 20:32:13 Ian Clarke wrote:
> On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland <toad at amphibian.dyndns.org>
> wrote:
> 
> > On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > > Ok, thinking about it, here is a proposal, or  rather, the beginning of a
> > > proposal.  I'm assuming that we get rid of NLM, fair sharing, and
> > anything
> > > else intended to control load, and replace it with this.  We will
> > absolutely
> > > need to simulate this before we write a single line of code to deploy it.
> >
> > I have justified fair sharing in my other reply. However alternate
> > mechanisms might replace it.
> >
> > I agree we should simulate, although this might be significant work.
> >
> 
> I doubt it would be more work than trying it without simulation and failing
> for a decade, which was our previous approach.
> 
> 
> > > The core idea is that a node will include a floating point number in
> > > response to any kind of request showing how close that node is to being
> > > overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
> > > completely saturated and must reject requests.  Clearly the goal is to
> > avoid
> > > getting anywhere near to 1.0.
> > >
> > > A node tracks several things:
> >
> > All of these apply to both locally originated requests and remotely
> > originated requests? And are they for only our direct peers?
> >
> 
> Every peer along the reply path will take note of the load value, so it will
> be a mixture of the loads of close-by peers, and more distant peers.
>  Obviously, since we are more likely to receive messages from peers that are
> "closer", our average will automatically be biased towards closer peers.

So it is from the data source? Or every node along the way?
> 
> >- What is the average load reported by responses this node forwarded,
> > per
> > >remote node
> >
> > Ahhh, this one could be interesting - you could use it to penalise nodes
> > which spam excessively.
> >
> Exactly.
> 
> > > I think, given these metrics, we should be able to do the following:
> > >
> > >- Limit our overall rate of initiating local requests based on the
> > global
> > >average reported load
> >
> > We can already do this on the basis of AIMD's, based purely on behaviour of
> > local requests. However, taking more data into account might result in more
> > accurate results - although it might also dilute the feedback.
> 
> Right, AIMD tends to control the rate at which a specific event occurs, like
> a dropped IP packet with TCP, or a dropped request as we currently use it.
>  This is a bit like signaling that you have a headache by banging your head
> against the wall.  It is also a rather coarse way to deliver feedback.

Fair point.
> 
> A floating point load number conveys more information in a non-destructive
> way.
> 
> In any case, this part is "safe": it's not going to cause feedback problems,
> > on its own. Conversely, it doesn't deal with problems such as network
> > bottlenecks.
> 
> Well, it depends.  If we are tracking load on a per-outbound-connection
> basis then we can potentially take action to manage load per-connection in
> addition to overall.

Not sure I follow here. Routing by load is bad, unless it is severely limited.
> 
> > However, while AIMD's are based on the *global* average load (or rather the
> > load of all the nodes along the chain), it sounds like you are proposing to
> > only report the *local* load of our direct peers? How are you going to
> > compute the global average load?
> 
> No, its not the local load of our direct peers.  Its the load reported in
> all reply messages that we forward, some of those will be from our direct
> peers, but most will be from remote peers.
> 
> > >- Limit our rate of local requests based on the average load of the
> > >connection to the peer it would need to be forwarded to
> >
> > Not sure I follow here. Are you proposing that we initiate more requests in
> > keyspace areas where there is lower load?
> 
> Well, kinda, except the opposite.  We might reduce requests in keyspaces
> where there is higher load.
> 
> 
> > >- Detect when remote peers are abusing the system by disregarding load
> > -
> > >as evidenced by a significantly higher average load in replies
> > forwarded to
> > >them
> >
> > Okay, that is useful - although I doubt it would be an unambiguous
> > detection, and what would we do about it? We would have to reject requests,
> > and if we're not sure, or if it could be a downstream troublemaker, couldn't
> > that result in misrouting and make things worse? On the other hand, any
> > malicious node would be most easily detected by the nodes directly connected
> > to it, so this could work much the same way as fair sharing - provided that
> > it can reject or terminate the requests.
> 
> No, we probably wouldn't start rejecting requests - more likely we would
> disconnect from that node.  Obviously we'd need to be fairly sure of abusive
> 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
Okay, I don't understand most of that; I might be able to check the math if it
was written properly, but it looks difficult. However, as far as I can see:
- The most obvious way to increase bandwidth usage would be to increase the 
timeout time for output bandwidth liability (and at the same time increase the 
relevant block transfer timeouts); see the sketch after this list.
- This would increase the number of slots but it would also increase the number 
of requests seeking them; I don't see why it would help matters.
- Running an excessive number of requests without adjusting the block transfer 
timeouts would result in some of the transfers timing out.
- It would also, as you mention, make SSK requests (especially failed SSK 
requests) even slower.
- I am quite confident that Moore's Law DOES NOT hold for upload bandwidth.
- As far as I can see, all the benefits of NLM re attackers are achieved by 
fair sharing.
- A system which is more resistant to attacks but slower probably isn't all 
that interesting if the attacks in question are relatively expensive anyway.
- Smaller block sizes would have a significant efficiency cost, and would 
probably make load management more difficult.
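
For reference, the bandwidth-liability point in the first item above behaves
roughly like this (a back-of-the-envelope sketch with invented numbers, not the
node's actual accounting):

def liability_slots(output_bandwidth_bps, timeout_s, transfer_bytes=32*1024):
    # Requests we can accept if each accepted request might still have to send
    # a full block of transfer_bytes within timeout_s at output_bandwidth_bps.
    return int(output_bandwidth_bps * timeout_s / transfer_bytes)

# Doubling the liability timeout doubles the slots, but as noted above the number
# of requests seeking those slots grows with it, so it is not an obvious win.
print(liability_slots(15000, 120), liability_slots(15000, 240))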

I apologise if you see this as rather a broadside after I encouraged you to 
analyse the problem, but it is not yet a convincing demonstration that queueing 
is a viable strategy and not the spawn of satan! :)

On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke:
> > Yes, small tweaks have worked so well for us for the last decade, leaving us
> > pretty-much where we were in 2003.  No, we don't understand how the current
> > system works, there is no point in trying to fix something when we don't
> > even know what is broken.
> 
> I'd like to present a clue as to what is broken in NLM. Before I kill you with 
> the log, here's the result: 
> 
>   With NLM the latency of a request is a function of the raw bandwidth 
>   (not so with OLM), and NLM used only half my bandwidth after it had 
>   been deployed for 2 days (at the start much more).
> 
> τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue time 
> (time to find the node), nlm: new load management, olm: old load management.
> 
> So first step: make sure all bandwidth gets used - maybe by allocating more 
> slots till we use all allowed bandwidth. Better to have to throttle a transfer 
> than not to use the bandwidth.
> 
> *With the current network, NLM should be slower than OLM by 23%. But in 18 
> months it should actually be faster by ~8% - given Moore's Law holds for 
> upload bandwidth - because the routes are shorter.*
> 
> The main advantage of NLM is that it should be much more resilient against 
> attackers (DoS).
> 
> Now to the log - it's math and not cleaned up; you have been warned :)
> 
>  SSK-time: σ, CHK-time: χ, success: Xs, fail: Xf. 
>  queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, 
> w: success rate
>  χs = τ(h) + q(h)
>  χf = q(h)
>  χ ~ w₁·χs + (1-w₁)·χf
>  σs = τ(h) + q(h)
>  σf = q(h)
>  σ ~ w₂·σs + (1-w₂)·σf; w₂ ~ 15%
>  num(χ) / num(σ) ~ 1
>  → time ~ χ + σ
>  q(h) depends on timeouts, as do w₁ and w₂
>  time = w₁·χs + (1-w₁)·χf + w₂·σs + (1-w₂)·σf
>  = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-w₂)·q(h)
>  = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
>  = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
>  in the congestion case q(h) ~ timeout
>  timeout = o
>  timeout: o
>  w depends on the timeout *somehow*, but inversely
>  o=0 → w=0
>  assumption: o = ∞ → w₁ ~ 20%, w₂ ~ 100%
>  assumption: o = ∞ → w₁ ~ 0.2, w₂ ~ 1
>  correction: in the congestion case: q(h) ~ min(timeout, τ(h))
>  timeout matters for q(h) only when timeout < τ(h)
>  I try to: I still need a dependency of w on timeout
>  → let's call it t(w)
>  better: w(o) :)
>  well, if there is a timeout, we have a fixed time, but we reduce the 
> hops ...
>  i thought w was success rate
>  ah! 
>  and the success rates were in the NLM stats
>  going mostly smoothly from 60% to 0%
>  for the HTL
>  right, success rate peaks at 18 or sometimes 16
>  what are w1 vs w2?
>  chk vs ssk i guess
>  yes
> -*- toad_ thinks considering both is probably overambitious at this stage …
>  should not be too bad: SSKs drop much more rapidly at decreasing 
> hops
>  hops = HTL
>  χs is time for a successful chk; χf is time for a failed chk ... in 
> which case h in the first instance is low, and in the second instance is h0
>  yes
>  okay, i don't follow this line: time = w₁·χs + (1-w₁)·χf + w₂·σs + 
> (1-w₂)·σf
>  i thought w2 related to SSKs?
>  uh, yes?
>  time = w₁·χs + (1-w₁)·χf + w₂·σs + (1-w₂)·σf
>  you have to appreciate i'm only just getting back into maths and 
> physics after 12 years ...
>  (retaking a-levels to get a degree)
>  no probs, I'm also no expert in this. I try to get a relation 
> between the time and the timeout, so we can try to find a minimum
>  in any case, there are two different h's for the 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Arne Babenhauserheide
Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke:
> Yes, small tweaks have worked so well for us for the last decade, leaving us
> pretty-much where we were in 2003.  No, we don't understand how the current
> system works, there is no point in trying to fix something when we don't
> even know what is broken.

I'd like to present a clue as to what is broken in NLM. Before I kill you with the 
log, here's the result: 

	With NLM the latency of a request is a function of the raw bandwidth 
	(not so with OLM), and NLM used only half my bandwidth after it had 
	been deployed for 2 days (at the start much more).

τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue time 
(time to find the node), nlm: new load management, olm: old load management.

So first step: make sure all bandwidth gets used - maybe by allocating more 
slots till we use all allowed bandwidth. Better to have to throttle a transfer 
than not to use the bandwidth.

*With the current network, NLM should be slower than OLM by 23%. But in 18 
months it should actually be faster by ~8% - given Moore's Law holds for upload 
bandwidth - because the routes are shorter.*

The main advantage of NLM is that it should be much more resilient against 
attackers (DoS).
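
For illustration, the model in the log below can be evaluated directly (a minimal
sketch; the success rates and times are the rough figures from the discussion, not
measurements):

# Expected request time: success-weighted (transfer + queue) plus queue time on failure.
def expected_time(w, transfer, queue_success, queue_fail):
    return w * (transfer + queue_success) + (1 - w) * queue_fail

# Rough figures: CHK success ~20%, SSK success ~15%, transfer ~60s, queueing ~16s
# under OLM versus queueing on the order of the transfer time under NLM.
for name, q in (("OLM-ish", 16.0), ("NLM-ish", 60.0)):
    chk = expected_time(0.20, 60.0, q, q)
    ssk = expected_time(0.15, 60.0, q, q)
    print(name, "CHK:", chk, "s", "SSK:", ssk, "s")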


Now to the log - it's math and not cleaned up; you have been warned :)

 SSK-time: σ, CHK-time: χ, success: Xs, fail: Xf. 
 queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, 
w: success rate
 χs = τ(h) + q(h)
 χf = q(h)
 χ ~ w₁·χs + (1-w₁)·χf
 σs = τ(h) + q(h)
 σf = q(h)
 σ ~ w₂·σs + (1-w₂)·σf; w₂ ~ 15%
 num(χ) / num(σ) ~ 1
 → time ~ χ + σ
 q(h) depends on timeouts, as do w₁ and w₂
 time = w₁·χs + (1-w₁)·χf + w₂·σs + (1-w₂)·σf
 = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-w₂)·q(h)
 = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
 = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
 in the congestion case q(h) ~ timeout
 timeout = o
 timeout: o
 w depends on the timeout *somehow*, but inversely
 o=0 → w=0
 assumption: o = ∞ → w₁ ~ 20%, w₂ ~ 100%
 assumption: o = ∞ → w₁ ~ 0.2, w₂ ~ 1
 correction: in the congestion case: q(h) ~ min(timeout, τ(h))
 timeout matters for q(h) only when timeout < τ(h)
 I try to: I still need a dependency of w on timeout
 → let's call it t(w)
 better: w(o) :)
 well, if there is a timeout, we have a fixed time, but we reduce the 
hops ...
 i thought w was success rate
 ah! 
 and the success rates were in the NLM stats
 going mostly smoothly from 60% to 0%
 for the HTL
 right, success rate peaks at 18 or sometimes 16
 what are w1 vs w2?
 chk vs ssk i guess
 yes
-*- toad_ thinks considering both is probably overambitious at this stage …
 should not be too bad: SSKs drop much more rapidly at decreasing 
hops
 hops = HTL
 χs is time for a successful chk; χf is time for a failed chk ... in 
which case h in the first instance is low, and in the second instance is h0
 yes
 okay, i don't follow this line: time = w₁·χs + (1-w₁)·χf + w₂·σs + 
(1-w₂)·σf
 i thought w2 related to SSKs?
 uh, yes?
 time = w₁·χs + (1-w₁)·χf + w₂·σs + (1-w₂)·σf
 you have to appreciate i'm only just getting back into maths and 
physics after 12 years ...
 (retaking a-levels to get a degree)
 no probs, I'm also no expert in this. I try to get a relation 
between the time and the timeout, so we can try to find a minimum
 in any case, there are two different h's for the two uses of q(h) - h0 
and h_avg
 h_avg for success and h0 for failure
 hm, yes
 which makes this harder? 
 it's wrong anyway … the q(h_avg) was missing
 h_avg is somewhere between 5 and 10 imho
 at least it is if everything is working well and the input load isn't 
all really popular stuff (in which case it's answered quickly and can be 
ignored)
 = τ(h) · (w₁+w₂) + q(h) · (2-w₁-w₂) + q(h_avg) · (w₁+w₂)
 would have been too easy :)
 okay so here q(h) means q(h0) i.e. h = h0, max hops?
 jepp, and max hops sinks with falling timeout
 hmm?
 the max actual hops
 on the upside, q() is linear
<-- Torgal (~Torgal at 78.251.49.8) has left the network (Ping timeout: 276 
seconds)
 hopefully
 yes: q(h) = h·o
 (in the congestion case)
 the problem i have is it looks like q(1) ~= time [ for a full request 
], unless load is very low
 so τ ≪ q?
 τ much smaller than q?
 of course it's bounded by timeouts, but i'd expect a runaway feedback 
loop until it reaches heavy timeouts and effectively cuts the htl
 well, with OLM, success time for a CHK is 1m25s, unsuccessful is 
19sec, so transfer time is at least 1 minute
 and less than 1m25; but with NLM, unsuccessful is 3 min+
 well, for SSKs in OLM the first 6 hops are the most successful, later 
ones only contribute 1% success, which piles up to ~ 12%
 okay...
 (only possible because 85% are unsuccessful)
 (otherwise this would be wrong: the contribution of later ones would 
be smaller)
 from the numbers q(h₀) ~ τ(h_avg)
 well, queueing time on any hop is the time it takes to get a slot to 
route to, which is roughly equal to the time it takes 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 20:09:58 Matthew Toseland wrote:
> On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > Ok, thinking about it, here is a proposal, or  rather, the beginning of a
> > proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
> > else intended to control load, and replace it with this.  We will absolutely
> > need to simulate this before we write a single line of code to deploy it.
> 
> I have justified fair sharing in my other reply. However alternate mechanisms 
> might replace it.
> 
> I agree we should simulate, although this might be significant work.
> > 
> > The core idea is that a node will include a floating point number in
> > response to any kind of request showing how close that node is to being
> > overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
> > completely saturated and must reject requests.  Clearly the goal is to avoid
> > getting anywhere near to 1.0.
> > 
> > A node tracks several things:
> 
> All of these apply to both locally originated requests and remotely 
> originated requests? And are they for only our direct peers?
> > 
> >- What is the overall average load reported by responses this node has
> >received
> >- What is the average load reported by responses this node has received,
> >per remote node
> >- What is the average load reported by responses this node forwarded, per
> >remote node
> 
> Ahhh, this one could be interesting - you could use it to penalise nodes 
> which spam excessively.

Actually, thinking about it, I don't think this will work. Requests are usually 
distributed evenly across the keyspace, and downstream nodes only know they are 
from this node - not which previous node they are from. So the pain will be 
shared across all the peers sending us requests, including our own local 
requests, and won't be identifiable as due to any single node. You have to do 
it by volume, not by reported load.



[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> Ok, thinking about it, here is a proposal, or  rather, the beginning of a
> proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
> else intended to control load, and replace it with this.  We will absolutely
> need to simulate this before we write a single line of code to deploy it.

I have justified fair sharing in my other reply. However alternate mechanisms 
might replace it.

I agree we should simulate, although this might be significant work.
> 
> The core idea is that a node will include a floating point number in
> response to any kind of request showing how close that node is to being
> overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
> completely saturated and must reject requests.  Clearly the goal is to avoid
> getting anywhere near to 1.0.
> 
> A node tracks several things:

All of these apply to both locally originated requests and remotely originated 
requests? And are they for only our direct peers?
> 
>- What is the overall average load reported by responses this node has
>received
>- What is the average load reported by responses this node has received,
>per remote node
>- What is the average load reported by responses this node forwarded, per
>remote node

Ahhh, this one could be interesting - you could use it to penalise nodes which 
spam excessively.
> 
> I think, given these metrics, we should be able to do the following:
> 
>- Limit our overall rate of initiating local requests based on the global
>average reported load

We can already do this on the basis of AIMD's, based purely on behaviour of 
local requests. However, taking more data into account might result in more 
accurate results - although it might also dilute the feedback.

In any case, this part is "safe": it's not going to cause feedback problems, on 
its own. Conversely, it doesn't deal with problems such as network bottlenecks.

However, while AIMD's are based on the *global* average load (or rather the 
load of all the nodes along the chain), it sounds like you are proposing to 
only report the *local* load of our direct peers? How are you going to compute 
the global average load?

>- Limit our rate of local requests based on the average load of the
>connection to the peer it would need to be forwarded to

Not sure I follow here. Are you proposing that we initiate more requests in 
keyspace areas where there is lower load?

>- Detect when remote peers are abusing the system by disregarding load -
>as evidenced by a significantly higher average load in replies forwarded to
>them

Okay, that is useful - although I doubt it would be an unambiguous detection, 
and what would we do about it? We would have to reject requests, and if we're 
not sure, or if it could be a downstream troublemaker, couldn't that result in 
misrouting and make things worse? On the other hand, any malicious node would 
be most easily detected by the nodes directly connected to it, so this could 
work much the same way as fair sharing - provided that it can reject or 
terminate the requests.

Arguably fair sharing achieves the same objective. It could be extended in 
various ways e.g. if we are relaying a lot of slow-down notices to a peer we 
could start to think about squeezing the peer's allocation.
> 
> Of course, lots of questions:
> 
>- How do we compute the averages?  A decaying running average of some
>kind?  What time period?

Yay more tunables. :(

>- How do we translate load into a desired rate of requests?

And even more tunables. :(

>- What criteria indicate that a peer is abusing the system?  What is the
>remedy?

In the current system, don't accept their requests.

In ANY system, there will always be a borderline below which they can get away 
with it. We can either try to make this as small as possible or try not to hit 
legitimate traffic too hard.
> 
> This is basically control theory stuff, and we'll need robust answers to
> these questions before we proceed (ideally with a theoretical rather than
> experimental foundation).

IMHO small tweaks to AIMDs, combined with fair sharing, will achieve much the 
same without having to wait years for our theoreticians to wake up.



[freenet-dev] Beyond New Load Management

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 18:58:26 Ian Clarke wrote:
> On Mon, Aug 29, 2011 at 12:37 PM, Matthew Toseland <
> toad at amphibian.dyndns.org> wrote:
> 
> > That is because we do not have the time or funding to empirically test
> > hypotheses. We don't gather enough data, we don't have a huge testnet to try
> > stuff on over extended periods, and so on. Most software development has
> > deadlines and resource limitations, but most software development doesn't
> > NEED to be empirical, at least not very often! See my other mail about this.
> 
> I don't think the problem is a lack of time or funding.  The problem is that
> we come up with solutions that are too complicated to analyze or fix when
> they don't work.  Throwing more time and money at that might help, but that
> is addressing the symptom, not the cause.  The cause is complexity, which
> just grows and grows as we try to fix problems we don't understand by
> layering on more complexity.

Perhaps. Ripping everything out and starting again is always tempting, 
especially when subordinates have been doing things without asking you. :)
> 
> > Misrouting is unacceptable, in general. Extremely overloaded or extremely
> > low capacity nodes may be routed around. We might even allow some bounded
> > amount of misrouting in the more general case (e.g. go to either of the top
> > two peers for the key). But in general, transforming load into misrouting
> > (or into reduced HTL, or any other bogus escape valve) is a bad idea. We
> > need to reduce the incoming load.
> 
> Right, the same is true of queueing.  If nodes are forced to do things to
> deal with overloading that make the problem worse then the load balancing
> algorithm has failed.  Its job is to prevent that from happening.

Okay.
> 
> > > What if all nodes told other nodes that they were overloaded because all
> > > their peers are overloaded.  Such a situation would basically cause the
> > > entire network to think it was overloaded, even though nobody actually
> > was!
> > >  It becomes a bit like this cartoon: http://flic.kr/p/5npfm2  How can
> > this
> > > be avoided?
> >
> > Exactly. There are two possible approaches:
> > 1. Nodes control the rate at which *local* requests are initiated, based on
> > how many slow-down signals they get.
> >
> > (This ensures that such gridlock or feedback loop problems can't happen)
> 
> Ah, so slow-downs only ever affect local request initiation?  The problem
> there is that if a peer is overloaded, only those peers that are directly
> connected to it will slow down their rate of requests.  Peers elsewhere on
> the network will continue to fire requests at it, and there will be no way
> for them to know they need to slow down :-/

No, because the slow-downs are propagated back to the original request 
originator, and only the request originator takes notice of them. That's how it 
works now.
> 
> Unless... what if the DataReply message (or whatever it's called these days)
> contains a slow-down request, and it's respected by every peer along that
> path according to AIMD?  The problem is that there will be a very weak link
> between the rate of requests, and the rate of slow-down messages.  It would
> be like having 10,000 individual heaters with thermostats all trying to
> regulate the temperature of a gigantic space.  Each individual
> thermostat/heater will have an almost imperceptible impact.  Although
> perhaps that would work...?

The rest is plausible - feedback might be too slow, but this can be helped in 
part by an increased window size between the point at which we start sending 
slow-down messages and the point at which things start breaking more badly.
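
A minimal sketch of that window (the thresholds are made-up numbers, not anything
in the current code): start signalling slow-down well below the level at which the
node actually has to reject.

# Hypothetical two-threshold scheme: warn early, reject only near saturation.
SLOW_DOWN_THRESHOLD = 0.6   # assumed level at which we start attaching slow-down hints
REJECT_THRESHOLD = 0.95     # assumed level at which we hard-reject

def handle_request(current_load):
    # Returns (accepted, slow_down_hint) for a request given local load in [0, 1].
    if current_load >= REJECT_THRESHOLD:
        return False, True
    return True, current_load >= SLOW_DOWN_THRESHOLD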
> 
> Fair sharing can limit this. Limiting backoff so that we only misroute when
> > nodes are severely overloaded can also help. We could queue (provided this
> > happens only rarely and briefly), allow bounded misrouting (e.g. route to
> > one of the top two routes), or terminate requests (thus freeing up
> > resources).
> 
> I think "fair sharing" is one of those complexities that we need to go back
> to the drawing board on.  Its clear that even now we have hypotheses about
> how fair sharing could be causing problems.  Its too complicated to really
> reason about, or to anticipate all of its implications.

It is vital for:
1. Security: It quashes DoS attacks/floods quickly. Without it, any attacker 
can easily swamp any node it is connected to, using all of its capacity for 
sending its flood onward. With it, they are limited to a fraction of the node's 
capacity. This is a concern that load management must deal with!
2. Nodes with low numbers of peers, and slower nodes in general, to get a look 
in, rather than be rejected at the same rate as everyone else.
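
A toy sketch of point 1 (the capacity split is invented for illustration; the real
fair sharing code is more involved): each peer's accepted requests are capped at a
fraction of the node's capacity, so a flooding peer mostly hurts itself.

# Toy per-peer fair sharing: cap each peer at its share of total capacity.
def fair_share_accept(pending_by_peer, peer, capacity):
    # pending_by_peer: dict peer -> requests currently accepted from that peer.
    fair_share = capacity / max(1, len(pending_by_peer))
    return pending_by_peer.get(peer, 0) < fair_share

pending = {"flooder": 80, "peer_a": 3, "peer_b": 5}
print(fair_share_accept(pending, "flooder", 100))  # False: already over its share
print(fair_share_accept(pending, "peer_a", 100))   # True: well under its share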
> 
> > - The easiest way to implement #1 is with AIMD's. We can keep them, but
> > send slow-down messages earlier.
> > - We could keep NLM, i.e. queue for limited times to smooth out delays.
> 
> Doesn't that violate the principle that the load 

[freenet-dev] Beyond New Load Management

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 18:37:39 Matthew Toseland wrote:
> On Saturday 27 Aug 2011 21:35:58 Ian Clarke wrote:
> > Matthew,
> > 
> > This makes sense from the perspective of making incremental changes, but I
> > think we need to be more drastic than that.  I think we need to go back to
> > the drawing board with load management.  We need to find a solution that is
> > simple enough to reason about, and to debug if we have problems with it.
> > 
> > The entire approach of coming up with hypotheses about what is wrong,
> > building a solution based on these hypotheses (without actually confirming
> > that the hypotheses are accurate) and deploying it is deja vu, we've been
> > doing it for a decade, and we still haven't got load management right.
> >  We're just layering more complexity onto a system that we already don't
> > understand, based on guesses as to what the problems were with the previous
> > iteration that we can't test because the system is too complicated with too
> > many interactions for anyone to get their heads around it.
> 
> That is because we do not have the time or funding to empirically test 
> hypotheses. We don't gather enough data, we don't have a huge testnet to try 
> stuff on over extended periods, and so on. Most software development has 
> deadlines and resource limitations, but most software development doesn't 
> NEED to be empirical, at least not very often! See my other mail about this.
> > 
> > If something isn't working, and we don't understand for definite what is
> > wrong with it, then we shouldn't build more on top of it in the hope that
> > we'll accidentally solve the problem, we should replace it with something
> > simple enough that we do understand it, right?
> > 
> > The purpose of load management is relatively simple: *Don't allow clients to
> > pump more requests into the network than the network can handle, while
> > ensuring that this workload is distributed across the network in a
> > reasonably efficient manner.  This must be done in a decentralized way that
> > is not vulnerable to abuse.*
> > *
> > The current load management system includes many different interacting
> > components that make it nearly impossible to understand or debug.  I think
> > we need to go back to the drawing board with load management, starting from
> > the goal I cite above.
> > 
> > I would invite people to suggest the simplest possible load management
> > schemes that might work, let's then discuss them, and figure out which is
> > most likely to work, and if it doesn't work, which will be easiest to debug.
> > 
> > We can bear in mind a few lessons we've learned though.  What
> > characteristics should our load balancing system have?
> > 
> >- Dropping requests should be extremely rare because this
> >just exacerbates overloading
> >- Delaying requests should also be extremely rare for the same reason
> >- Misrouting requests should be limited, but perhaps acceptable
> >occasionally, for the same reason
> 
> Misrouting is unacceptable, in general. Extremely overloaded or extremely low 
> capacity nodes may be routed around. We might even allow some bounded amount 
> of misrouting in the more general case (e.g. go to either of the top two 
> peers for the key). But in general, transforming load into misrouting (or 
> into reduced HTL, or any other bogus escape valve) is a bad idea. We need to 
> reduce the incoming load.
> > 
> > It therefore really comes down to nodes anticipating whether the network is
> > in danger of having to drop, delay, or misroute requests, and reduce the
> > rate at which they are pumping requests into the network if this danger
> > point is reached.
> > 
> > So how do nodes inform each-other that they are at risk of having to drop,
> > delay, or misroute requests?  There are two reasons this might happen.
> >  Either a node itself is close to its own capacity for relaying requests, or
> > the other nodes it relays to are themselves at or close to capacity.
> > 
> > One problem I'm concerned about when nodes share information about how
> > overloaded their peers are is a "gridlock":
> > 
> > What if all nodes told other nodes that they were overloaded because all
> > their peers are overloaded.  Such a situation would basically cause the
> > entire network to think it was overloaded, even though nobody actually was!
> >  It becomes a bit like this cartoon: http://flic.kr/p/5npfm2  How can this
> > be avoided?
> 
> Exactly. There are two possible approaches:
> 1. Nodes control the rate at which *local* requests are initiated, based on 
> how many slow-down signals they get.
> 
> (This ensures that such gridlock or feedback loop problems can't happen)
> 
> The main problems are security:
> - Deliberate flooding: Malicious nodes not following AIMD's, in order to clog 
> up the network.
> 
> Fair sharing can limit this. Limiting backoff so that we only misroute when 
> nodes are severely overloaded can also help. We could queue 

[freenet-dev] Beyond New Load Management

2011-08-29 Thread Matthew Toseland
On Saturday 27 Aug 2011 21:35:58 Ian Clarke wrote:
> Matthew,
> 
> This makes sense from the perspective of making incremental changes, but I
> think we need to be more drastic than that.  I think we need to go back to
> the drawing board with load management.  We need to find a solution that is
> simple enough to reason about, and to debug if we have problems with it.
> 
> The entire approach of coming up with hypotheses about what is wrong,
> building a solution based on these hypotheses (without actually confirming
> that the hypotheses are accurate) and deploying it is deja vu, we've been
> doing it for a decade, and we still haven't got load management right.
>  We're just layering more complexity onto a system that we already don't
> understand, based on guesses as to what the problems were with the previous
> iteration that we can't test because the system is too complicated with too
> many interactions for anyone to get their heads around it.

That is because we do not have the time or funding to empirically test 
hypotheses. We don't gather enough data, we don't have a huge testnet to try 
stuff on over extended periods, and so on. Most software development has 
deadlines and resource limitations, but most software development doesn't NEED 
to be empirical, at least not very often! See my other mail about this.
> 
> If something isn't working, and we don't understand for definite what is
> wrong with it, then we shouldn't build more on top of it in the hope that
> we'll accidentally solve the problem, we should replace it with something
> simple enough that we do understand it, right?
> 
> The purpose of load management is relatively simple: *Don't allow clients to
> pump more requests into the network than the network can handle, while
> ensuring that this workload is distributed across the network in a
> reasonably efficient manner.  This must be done in a decentralized way that
> is not vulnerable to abuse.*
> *
> The current load management system includes many different interacting
> components that make it nearly impossible to understand or debug.  I think
> we need to go back to the drawing board with load management, starting from
> the goal I cite above.
> 
> I would invite people to suggest the simplest possible load management
> schemes that might work, let's then discuss them, and figure out which is
> most likely to work, and if it doesn't work, which will be easiest to debug.
> 
> We can bear in mind a few lessons we've learned though.  What
> characteristics should our load balancing system have?
> 
>- Dropping requests should be extremely rare because this
>just exacerbates overloading
>- Delaying requests should also be extremely rare for the same reason
>- Misrouting requests should be limited, but perhaps acceptable
>occasionally, for the same reason

Misrouting is unacceptable, in general. Extremely overloaded or extremely low 
capacity nodes may be routed around. We might even allow some bounded amount of 
misrouting in the more general case (e.g. go to either of the top two peers for 
the key). But in general, transforming load into misrouting (or into reduced 
HTL, or any other bogus escape valve) is a bad idea. We need to reduce the 
incoming load.
> 
> It therefore really comes down to nodes anticipating whether the network is
> in danger of having to drop, delay, or misroute requests, and reduce the
> rate at which they are pumping requests into the network if this danger
> point is reached.
> 
> So how do nodes inform each-other that they are at risk of having to drop,
> delay, or misroute requests?  There are two reasons this might happen.
>  Either a node itself is close to its own capacity for relaying requests, or
> the other nodes it relays to are themselves at or close to capacity.
> 
> One problem I'm concerned about when nodes share information about how
> overloaded their peers are is a "gridlock":
> 
> What if all nodes told other nodes that they were overloaded because all
> their peers are overloaded.  Such a situation would basically cause the
> entire network to think it was overloaded, even though nobody actually was!
>  It becomes a bit like this cartoon: http://flic.kr/p/5npfm2  How can this
> be avoided?

Exactly. There are two possible approaches:
1. Nodes control the rate at which *local* requests are initiated, based on how 
many slow-down signals they get.

(This ensures that such gridlock or feedback loop problems can't happen)

The main problems are security:
- Deliberate flooding: Malicious nodes not following AIMD's, in order to clog 
up the network.

Fair sharing can limit this. Limiting backoff so that we only misroute when 
nodes are severely overloaded can also help. We could queue (provided this 
happens only rarely and briefly), allow bounded misrouting (e.g. route to one 
of the top two routes), or terminate requests (thus freeing up resources).

- Opportunistic flooding: Users hack their nodes to not follow AIMD's 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
On Mon, Aug 29, 2011 at 2:21 PM, Matthew Toseland  wrote:

>  > >- What is the average load reported by responses this node
> forwarded, per
> > >remote node
> >
> > Ahhh, this one could be interesting - you could use it to penalise nodes
> which spam excessively.
>
> Actually, thinking about it, I don't think this will work. Requests are
> usually distributed evenly across the keyspace, and downstream nodes only
> know they are from this node - not which previous node they are from. So the
> pain will be shared across all the peers sending us requests, including our
> own local requests, and won't be identifiable as due to any single node. You
> have to do it by volume, not by reported load.
>

But if a node is being abusive, won't the nodes it's requesting from tend to
have a higher than average load?  Perhaps you are right though; this is
the kind of thought process we need.

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: ian at freenetproject.org



[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland  wrote:

> On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > Ok, thinking about it, here is a proposal, or  rather, the beginning of a
> > proposal.  I'm assuming that we get rid of NLM, fair sharing, and
> anything
> > else intended to control load, and replace it with this.  We will
> absolutely
> > need to simulate this before we write a single line of code to deploy it.
>
> I have justified fair sharing in my other reply. However alternate
> mechanisms might replace it.
>
> I agree we should simulate, although this might be significant work.
>

I doubt it would be more work than trying it without simulation and failing
for a decade, which was our previous approach.


> > The core idea is that a node will include a floating point number in
> > response to any kind of request showing how close that node is to being
> > overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
> > completely saturated and must reject requests.  Clearly the goal is to
> avoid
> > getting anywhere near to 1.0.
> >
> > A node tracks several things:
>
> All of these apply to both locally originated requests and remotely
> originated requests? And are they for only our direct peers?
>

Every peer along the reply path will take note of the load value, so it will
be a mixture of the loads of close-by peers, and more distant peers.
 Obviously, since we are more likely to receive messages from peers that are
"closer", our average will automatically be biased towards closer peers.

>- What is the average load reported by responses this node forwarded,
> per
> >remote node
>
> Ahhh, this one could be interesting - you could use it to penalise nodes
> which spam excessively.
>

Exactly.


> > I think, given these metrics, we should be able to do the following:
> >
> >- Limit our overall rate of initiating local requests based on the
> global
> >average reported load
>
> We can already do this on the basis of AIMD's, based purely on behaviour of
> local requests. However, taking more data into account might result in more
> accurate results - although it might also dilute the feedback.
>

Right, AIMD tends to control the rate at which a specific event occurs, like
a dropped IP packet with TCP, or a dropped request as we currently use it.
 This is a bit like signaling that you have a headache by banging your head
against the wall.  It is also a rather coarse way to deliver feedback.

A floating point load number conveys more information in a non-destructive
way.
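
For contrast, a minimal sketch of the two feedback styles (the constants are
illustrative, not taken from the node):

def aimd_update(rate, bad_event, increase=1.0, decrease=0.5):
    # Classic AIMD: creep up by a constant until a drop/reject happens, then halve.
    return rate * decrease if bad_event else rate + increase

def load_scaled_update(rate, reported_load, target=0.7, gain=0.1):
    # Proportional control on a reported load scalar in [0, 1]: no destructive event needed.
    return max(0.0, rate * (1.0 + gain * (target - reported_load)))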

> In any case, this part is "safe": it's not going to cause feedback problems,
> on its own. Conversely, it doesn't deal with problems such as network
> bottlenecks.
>

Well, it depends.  If we are tracking load on a per-outbound-connection
basis then we can potentially take action to manage load per-connection in
addition to overall.


> However, while AIMD's are based on the *global* average load (or rather the
> load of all the nodes along the chain), it sounds like you are proposing to
> only report the *local* load of our direct peers? How are you going to
> compute the global average load?
>

No, its not the local load of our direct peers.  Its the load reported in
all reply messages that we forward, some of those will be from our direct
peers, but most will be from remote peers.


> >- Limit our rate of local requests based on the average load of the
> >connection to the peer it would need to be forwarded to
>
> Not sure I follow here. Are you proposing that we initiate more requests in
> keyspace areas where there is lower load?
>

Well, kinda, except the opposite.  We might reduce requests in keyspaces
where there is higher load.


> >- Detect when remote peers are abusing the system by disregarding load
> -
> >as evidenced by a significantly higher average load in replies
> forwarded to
> >them
>
> Okay, that is useful - although I doubt it would be an unambiguous
> detection, and what would we do about it? We would have to reject requests,
> and if we're not sure, or if it could be a downstream troublemaker, couldn't
> that result in misrouting and make things worse? On the other hand, any
> malicious node would be most easily detected by the nodes directly connected
> to it, so this could work much the same way as fair sharing - provided that
> it can reject or terminate the requests.
>

No, we probably wouldn't start rejecting requests - more likely we would
disconnect from that node.  Obviously we'd need to be fairly sure of abusive
behavior before doing that - but abuse should be rare.


> Arguably fair sharing achieves the same objective. It could be extended in
> various ways e.g. if we are relaying a lot of slow-down notices to a peer we
> could start to think about squeezing the peer's allocation.
>

Yes, except it doesn't work, and it's too complicated to reason about.


> > Of course, lots of questions:
> >
> >- How do we compute the averages?  A 

[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
Ok, thinking about it, here is a proposal, or  rather, the beginning of a
proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
else intended to control load, and replace it with this.  We will absolutely
need to simulate this before we write a single line of code to deploy it.

The core idea is that a node will include a floating point number in
response to any kind of request showing how close that node is to being
overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
completely saturated and must reject requests.  Clearly the goal is to avoid
getting anywhere near to 1.0.

A node tracks several things:

   - What is the overall average load reported by responses this node has
   received
   - What is the average load reported by responses this node has received,
   per remote node
   - What is the average load reported by responses this node forwarded, per
   remote node

I think, given these metrics, we should be able to do the following:

   - Limit our overall rate of initiating local requests based on the global
   average reported load
   - Limit our rate of local requests based on the average load of the
   connection to the peer it would need to be forwarded to
   - Detect when remote peers are abusing the system by disregarding load -
   as evidenced by a significantly higher average load in replies forwarded to
   them

Of course, lots of questions:

   - How do we compute the averages?  A decaying running average of some
   kind?  What time period?
   - How do we translate load into a desired rate of requests?
   - What criteria indicate that a peer is abusing the system?  What is the
   remedy?

This is basically control theory stuff, and we'll need robust answers to
these questions before we proceed (ideally with a theoretical rather than
experimental foundation).
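
To make the first question concrete, one obvious answer is an exponentially
decaying running average; a minimal sketch (the smoothing factor and the mapping to
a request rate are illustrative assumptions, not part of the proposal):

# Sketch: decaying averages of reported load, overall and per peer.
class LoadTracker:
    def __init__(self, alpha=0.05):
        self.alpha = alpha      # smoothing factor; "what time period?" in disguise
        self.overall = 0.0
        self.per_peer = {}      # peer id -> decayed average of loads seen via that peer

    def report(self, peer, load):
        # Record a load value (0.0..1.0) carried by a response involving `peer`.
        self.overall += self.alpha * (load - self.overall)
        prev = self.per_peer.get(peer, load)
        self.per_peer[peer] = prev + self.alpha * (load - prev)

    def desired_request_rate(self, max_rate=10.0):
        # One possible mapping from global average load to a local request rate.
        return max_rate * max(0.0, 1.0 - self.overall)

Whether such a mapping is stable network-wide is exactly what the simulation would
need to show.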

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: ian at freenetproject.org



[freenet-dev] Beyond New Load Management

2011-08-29 Thread Ian Clarke
On Mon, Aug 29, 2011 at 12:37 PM, Matthew Toseland <
toad at amphibian.dyndns.org> wrote:

> That is because we do not have the time or funding to empirically test
> hypotheses. We don't gather enough data, we don't have a huge testnet to try
> stuff on over extended periods, and so on. Most software development has
> deadlines and resource limitations, but most software development doesn't
> NEED to be empirical, at least not very often! See my other mail about this.
>

I don't think the problem is a lack of time or funding.  The problem is that
we come up with solutions that are too complicated to analyze or fix when
they don't work.  Throwing more time and money at that might help, but that
is addressing the symptom, not the cause.  The cause is complexity, which
just grows and grows as we try to fix problems we don't understand by
layering on more complexity.


> Misrouting is unacceptable, in general. Extremely overloaded or extremely
> low capacity nodes may be routed around. We might even allow some bounded
> amount of misrouting in the more general case (e.g. go to either of the top
> two peers for the key). But in general, transforming load into misrouting
> (or into reduced HTL, or any other bogus escape valve) is a bad idea. We
> need to reduce the incoming load.
>

Right, the same is true of queueing.  If nodes are forced to do things to
deal with overloading that make the problem worse, then the load balancing
algorithm has failed.  Its job is to prevent that from happening.


> > What if all nodes told other nodes that they were overloaded because all
> > their peers are overloaded.  Such a situation would basically cause the
> > entire network to think it was overloaded, even though nobody actually
> was!
> >  It becomes a bit like this cartoon: http://flic.kr/p/5npfm2  How can
> this
> > be avoided?
>
> Exactly. There are two possible approaches:
> 1. Nodes control the rate at which *local* requests are initiated, based on
> how many slow-down signals they get.
>
> (This ensures that such gridlock or feedback loop problems can't happen)
>

Ah, so slow-downs only ever affect local request initiation?  The problem
there is that if a peer is overloaded, only those peers that are directly
connected to it will slow down their rate of requests.  Peers elsewhere on
the network will continue to fire requests at it, and there will be no way
for them to know they need to slow down :-/

Unless... what if the DataReply message (or whatever it's called these days)
contains a slow-down request, and it's respected by every peer along that
path according to AIMD?  The problem is that there will be a very weak link
between the rate of requests, and the rate of slow-down messages.  It would
be like having 10,000 individual heaters with thermostats all trying to
regulate the temperature of a gigantic space.  Each individual
thermostat/heater will have an almost imperceptible impact.  Although
perhaps that would work...?

> Fair sharing can limit this. Limiting backoff so that we only misroute when
> nodes are severely overloaded can also help. We could queue (provided this
> happens only rarely and briefly), allow bounded misrouting (e.g. route to
> one of the top two routes), or terminate requests (thus freeing up
> resources).
>

I think "fair sharing" is one of those complexities that we need to go back
to the drawing board on.  Its clear that even now we have hypotheses about
how fair sharing could be causing problems.  Its too complicated to really
reason about, or to anticipate all of its implications.


> - The easiest way to implement #1 is with AIMD's. We can keep them, but
> send slow-down messages earlier.
> - We could keep NLM, i.e. queue for limited times to smooth out delays.


Doesn't that violate the principle that the load balancing stuff shouldn't
do anything that makes the problem worse?  Queueing does exactly that.


> Currently we RNF after a (longish) period waiting for a node to route to.
> We would send a slow-down message when we have queued for more than a
> certain period.  This means we can have a fixed, limited amount of
> misrouting, we can keep NLM's clear benefits for routing accuracy (as
> visible in success rates), and we can ensure that the input load is low
> enough to avoid severe slowdown due to queueing for a long time.
>

No, I think we need to avoid queueing except in case of emergency.  Queueing
only makes things worse by tying up more resources for longer.

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: ian at freenetproject.org



 nodes are severely overloaded can also help. We could queue (provided this 
 happens only rarely and briefly), allow bounded misrouting (e.g. route to one 
 of the top two routes), or terminate requests (thus freeing up 

Re: [freenet-dev] Beyond New Load Management

2011-08-29 Thread Ian Clarke
On Mon, Aug 29, 2011 at 12:37 PM, Matthew Toseland 
t...@amphibian.dyndns.org wrote:

 That is because we do not have the time or funding to empirically test
 hypotheses. We don't gather enough data, we don't have a huge testnet to try
 stuff on over extended periods, and so on. Most software development has
 deadlines and resource limitations, but most software development doesn't
 NEED to be empirical, at least not very often! See my other mail about this.


I don't think the problem is a lack of time or funding.  The problem is that
we come up with solutions that are too complicated to analyze or fix when
they don't work.  Throwing more time and money at that might help, but that
is addressing the symptom, not the cause.  The cause is complexity, which
just grows and grows as we try to fix problems we don't understand by
layering on more complexity.


 Misrouting is unacceptable, in general. Extremely overloaded or extremely
 low capacity nodes may be routed around. We might even allow some bounded
 amount of misrouting in the more general case (e.g. go to either of the top
 two peers for the key). But in general, transforming load into misrouting
 (or into reduced HTL, or any other bogus escape valve) is a bad idea. We
 need to reduce the incoming load.


Right, the same is true of queueing.  If nodes are forced to do things to
deal with overloading that make the problem worse then the load balancing
algorithm has failed.  Its job is to prevent that from happening.


  What if all nodes told other nodes that they were overloaded because all
  their peers are overloaded.  Such a situation would basically cause the
  entire network to think it was overloaded, even though nobody actually
 was!
   It becomes a bit like this cartoon: http://flic.kr/p/5npfm2  How can
 this
  be avoided?

 Exactly. There are two possible approaches:
 1. Nodes control the rate at which *local* requests are initiated, based on
 how many slow-down signals they get.

 (This ensures that such gridlock or feedback loop problems can't happen)


Ah, so slow-downs only ever affect local request initiation?  The problem
there is that if a peer is overloaded, only those peers that are directly
connected to it will slow down their rate of requests.  Peers elsewhere on
the network will continue to fire requests at it, and there will be no way
for them to know they need to slow down :-/

Unless... what if the DataReply message (or whatever it's called these days)
contains a slow-down request, and it's respected by every peer along that
path according to AIMD?  The problem is that there will be a very weak link
between the rate of requests, and the rate of slow-down messages.  It would
be like having 10,000 individual heaters with thermostats all trying to
regulate the temperature of a gigantic space.  Each individual
thermostat/heater will have an almost imperceptible impact.  Although
perhaps that would work...?

Fair sharing can limit this. Limiting backoff so that we only misroute when
 nodes are severely overloaded can also help. We could queue (provided this
 happens only rarely and briefly), allow bounded misrouting (e.g. route to
 one of the top two routes), or terminate requests (thus freeing up
 resources).


I think fair sharing is one of those complexities that we need to go back
to the drawing board on.  It's clear that even now we have hypotheses about
how fair sharing could be causing problems.  It's too complicated to really
reason about, or to anticipate all of its implications.


 - The easiest way to implement #1 is with AIMD's. We can keep them, but
 send slow-down messages earlier.
 - We could keep NLM, i.e. queue for limited times to smooth out delays.


Doesn't that violate the principle that the load balancing stuff shouldn't
do anything that makes the problem worse?  Queueing does exactly that.


 Currently we RNF after a (longish) period waiting for a node to route to.
 We would send a slow-down message when we have queued for more than a
 certain period.  This means we can have a fixed, limited amount of
 misrouting, we can keep NLM's clear benefits for routing accuracy (as
 visible in success rates), and we can ensure that the input load is low
 enough to avoid severe slowdown due to queueing for a long time.


No, I think we need to avoid queueing except in case of emergency.  Queueing
only makes things worse by tying up more resources for longer.

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: i...@freenetproject.org

[freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
Ok, thinking about it, here is a proposal, or  rather, the beginning of a
proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
else intended to control load, and replace it with this.  We will absolutely
need to simulate this before we write a single line of code to deploy it.

The core idea is that a node will include a floating point number in
response to any kind of request showing how close that node is to being
overloaded.  0.0 would mean it's doing nothing, 1.0 would mean that it is
completely saturated and must reject requests.  Clearly the goal is to avoid
getting anywhere near to 1.0.

A node tracks several things:

   - What is the overall average load reported by responses this node has
   received
   - What is the average load reported by responses this node has received,
   per remote node
   - What is the average load reported by responses this node forwarded, per
   remote node

I think, given these metrics, we should be able to do the following:

   - Limit our overall rate of initiating local requests based on the global
   average reported load
   - Limit our rate of local requests based on the average load of the
   connection to the peer it would need to be forwarded to
   - Detect when remote peers are abusing the system by disregarding load -
   as evidenced by a significantly higher average load in replies forwarded to
   them

Of course, lots of questions:

   - How do we compute the averages?  A decaying running average of some
   kind?  What time period?
   - How do we translate load into a desired rate of requests?
   - What criteria indicate that a peer is abusing the system?  What is the
   remedy?

This is basically control theory stuff, and we'll need robust answers to
these questions before we proceed (ideally with a theoretical rather than
experimental foundation).
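
As a very rough sketch of what the bookkeeping might look like (class and
method names here are invented for illustration, this is not existing Freenet
code), assuming a simple exponentially decaying average:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: exponentially decaying averages of the 0.0-1.0
 *  load figure carried in replies, tracked globally and per peer. */
public class LoadTracker {
    private static final double ALPHA = 0.05; // decay factor per sample (a tunable)

    private volatile double globalAverage = 0.0;
    private final Map<String, Double> receivedByPeer = new ConcurrentHashMap<>();
    private final Map<String, Double> forwardedToPeer = new ConcurrentHashMap<>();

    /** Called for every reply this node receives (its own or relayed requests). */
    public synchronized void onReplyReceived(String peer, double reportedLoad) {
        globalAverage = ema(globalAverage, reportedLoad);
        receivedByPeer.merge(peer, reportedLoad, (old, sample) -> ema(old, sample));
    }

    /** Called for every reply this node forwards downstream to a peer. */
    public synchronized void onReplyForwarded(String peer, double reportedLoad) {
        forwardedToPeer.merge(peer, reportedLoad, (old, sample) -> ema(old, sample));
    }

    private static double ema(double old, double sample) {
        return old + ALPHA * (sample - old);
    }

    /** Crude mapping from average load to a local request rate (requests/sec).
     *  The shape of this curve is exactly the open "control theory" question. */
    public double allowedLocalRequestRate(double maxRate) {
        double headroom = Math.max(0.0, 1.0 - globalAverage);
        return maxRate * headroom;
    }
}

The ALPHA decay factor and the shape of allowedLocalRequestRate() are exactly
the tunables asked about above.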

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: i...@freenetproject.org

Re: [freenet-dev] Beyond New Load Management

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 18:58:26 Ian Clarke wrote:
 On Mon, Aug 29, 2011 at 12:37 PM, Matthew Toseland 
 t...@amphibian.dyndns.org wrote:
 
  That is because we do not have the time or funding to empirically test
  hypotheses. We don't gather enough data, we don't have a huge testnet to try
  stuff on over extended periods, and so on. Most software development has
  deadlines and resource limitations, but most software development doesn't
  NEED to be empirical, at least not very often! See my other mail about this.
 
 I don't think the problem is a lack of time or funding.  The problem is that
 we come up with solutions that are too complicated to analyze or fix when
 they don't work.  Throwing more time and money at that might help, but that
 is addressing the symptom, not the cause.  The cause is complexity, which
 just grows and grows as we try to fix problems we don't understand by
 layering on more complexity.

Perhaps. Ripping everything out and starting again is always tempting 
especially when subordinates have been doing things without asking you. :)
 
  Misrouting is unacceptable, in general. Extremely overloaded or extremely
  low capacity nodes may be routed around. We might even allow some bounded
  amount of misrouting in the more general case (e.g. go to either of the top
  two peers for the key). But in general, transforming load into misrouting
  (or into reduced HTL, or any other bogus escape valve) is a bad idea. We
  need to reduce the incoming load.
 
 Right, the same is true of queueing.  If nodes are forced to do things to
 deal with overloading that make the problem worse then the load balancing
 algorithm has failed.  Its job is to prevent that from happening.

Okay.
 
   What if all nodes told other nodes that they were overloaded because all
   their peers are overloaded.  Such a situation would basically cause the
   entire network to think it was overloaded, even though nobody actually
  was!
It becomes a bit like this cartoon: http://flic.kr/p/5npfm2  How can
  this
   be avoided?
 
  Exactly. There are two possible approaches:
  1. Nodes control the rate at which *local* requests are initiated, based on
  how many slow-down signals they get.
 
  (This ensures that such gridlock or feedback loop problems can't happen)
 
 Ah, so slow-downs only ever affect local request initiation?  The problem
 there is that if a peer is overloaded, only those peers that are directly
 connected to it will slow down their rate of requests.  Peers elsewhere on
 the network will continue to fire requests at it, and there will be no way
 for them to know they need to slow down :-/

No, because the slow-downs are propagated back to the original request 
originator, and only the request originator takes notice of them. That's how it 
works now.
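
Schematically (invented names, not the actual code paths), something like:

/** Hypothetical sketch: the slow-down indication rides back along the reply
 *  path; intermediate nodes only relay it, the originator acts on it. */
public class SlowDownPropagation {

    static class Reply { final boolean slowDown; Reply(boolean s) { slowDown = s; } }
    interface Peer { void send(Reply reply); }
    interface RateLimiter { void onSlowDown(); void onSuccess(); }

    private final RateLimiter localRequests;   // e.g. the node's AIMD for local requests

    public SlowDownPropagation(RateLimiter localRequests) {
        this.localRequests = localRequests;
    }

    public void handleReply(Reply reply, boolean weOriginatedThisRequest, Peer previousHop) {
        if (weOriginatedThisRequest) {
            // Only the request originator slows down (or speeds up) its own requests.
            if (reply.slowDown) localRequests.onSlowDown();
            else localRequests.onSuccess();
        } else {
            // Everyone else just passes the reply, flag included, towards the originator.
            previousHop.send(reply);
        }
    }
}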
 
 Unless... what if the DataReply message (or whatever its called these days)
 contains a slow-down request, and its respected by every peer along that
 path according to AIMD?  The problem is that there will be a very week link
 between the rate of requests, and the rate of slow-down messages.  It would
 be like having 10,000 individual heaters with thermostats all trying to
 regulate the temperature of a gigantic space.  Each individual
 thermostat/heater will have an almost imperceptible impact.  Although
 perhaps that would work...?

The rest is plausible - feedback might be too slow, but this can be helped in 
part by an increased window size between the point at which we start sending 
slow-down messages and the point at which things start breaking more badly.
 
 Fair sharing can limit this. Limiting backoff so that we only misroute when
  nodes are severely overloaded can also help. We could queue (provided this
  happens only rarely and briefly), allow bounded misrouting (e.g. route to
  one of the top two routes), or terminate requests (thus freeing up
  resources).
 
 I think fair sharing is one of those complexities that we need to go back
 to the drawing board on.  Its clear that even now we have hypotheses about
 how fair sharing could be causing problems.  Its too complicated to really
 reason about, or to anticipate all of its implications.

It is vital for:
1. Security: It quashes DoS attacks/floods quickly. Without it, any attacker 
can easily swamp any node it is connected to, using all of its capacity for 
sending its flood onward. With it, they are limited to a fraction of the node's 
capacity. This is a concern that load management must deal with!
2. Nodes with low numbers of peers, and slower nodes in general, to get a look 
in, rather than be rejected at the same rate as everyone else.
 
  - The easiest way to implement #1 is with AIMD's. We can keep them, but
  send slow-down messages earlier.
  - We could keep NLM, i.e. queue for limited times to smooth out delays.
 
 Doesn't that violate the principal that the load balancing stuff shouldn't
 do anything that makes the problem worse?  Queueing does exactly that.

So what do you 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
 Ok, thinking about it, here is a proposal, or  rather, the beginning of a
 proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
 else intended to control load, and replace it with this.  We will absolutely
 need to simulate this before we write a single line of code to deploy it.

I have justified fair sharing in my other reply. However alternate mechanisms 
might replace it.

I agree we should simulate, although this might be significant work.
 
 The core idea is that a node will include a floating point number in
 response to any kind of request showing how close that node is to being
 overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
 completely saturated and must reject requests.  Clearly the goal is to avoid
 getting anywhere near to 1.0.
 
 A node tracks several things:

All of these apply to both locally originated requests and remotely originated 
requests? And are they for only our direct peers?
 
- What is the overall average load reported by responses this node has
received
- What is the average load reported by responses this node has received,
per remote node
- What is the average load reported by responses this node forwarded, per
remote node

Ahhh, this one could be interesting - you could use it to penalise nodes which 
spam excessively.
 
 I think, given these metrics, we should be able to do the following:
 
- Limit our overall rate of initiating local requests based on the global
average reported load

We can already do this on the basis of AIMD's, based purely on behaviour of 
local requests. However, taking more data into account might result in more 
accurate results - although it might also dilute the feedback.

In any case, this part is safe: it's not going to cause feedback problems, on 
its own. Conversely, it doesn't deal with problems such as network bottlenecks.
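
For the record, the AIMD mechanism being referred to is essentially this (a
simplified sketch with made-up names, not the real implementation; the
"window" here is how many local requests may run in parallel):

/** Hypothetical AIMD sketch: the request window grows additively while
 *  requests complete normally and is cut multiplicatively on a slow-down
 *  signal, just like TCP congestion control. */
public class AimdRequestWindow {
    private double window = 1.0;                   // max concurrent local requests
    private static final double INCREASE = 1.0;    // additive increase per window's worth
    private static final double DECREASE = 0.5;    // multiplicative decrease factor
    private static final double MIN_WINDOW = 1.0;

    /** A request completed without any slow-down indication. */
    public synchronized void onSuccess() {
        window += INCREASE / window;               // roughly +1 per full window of requests
    }

    /** A slow-down message (or, today, a rejection/drop) was received. */
    public synchronized void onSlowDown() {
        window = Math.max(MIN_WINDOW, window * DECREASE);
    }

    public synchronized boolean mayStartRequest(int currentlyRunning) {
        return currentlyRunning < (int) Math.floor(window);
    }
}

Sending slow-down messages earlier simply means onSlowDown() is triggered
before the node actually has to reject or drop anything.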

However, while AIMD's are based on the *global* average load (or rather the 
load of all the nodes along the chain), it sounds like you are proposing to 
only report the *local* load of our direct peers? How are you going to compute 
the global average load?

- Limit our rate of local requests based on the average load of the
connection to the peer it would need to be forwarded to

Not sure I follow here. Are you proposing that we initiate more requests in 
keyspace areas where there is lower load?

- Detect when remote peers are abusing the system by disregarding load -
as evidenced by a significantly higher average load in replies forwarded to
them

Okay, that is useful - although I doubt it would be an unambiguous detection, 
and what would we do about it? We would have to reject requests, and if we're 
not sure, or if it could be a downstream troublemaker, couldn't that result in 
misrouting and make things worse? On the other hand, any malicious node would 
be most easily detected by the nodes directly connected to it, so this could 
work much the same way as fair sharing - provided that it can reject or 
terminate the requests.

Arguably fair sharing achieves the same objective. It could be extended in 
various ways e.g. if we are relaying a lot of slow-down notices to a peer we 
could start to think about squeezing the peer's allocation.
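
By "squeezing the peer's allocation" I mean something roughly like the
following sketch (hypothetical names, and a simplification of what fair
sharing actually does):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical fair-sharing sketch: each peer gets an equal share of the
 *  node's relaying capacity, scaled down if we keep relaying slow-down
 *  notices back to that peer. */
public class FairShareLimiter {
    private final int totalCapacity;                       // max requests we relay at once
    private final Map<String, Integer> running = new ConcurrentHashMap<>();
    private final Map<String, Double> penalty = new ConcurrentHashMap<>(); // 1.0 = no penalty

    public FairShareLimiter(int totalCapacity) {
        this.totalCapacity = totalCapacity;
    }

    public synchronized boolean acceptRequestFrom(String peer, int peerCount) {
        double share = (double) totalCapacity / Math.max(1, peerCount);
        double allowed = share * penalty.getOrDefault(peer, 1.0);
        int current = running.getOrDefault(peer, 0);
        if (current >= allowed) return false;              // over its fair share: reject
        running.merge(peer, 1, Integer::sum);
        return true;
    }

    public synchronized void requestFinished(String peer) {
        running.merge(peer, -1, Integer::sum);
    }

    /** Called when we relay a slow-down notice to this peer: shrink its share a little. */
    public synchronized void onSlowDownRelayedTo(String peer) {
        penalty.merge(peer, 0.9, (old, factor) -> Math.max(0.1, old * factor));
    }
}

The point is that a flooding peer only ever ties up its own share, not the
whole node.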
 
 Of course, lots of questions:
 
- How do we compute the averages?  A decaying running average of some
kind?  What time period?

Yay more tunables. :(

- How do we translate load into a desired rate of requests?

And even more tunables. :(

- What criteria indicate that a peer is abusing the system?  What is the
remedy?

In the current system, don't accept their requests.

In ANY system, there will always be a borderline below which they can get away 
with it. We can either try to make this as small as possible or try not to hit 
legitimate traffic too hard.
 
 This is basically control theory stuff, and we'll need robust answers to
 these questions before we proceed (ideally with a theoretical rather than
 experimental foundation).

IMHO small tweaks to AIMDs, combined with fair sharing, will achieve much the 
same without having to wait years for our theoreticians to wake up.



Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 20:09:58 Matthew Toseland wrote:
 On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
  Ok, thinking about it, here is a proposal, or  rather, the beginning of a
  proposal.  I'm assuming that we get rid of NLM, fair sharing, and anything
  else intended to control load, and replace it with this.  We will absolutely
  need to simulate this before we write a single line of code to deploy it.
 
 I have justified fair sharing in my other reply. However alternate mechanisms 
 might replace it.
 
 I agree we should simulate, although this might be significant work.
  
  The core idea is that a node will include a floating point number in
  response to any kind of request showing how close that node is to being
  overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
  completely saturated and must reject requests.  Clearly the goal is to avoid
  getting anywhere near to 1.0.
  
  A node tracks several things:
 
 All of these apply to both locally originated requests and remotely 
 originated requests? And are they for only our direct peers?
  
 - What is the overall average load reported by responses this node has
 received
 - What is the average load reported by responses this node has received,
 per remote node
 - What is the average load reported by responses this node forwarded, per
 remote node
 
 Ahhh, this one could be interesting - you could use it to penalise nodes 
 which spam excessively.

Actually, thinking about it, I don't think this will work. Requests are usually 
distributed evenly across the keyspace, and downstream nodes only know they are 
from this node - not which previous node they are from. So the pain will be 
shared across all the peers sending us requests, including our own local 
requests, and won't be identifiable as due to any single node. You have to do 
it by volume, not by reported load.
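
"By volume" could be something as simple as this sketch (invented names,
purely illustrative):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of volume accounting: count requests accepted from
 *  each peer with exponential decay, and flag peers far above the mean. */
public class RequestVolumeMonitor {
    private static final double DECAY = 0.999;   // applied on every accepted request
    private final Map<String, Double> volume = new ConcurrentHashMap<>();

    public synchronized void onRequestAccepted(String peer) {
        // Decay everyone a little, then credit the sender.
        volume.replaceAll((p, v) -> v * DECAY);
        volume.merge(peer, 1.0, Double::sum);
    }

    /** True if this peer is sending far more than the average peer's traffic. */
    public synchronized boolean looksExcessive(String peer, double factor) {
        if (volume.isEmpty()) return false;
        double mean = volume.values().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
        return volume.getOrDefault(peer, 0.0) > factor * mean;
    }
}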



Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland t...@amphibian.dyndns.org
 wrote:

 On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
  Ok, thinking about it, here is a proposal, or  rather, the beginning of a
  proposal.  I'm assuming that we get rid of NLM, fair sharing, and
 anything
  else intended to control load, and replace it with this.  We will
 absolutely
  need to simulate this before we write a single line of code to deploy it.

 I have justified fair sharing in my other reply. However alternate
 mechanisms might replace it.

 I agree we should simulate, although this might be significant work.


I doubt it would be more work than trying it without simulation and failing
for a decade, which was our previous approach.


  The core idea is that a node will include a floating point number in
  response to any kind of request showing how close that node is to being
  overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
  completely saturated and must reject requests.  Clearly the goal is to
 avoid
  getting anywhere near to 1.0.
 
  A node tracks several things:

 All of these apply to both locally originated requests and remotely
 originated requests? And are they for only our direct peers?


Every peer along the reply path will take note of the load value, so it will
be a mixture of the loads of close-by peers, and more distant peers.
 Obviously, since we are more likely to receive messages from peers that are
closer, our average will automatically be biased towards closer peers.

- What is the average load reported by responses this node forwarded,
 per
 remote node

 Ahhh, this one could be interesting - you could use it to penalise nodes
 which spam excessively.


Exactly.


  I think, given these metrics, we should be able to do the following:
 
 - Limit our overall rate of initiating local requests based on the
 global
 average reported load

 We can already do this on the basis of AIMD's, based purely on behaviour of
 local requests. However, taking more data into account might result in more
 accurate results - although it might also dilute the feedback.


Right, AIMD tends to control the rate at which a specific event occurs, like
a dropped IP packet with TCP, or a dropped request as we currently use it.
 This is a bit like signaling that you have a headache by banging your head
against the wall.  It is also a rather coarse way to deliver feedback.

A floating point load number conveys more information in a non-destructive
way.

In any case, this part is safe: it's not going to cause feedback problems,
 on its own. Conversely, it doesn't deal with problems such as network
 bottlenecks.


Well, it depends.  If we are tracking load on a per-outbound-connection
basis then we can potentially take action to manage load per-connection in
addition to overall.


 However, while AIMD's are based on the *global* average load (or rather the
 load of all the nodes along the chain), it sounds like you are proposing to
 only report the *local* load of our direct peers? How are you going to
 compute the global average load?


No, it's not the local load of our direct peers.  It's the load reported in
all reply messages that we forward; some of those will be from our direct
peers, but most will be from remote peers.


 - Limit our rate of local requests based on the average load of the
 connection to the peer it would need to be forwarded to

 Not sure I follow here. Are you proposing that we initiate more requests in
 keyspace areas where there is lower load?


Well, kinda, except the opposite.  We might reduce requests in keyspaces
where there is higher load.


 - Detect when remote peers are abusing the system by disregarding load
 -
 as evidenced by a significantly higher average load in replies
 forwarded to
 them

 Okay, that is useful - although I doubt it would be an unambiguous
 detection, and what would we do about it? We would have to reject requests,
 and if we're not sure, or if it could be a downstream troublemaker, couldn't
 that result in misrouting and make things worse? On the other hand, any
 malicious node would be most easily detected by the nodes directly connected
 to it, so this could work much the same way as fair sharing - provided that
 it can reject or terminate the requests.


No, we probably wouldn't start rejecting requests - more likely we would
disconnect from that node.  Obviously we'd need to be fairly sure of abusive
behavior before doing that - but abuse should be rare.


 Arguably fair sharing achieves the same objective. It could be extended in
 various ways e.g. if we are relaying a lot of slow-down notices to a peer we
 could start to think about squeezing the peer's allocation.


Yes, except it doesn't work, and it's too complicated to reason about.


  Of course, lots of questions:
 
 - How do we compute the averages?  A decaying running average of some
 kind?  What time period?

 Yay more tunables. 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Ian Clarke
On Mon, Aug 29, 2011 at 2:21 PM, Matthew Toseland t...@amphibian.dyndns.org
 wrote:

   - What is the average load reported by responses this node
 forwarded, per
  remote node
 
  Ahhh, this one could be interesting - you could use it to penalise nodes
 which spam excessively.

 Actually, thinking about it, I don't think this will work. Requests are
 usually distributed evenly across the keyspace, and downstream nodes only
 know they are from this node - not which previous node they are from. So the
 pain will be shared across all the peers sending us requests, including our
 own local requests, and won't be identifiable as due to any single node. You
 have to do it by volume, not by reported load.


But if a node is being abusive, won't the nodes it's requesting from tend to
have a higher than average load?  Perhaps you are right though, but this is
the kind of thought-process we need.

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: i...@freenetproject.org

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Arne Babenhauserheide
Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke:
 Yes, small tweaks have worked so well for us for the last decade, leaving us
 pretty-much where we were in 2003.  No, we don't understand how the current
 system works, there is no point in trying to fix something when we don't
 even know what is broken.

I’d like to present a clue as to what is broken in NLM. Before I kill you with the
log, here’s the result:


With NLM the latency of a request is a function of the raw bandwidth
(not so with OLM), and NLM used only half my bandwidth after it had
been deployed for 2 days (at the start much more).


τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue time
(time to find the node), nlm: new load management, olm: old load management.



So first step: make sure all bandwidth gets used - maybe by allocating more
slots till we use all allowed bandwidth. Better to have to throttle a transfer
than to leave bandwidth unused.


*With the current network, NLM should be slower than OLM by 23%. But in 18
months it should actually be faster by ~8% — given Moore's Law holds for upload
bandwidth — because the routes are shorter.*


The main advantage of NLM is that it should be much more resilient against
attackers (DoS).


Now to the log - it’s math and not cleaned up; you have been warned :)


ArneBab SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf.
ArneBab queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀,
w: success rate
ArneBab ψs = τ(h) + q(h)
ArneBab ψf = q(h)
ArneBab ψ ~ w₁·ψs + (1-w₁)·ψf
ArneBab σs = τ(h) + q(h)
ArneBab σf = q(h)
ArneBab σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
ArneBab num(ψ) / num(σ) ~ 1
ArneBab → time ~ σ + ψ
ArneBab q(h) depends on timeouts, as do w₁ and w₂
ArneBab time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + (1-w₂)·ψf
ArneBab = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-
w₂)·q(h)
ArneBab = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
ArneBab = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
ArneBab in the congestion case q(h) ~ timeout
ArneBab timeout = o
ArneBab timeout: o
ArneBab w depends on the timeout *somehow*, but inversely
ArneBab o=0 → w=0
ArneBab assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
ArneBab assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
ArneBab correction: in the congestion case: q(h) ~ min(timeout, τ(h))
ArneBab timeout matters for q(h) only when timeout  τ(h)
ArneBab I try to: I still need a dependency of w on timeout
ArneBab … lets call it t(w)
ArneBab better: w(o) :)
toad_ well, if there is a timeout, we have a fixed time, but we reduce the
hops ...
toad_ i thought w was success rate
ArneBab ah!
ArneBab and the success rates where in the NLM stats
ArneBab going mostly smoothly from 60% to 0%
ArneBab for the HTL
toad_ right, success rate peaks at 18 or sometimes 16
toad_ what are w1 vs w2?
toad_ chk vs ssk i guess
ArneBab yes
-*- toad_ thinks considering both is probably overambitious at this stage?
ArneBab should not be too bad: SSKs drop much more rapidly at decreasing
hops
ArneBab hops→HTL
toad_ ψs is time for a successful chk; ψf is time for a failed chk ... in
which case h in the first instance is low, and in the second instance is h0
ArneBab yes
toad_ okay, i don't follow this line: time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs +
(1-w₂)·ψf
toad_ i thought w2 related to SSKs?
ArneBab uh, yes…
ArneBab time =  w₁·ψs + (1-w₁)·ψf +  w₂·σs + (1-w₂)·σf
toad_ you have to appreciate i'm only just getting back into maths and
physics after 12 years ...
toad_ (retaking a-levels to get a degree)
ArneBab no probs, I’m also no expert in this. I try to get a relation
between the time and the timeout, so we can try to find a minimum
toad_ in any case, there are two different h's for the two uses of q(h) - h0
and h_avg
toad_ h_avg for success and h0 for failure
ArneBab hm, yes
ArneBab which makes this harder…
ArneBab it’s wrong anyway… the q(h_avg) was missing
toad_ h_avg is somewhere between 5 and 10 imho
toad_ at least it is if everything is working well and the input load isn't
all really popular stuff (in which case it's answered quickly and can be
ignored)
ArneBab = τ(h) · (w₁+w₂) + q(h) · (2-w₁-w₂) + q(h_avg) · (w₁+w₂)
ArneBab would have been too easy :)
toad_ okay so here q(h) means q(h0) i.e. h = h0, max hops?
ArneBab jepp, and max hops sinks with falling timeout
toad_ hmm?
ArneBab the max actual hops
toad_ on the upside, q() is linear
-- Torgal (~Torgal@78.251.49.8) has left the network (Ping timeout: 276
seconds)
toad_ hopefully
ArneBab yes: q(h) = h·o
ArneBab (in the congestion case)
toad_ the problem i have is it looks like q(1) ~= time [ for a full request
], unless load is very low
ArneBab so τ « q?
ArneBab τ much smaller than q?
toad_ of course it's bounded by timeouts, but i'd expect a runaway feedback
loop until it reaches heavy timeouts and effectively cuts the htl
toad_ well, with OLM, success time for a CHK is 1m25s, unsuccessful is
19sec, so transfer time is at least 1 minute
toad_ and less than 1m25; but with NLM, unsuccessful is 3 min+
ArneBab 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
Okay, I don't understand most of that; I might be able to check the math if it 
was written properly, but it looks difficult. However, as far as I can see:
- The most obvious way to increase bandwidth usage would be to increase the 
timeout time for output bandwidth liability (and at the same time increase the 
relevant block transfer timeouts).
- This would increase the number of slots but it would also increase the number 
of requests seeking them; I don't see why it would help matters.
- Running an excessive number of requests without adjusting the block transfer 
timeouts would result in some of the transfers timing out.
- It would also, as you mention, make SSK requests (especially failed SSK 
requests) even slower.
- I am quite confident that Moore's Law DOES NOT hold for upload bandwidth.
- As far as I can see, all the benefits of NLM re attackers are achieved by 
fair sharing.
- A system which is more resistant to attacks but slower probably isn't all 
that interesting if the attacks in question are relatively expensive anyway.
- Smaller block sizes would have a significant efficiency cost, and would 
probably make load management more difficult.

I apologise if you see this as rather a broadside after I encouraged you to 
analyse the problem, but it is not yet a convincing demonstration that queueing 
is a viable strategy and not the spawn of satan! :)

On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
 Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke:
  Yes, small tweaks have worked so well for us for the last decade, leaving us
  pretty-much where we were in 2003.  No, we don't understand how the current
  system works, there is no point in trying to fix something when we don't
  even know what is broken.
 
 I’d like to present a clue what is broken in NLM. Before I kill you with the 
 log, here’s the result: 
 
   With NLM the latency of a request is a function of the raw bandwidth 
   (not so with OLM), and NLM used only half my bandwidth after it had 
   been deployed for 2 days (at the start much more).
 
 τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue time 
 (time to find the node), nlm: new load management, olm: old load management.
 
 So first step: make sure all bandwidth gets used - maybe by allocating more 
 slots till we use all allowed bandwidth. Better having to throttle a transfer 
 than not using bandwidth.
 
 *NLM should with the current network be slower than OLM by 23%. But in 18 
 months it should actually be faster by ~8% — given Moores Law holds for 
 upload 
 bandwidth — because the routes are shorter.*
 
 The main advantage of NLM is, that it should be much more resilient against 
 attackers (DoS).
 
 Now to the log - it’s math and not cleaned up; you have been warned :)
 
 ArneBab SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf. 
 ArneBab queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, 
 w: success rate
 ArneBab ψs = τ(h) + q(h)
 ArneBab ψf = q(h)
 ArneBab ψ ~ w₁·ψs + (1-w₁)·ψf
 ArneBab σs = τ(h) + q(h)
 ArneBab σf = q(h)
 ArneBab σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
 ArneBab num(ψ) / num(σ) ~ 1
 ArneBab → time ~ σ + ψ
 ArneBab q(h) depends on timeouts, as do w₁ and w₂
 ArneBab time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + (1-w₂)·ψf
 ArneBab = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-
 w₂)·q(h)
 ArneBab = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
 ArneBab = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
 ArneBab in the congestion case q(h) ~ timeout
 ArneBab timeout = o
 ArneBab timeout: o
 ArneBab w depends on the timeout *somehow*, but inversely
 ArneBab o=0 → w=0
 ArneBab assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
 ArneBab assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
 ArneBab correction: in the congestion case: q(h) ~ min(timeout, τ(h))
 ArneBab timeout matters for q(h) only when timeout  τ(h)
 ArneBab I try to: I still need a dependency of w on timeout
 ArneBab … lets call it t(w)
 ArneBab better: w(o) :)
 toad_ well, if there is a timeout, we have a fixed time, but we reduce the 
 hops ...
 toad_ i thought w was success rate
 ArneBab ah! 
 ArneBab and the success rates where in the NLM stats
 ArneBab going mostly smoothly from 60% to 0%
 ArneBab for the HTL
 toad_ right, success rate peaks at 18 or sometimes 16
 toad_ what are w1 vs w2?
 toad_ chk vs ssk i guess
 ArneBab yes
 -*- toad_ thinks considering both is probably overambitious at this stage?
 ArneBab should not be too bad: SSKs drop much more rapidly at decreasing 
 hops
 ArneBab hops→HTL
 toad_ ψs is time for a successful chk; ψf is time for a failed chk ... in 
 which case h in the first instance is low, and in the second instance is h0
 ArneBab yes
 toad_ okay, i don't follow this line: time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + 
 (1-w₂)·ψf
 toad_ i thought w2 related to SSKs?
 ArneBab uh, yes…
 ArneBab time =  w₁·ψs + (1-w₁)·ψf +  w₂·σs + (1-w₂)·σf
 toad_ you have to appreciate i'm only just getting back into maths and 
 physics after 12 years ...
 toad_ 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Monday 29 Aug 2011 20:32:13 Ian Clarke wrote:
 On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland t...@amphibian.dyndns.org
  wrote:
 
  On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
   Ok, thinking about it, here is a proposal, or  rather, the beginning of a
   proposal.  I'm assuming that we get rid of NLM, fair sharing, and
  anything
   else intended to control load, and replace it with this.  We will
  absolutely
   need to simulate this before we write a single line of code to deploy it.
 
  I have justified fair sharing in my other reply. However alternate
  mechanisms might replace it.
 
  I agree we should simulate, although this might be significant work.
 
 
 I doubt it would be more work than trying it without simulation and failing
 for a decade, which was our previous approach.
 
 
   The core idea is that a node will include a floating point number in
   response to any kind of request showing how close that node is to being
   overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
   completely saturated and must reject requests.  Clearly the goal is to
  avoid
   getting anywhere near to 1.0.
  
   A node tracks several things:
 
  All of these apply to both locally originated requests and remotely
  originated requests? And are they for only our direct peers?
 
 
 Every peer along the reply path will take note of the load value, so it will
 be a mixture of the loads of close-by peers, and more distant peers.
  Obviously, since we are more likely to receive messages from peers that are
 closer, our average will automatically be biased towards closer peers.

So it is from the data source? Or every node along the way?
 
 - What is the average load reported by responses this node forwarded,
  per
  remote node
 
  Ahhh, this one could be interesting - you could use it to penalise nodes
  which spam excessively.
 
 Exactly.
 
   I think, given these metrics, we should be able to do the following:
  
  - Limit our overall rate of initiating local requests based on the
  global
  average reported load
 
  We can already do this on the basis of AIMD's, based purely on behaviour of
  local requests. However, taking more data into account might result in more
  accurate results - although it might also dilute the feedback.
 
 Right, AIMD tends to control the rate at which a specific event occurs, like
 a dropped IP packet with TCP, or a dropped request as we currently use it.
  This is a bit like signaling that you have a headache by banging your head
 against the wall.  It is also a rather coarse way to deliver feedback.

Fair point.
 
 A floating point load number conveys more information in a non-destructive
 way.
 
 In any case, this part is safe: it's not going to cause feedback problems,
  on its own. Conversely, it doesn't deal with problems such as network
  bottlenecks.
 
 Well, it depends.  If we are tracking load on a per-outbound-connection
 basis then we can potentially take action to manage load per-connection in
 addition to overall.

Not sure I follow here. Routing by load is bad, unless it is severely limited.
 
  However, while AIMD's are based on the *global* average load (or rather the
  load of all the nodes along the chain), it sounds like you are proposing to
  only report the *local* load of our direct peers? How are you going to
  compute the global average load?
 
 No, its not the local load of our direct peers.  Its the load reported in
 all reply messages that we forward, some of those will be from our direct
 peers, but most will be from remote peers.
 
  - Limit our rate of local requests based on the average load of the
  connection to the peer it would need to be forwarded to
 
  Not sure I follow here. Are you proposing that we initiate more requests in
  keyspace areas where there is lower load?
 
 Well, kinda, except the opposite.  We might reduce requests in keyspaces
 where there is higher load.
 
 
  - Detect when remote peers are abusing the system by disregarding load
  -
  as evidenced by a significantly higher average load in replies
  forwarded to
  them
 
  Okay, that is useful - although I doubt it would be an unambiguous
  detection, and what would we do about it? We would have to reject requests,
  and if we're not sure, or if it could be a downstream troublemaker, couldn't
  that result in misrouting and make things worse? On the other hand, any
  malicious node would be most easily detected by the nodes directly connected
  to it, so this could work much the same way as fair sharing - provided that
  it can reject or terminate the requests.
 
 No, we probably wouldn't start rejecting requests - more likely we would
 disconnect from that node.  Obviously we'd need to be fairly sure of abusive
 behavior before doing that - but abuse should be rare.

Well then we'd need to be very sure, and this means you could get away with a 
lot before triggering it. Weird topologies for instance could cause 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Arne Babenhauserheide
After discussing and going deeper into this, it became apparent that the
problem is not overload of queues. I’ll repeat the result I came to here (and
not in chatlog form :) ):

The problem is in the load limiter:


1) we need to use all bandwidth, because latency depends on the number of
finishing transfers.

2) SSKs fail 85% of the time.

3) the load limiter assumes that all SSKs succeed and reserves bandwidth for
them.

4) this now bites us, because NLM has shorter transfer times and longer wait
times. So the over-reservation of bandwidth for SSKs lasts longer.

5) solution: count each SSK as only
average_SSK_success_rate * data_to_transfer_on_success.


To (1): When we have less bandwidth, fewer CHKs succeed per second, so 
fewer slots get freed and the queueing time is bigger.
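
A minimal sketch of (5), assuming the limiter works by summing the bytes each
accepted request might still have to transfer (class names and block sizes
here are illustrative, not the actual limiter code):

/** Hypothetical sketch of the proposed change to the output-bandwidth
 *  liability limiter: count an SSK at its *expected* cost rather than
 *  assuming every SSK will succeed and transfer data. */
public class BandwidthLiability {
    private static final int CHK_DATA_BYTES = 32 * 1024;  // roughly one CHK data block
    private static final int SSK_DATA_BYTES = 1024;        // roughly one SSK payload
    private double sskSuccessRate = 0.15;                   // measured, e.g. ~15%

    private double reservedBytes = 0.0;
    private final double limitBytes;

    public BandwidthLiability(double limitBytes) {
        this.limitBytes = limitBytes;
    }

    public synchronized boolean tryAcceptChk() {
        return tryReserve(CHK_DATA_BYTES);
    }

    public synchronized boolean tryAcceptSsk() {
        // The proposal: reserve success_rate * payload, not the full payload.
        return tryReserve(sskSuccessRate * SSK_DATA_BYTES);
    }

    private boolean tryReserve(double bytes) {
        if (reservedBytes + bytes > limitBytes) return false;
        reservedBytes += bytes;
        return true;
    }

    public synchronized void requestFinished(boolean wasSsk) {
        reservedBytes -= wasSsk ? sskSuccessRate * SSK_DATA_BYTES : CHK_DATA_BYTES;
        if (reservedBytes < 0) reservedBytes = 0;
    }
}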

Best wishes,
Arne



Am Montag, 29. August 2011, 22:27:47 schrieb Matthew Toseland:
 Okay, I don't understand most of that, I might be able to check the math if
 it was written properly, but it looks difficult. However, as far as I can
 see: - The most obvious way to increase bandwidth usage would be to
 increase the timeout time for output bandwidth liability (and at the same
 time increase the relevant block transfer timeouts). - This would increase
 the number of slots but it would also increase the number of requests
 seeking them; I don't see why it would help matters. - Running an excessive
 number of requests without adjusting the block transfer timeouts would
 result in some of the transfers timing out. - It would also, as you
 mention, make SSK requests (especially failed SSK requests) even slower. -
 I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. -
 As far as I can see, all the benefits of NLM re attackers are achieved by
 fair sharing. - A system which is more resistant to attacks but slower
 probably isn't all that interesting if the attacks in question are
 relatively expensive anyway. - Smaller block sizes would have a significant
 efficiency cost, and would probably make load management more difficult.

 I apologise if you see this as rather a broadside after I encouraged you to
 analyse the problem, but it is not yet a convincing demonstration that
 queueing is a viable strategy and not the spawn of satan! :)
 On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
  Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke:
   Yes, small tweaks have worked so well for us for the last decade,
   leaving us pretty-much where we were in 2003.  No, we don't
   understand how the current system works, there is no point in
   trying to fix something when we don't even know what is broken.
 
  I’d like to present a clue what is broken in NLM. Before I kill you with
  the
  log, here’s the result:
  With NLM the latency of a request is a function of the raw bandwidth
  (not so with OLM), and NLM used only half my bandwidth after it had
  been deployed for 2 days (at the start much more).
 
  τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue
  time (time to find the node), nlm: new load management, olm: old load
  management.
 
  So first step: make sure all bandwidth gets used - maybe by allocating
  more slots till we use all allowed bandwidth. Better having to throttle
  a transfer than not using bandwidth.
 
  *NLM should with the current network be slower than OLM by 23%. But in
  18
  months it should actually be faster by ~8% — given Moores Law holds for
  upload bandwidth — because the routes are shorter.*
 
  The main advantage of NLM is, that it should be much more resilient
  against attackers (DoS).
 
  Now to the log - it’s math and not cleaned up; you have been warned :)
 
  ArneBab SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf.
  ArneBab queue-time: q, transfer-time: τ, hops remaining: h, total
  hops: h₀, w: success rate
  ArneBab ψs = τ(h) + q(h)
  ArneBab ψf = q(h)
  ArneBab ψ ~ w₁·ψs + (1-w₁)·ψf
  ArneBab σs = τ(h) + q(h)
  ArneBab σf = q(h)
  ArneBab σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
  ArneBab num(ψ) / num(σ) ~ 1
  ArneBab → time ~ σ + ψ
  ArneBab q(h) depends on timeouts, as do w₁ and w₂
  ArneBab time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + (1-w₂)·ψf
  ArneBab = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-
  w₂)·q(h)
  ArneBab = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
  ArneBab = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
  ArneBab in the congestion case q(h) ~ timeout
  ArneBab timeout = o
  ArneBab timeout: o
  ArneBab w depends on the timeout *somehow*, but inversely
  ArneBab o=0 → w=0
  ArneBab assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
  ArneBab assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
  ArneBab correction: in the congestion case: q(h) ~ min(timeout, τ(h))
  ArneBab timeout matters for q(h) only when timeout  τ(h)
  ArneBab I try to: I still need a dependency of w on timeout
  ArneBab … lets call it t(w)
  ArneBab better: w(o) :)
  toad_ well, if there is a timeout, we have a fixed time, but we reduce
  

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:
 After discussing and going deeper into this, it became apparent that the 
 problem is not overload of queues. I’ll repeat the result I came to here (and 
 not in chatlog form :) ): 
 
 The problem is in the load limiter: 
 
 
 1) we need to use all bandwidth, because latency depends on the number of  
   finishing transfers.
 
 2) SSKs fail 85% of the time.
 
 3) the load limiter assumes that all SSKs succeed and reserves bandwidth for 
   them.
 
 4) this now bites us, because NLM has shorter transfer times and longer wait 
   times. So the over-reservation of bandwidth for SSKs lasts longer. 
 
 5) solution: count each SSK as only 
   average_SSK_success_rate * data_to_transfer_on_success.
 
 
 To (1): When we have less bandwidth, than less CHKs succeed per second, so 
 less slots get freed and the queueing time it bigger.

Another way to look at this:

Both Ian and I have been assuming that the resource is some number 
representing a property of the node, for example, how many requests are 
running. But the real resource is the external bandwidth usage, which is rather 
different.

An alternative solution - if we are worried about bursts of requests succeeding 
and therefore timing out - is to allocate a fixed chunk of bandwidth for SSKs, 
and not let them interfere with CHKs. Then CHKs can have long transfer times, 
and therefore long queueing times when they fail (but this is only ~ 40% of the 
time with NLM's accurate routing), and SSKs can have short transfer times, and 
therefore short queueing times. This may be less disruptive and should also be 
relatively easy to implement.
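
Sketched out (invented names, not existing code), the split could be as simple
as two independent liability pools:

/** Hypothetical sketch: split the output-bandwidth liability into two
 *  independent pools so SSK and CHK requests cannot interfere. */
public class SplitBandwidthLiability {
    private final BandwidthPool chkPool;
    private final BandwidthPool sskPool;

    public SplitBandwidthLiability(double totalBytes, double sskFraction) {
        this.sskPool = new BandwidthPool(totalBytes * sskFraction);
        this.chkPool = new BandwidthPool(totalBytes * (1.0 - sskFraction));
    }

    public boolean tryAcceptChk(int bytes) { return chkPool.tryReserve(bytes); }
    public boolean tryAcceptSsk(int bytes) { return sskPool.tryReserve(bytes); }

    public void chkFinished(int bytes) { chkPool.release(bytes); }
    public void sskFinished(int bytes) { sskPool.release(bytes); }

    /** A single pool: reserve on accept, release on completion or timeout. */
    static final class BandwidthPool {
        private final double limit;
        private double reserved = 0.0;

        BandwidthPool(double limit) { this.limit = limit; }

        synchronized boolean tryReserve(double bytes) {
            if (reserved + bytes > limit) return false;
            reserved += bytes;
            return true;
        }

        synchronized void release(double bytes) {
            reserved = Math.max(0.0, reserved - bytes);
        }
    }
}

The sskFraction would itself be a tunable, but slow CHK transfers could never
starve SSKs and vice versa.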

But the other question is, can queueing ever be helpful? It can if it allows us 
to route more accurately (which NLM clearly does), and/or to run enough 
requests in parallel that the longer time taken for the request to reach its 
destination is offset. Is this condition met?
 
 Best wishes, 
 Arne
 
 
 
 Am Montag, 29. August 2011, 22:27:47 schrieb Matthew Toseland:
  Okay, I don't understand most of that, I might be able to check the math if
  it was written properly, but it looks difficult. However, as far as I can
  see: - The most obvious way to increase bandwidth usage would be to
  increase the timeout time for output bandwidth liability (and at the same
  time increase the relevant block transfer timeouts). - This would increase
  the number of slots but it would also increase the number of requests
  seeking them; I don't see why it would help matters. - Running an excessive
  number of requests without adjusting the block transfer timeouts would
  result in some of the transfers timing out. - It would also, as you
  mention, make SSK requests (especially failed SSK requests) even slower. -
  I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. -
  As far as I can see, all the benefits of NLM re attackers are achieved by
  fair sharing. - A system which is more resistant to attacks but slower
  probably isn't all that interesting if the attacks in question are
  relatively expensive anyway. - Smaller block sizes would have a significant
  efficiency cost, and would probably make load management more difficult.
  
  I apologise if you see this as rather a broadside after I encouraged you to
  analyse the problem, but it is not yet a convincing demonstration that
  queueing is a viable strategy and not the spawn of satan! :)
  On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
   Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke:
Yes, small tweaks have worked so well for us for the last decade,
leaving us pretty-much where we were in 2003.  No, we don't
understand how the current system works, there is no point in
trying to fix something when we don't even know what is broken.
   
   I’d like to present a clue what is broken in NLM. Before I kill you with
   the 
   log, here’s the result:
 With NLM the latency of a request is a function of the raw bandwidth
 (not so with OLM), and NLM used only half my bandwidth after it had
 been deployed for 2 days (at the start much more).
   
   τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue
   time (time to find the node), nlm: new load management, olm: old load
   management.
   
   So first step: make sure all bandwidth gets used - maybe by allocating
   more slots till we use all allowed bandwidth. Better having to throttle
   a transfer than not using bandwidth.
   
   *NLM should with the current network be slower than OLM by 23%. But in
   18
   months it should actually be faster by ~8% — given Moores Law holds for
   upload bandwidth — because the routes are shorter.*
   
   The main advantage of NLM is, that it should be much more resilient
   against attackers (DoS).
   
   Now to the log - it’s math and not cleaned up; you have been warned :)
   
   ArneBab SSK-time: σ, CHK-time: ψ, success: Xs, 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Matthew Toseland
On Tuesday 30 Aug 2011 00:18:35 Matthew Toseland wrote:
 On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:
  After discussing and going deeper into this, it became apparent that the 
  problem is not overload of queues. I’ll repeat the result I came to here 
  (and 
  not in chatlog form :) ): 
  
  The problem is in the load limiter: 
  
  
  1) we need to use all bandwidth, because latency depends on the number of  
  finishing transfers.
  
  2) SSKs fail 85% of the time.
  
  3) the load limiter assumes that all SSKs succeed and reserves bandwidth 
  for 
  them.
  
  4) this now bites us, because NLM has shorter transfer times and longer 
  wait 
  times. So the over-reservation of bandwidth for SSKs lasts longer. 
  
  5) solution: count each SSK as only 
  average_SSK_success_rate * data_to_transfer_on_success.
  
  
  To (1): When we have less bandwidth, than less CHKs succeed per second, so 
  less slots get freed and the queueing time it bigger.
 
 Another way to look at this:
 
 Both me and Ian have been assuming that the resource is some number 
 representing a property of the node, for example, how many requests are 
 running. But the real resource is the external bandwidth usage, which is 
 rather different.
 
 An alternative solution - if we are worried about bursts of requests 
 succeeding and therefore timing out - is to allocate a fixed chunk of 
 bandwidth for SSKs, and not let them interfere with CHKs. Then CHKs can have 
 long transfer times, and therefore long queueing times when they fail (but 
 this is only ~ 40% of the time with NLM's accurate routing), and SSKs can 
 have short transfer times, and therefore short queueing times. This may be 
 less disruptive and should also be relatively easy to implement.
 
 But the other question is, can queueing ever be helpful? It can if it allows 
 us to route more accurately (which NLM clearly does), and/or to run enough 
 requests in parallel that the longer time taken for the request to reach its 
 destination is offset. Is this condition met?

Yes it is!

If a request takes P seconds without queueing, and P+Q seconds with queueing, 
then if we have the same number of requests running at once, we have a slowdown 
of (P+Q)/P. However, if we start (P+Q)/P times more requests, and the queueing 
time remains constant, we can compensate for this. 

Does it? 

There are two main components:
1) The queueing time. This is roughly proportional to (number of requests 
running) / (number of slots available). If we increase the capacity of the 
node, we multiply both the top and bottom by the same factor - so the queueing 
time is unchanged, unless the data transfer time is changed.
2) The data transfer time. In the above case with a slowdown, the data transfer 
time is the same, or a little shorter, because most of our bandwidth isn't 
being used. If we increase it by running more requests at once, it should reach 
where it was, and if we increase it further, transfers will start to take 
longer again.

ArneBab has analysed this in more detail and shown that the largest component 
of the queueing time is proportional to the data transfer time, hence, 
separating SSKs (which have a very short transfer time) from CHKs (which have a 
much longer transfer time) would help significantly.

The key advantage of NLM, of course, is that it routes accurately, resulting in 
10%-30% improvements in success rates, and transferring data over significantly 
fewer hops (= more effective bandwidth).
  
  Best wishes, 
  Arne
  
  
  
   On Monday, 29 August 2011, 22:27:47, Matthew Toseland wrote:
    Okay, I don't understand most of that, I might be able to check the math
    if it was written properly, but it looks difficult. However, as far as I
    can see:
    - The most obvious way to increase bandwidth usage would be to increase
      the timeout time for output bandwidth liability (and at the same time
      increase the relevant block transfer timeouts).
    - This would increase the number of slots but it would also increase the
      number of requests seeking them; I don't see why it would help matters.
    - Running an excessive number of requests without adjusting the block
      transfer timeouts would result in some of the transfers timing out.
    - It would also, as you mention, make SSK requests (especially failed SSK
      requests) even slower.
    - I am quite confident that Moore's Law DOES NOT hold for upload bandwidth.
    - As far as I can see, all the benefits of NLM re attackers are achieved
      by fair sharing.
    - A system which is more resistant to attacks but slower probably isn't
      all that interesting if the attacks in question are relatively expensive
      anyway.
    - Smaller block sizes would have a significant efficiency cost, and would
      probably make load management more difficult.
   
   I apologise if you see this as rather a broadside after I encouraged you 
   to
   analyse the problem, but it is not yet a 

Re: [freenet-dev] Beyond New Load Management: A proposal

2011-08-29 Thread Arne Bab.
From: Matthew Toseland t...@amphibian.dyndns.org   
But the other question is, can queueing ever be helpful? It can if it allows 
us to route more accurately (which NLM clearly does), and/or to run enough 
requests in parallel that the longer time taken for the request to reach its 
destination is offset. Is this condition met?

Experience with the deployed NLM showed that even in the fully congested case 
it had success rates of 60% for HTL 18, 17 and 16, compared to less than 40% for 
OLM. This means that the requests are sent over fewer hops on average, because 
they find the content fewer hops away from the requester.

A download of 1MiB which is sent over 2 hops needs 2 MiB in total network 
bandwidth.
If it is sent over only 1.5 hops on average, then it needs only 1.5 MiB total 
network bandwidth.

So essentially NLM can distribute 30% more content with the same network 
resources¹. And these numbers are actual observations. The only reason why this 
did not result in increased performance is that the nodes used less than 50% of 
their allocated bandwidth² - which is a problem with the bandwidth scheduler 
and not with queueing.
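
(A worked check with the same illustrative numbers: for a fixed total upload 
budget B, OLM delivers B/2 MiB of payload, since each delivered MiB costs 2 MiB 
of traffic, while NLM delivers B/1.5 MiB, a factor of 2/1.5, roughly 1.33, in 
line with the ~30% figure above.)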

Best wishes,
Arne

¹: The relevant network resource is upload bandwidth.
²: Source: observations from me and two other freenet users.

PS: How exactly the bandwidth limiter is fixed is an implementation detail. I 
think you are actually the only person who can judge how to do this most 
efficiently.


[freenet-dev] Beyond New Load Management

2011-08-27 Thread Matthew Toseland
On Saturday 27 Aug 2011 15:43:53 Matthew Toseland wrote:
> After trying out New Load Management on the network and seeing rather bad 
> results, we need to reconsider load management. IMHO Old Load Management (the 
> current system) is still not an acceptable answer.
> 
> Ideal load management would:
> - PERFORMANCE: Performance is the number of requests running in parallel, 
> divided by the average time taken. This means latency should only be 
> increased if we increase the number of requests running in parallel by a 
> similar factor. We want to achieve close to the ideal for *both* successful 
> and unsuccessful requests. Our capacity for unsuccessful requests is huge, 
> but we don't know whether they are going to succeed when we send them, which 
> creates some problems.
> - LATENCY: Not increase latency significantly for realtime requests. These 
> must respond quickly, even at the expense of somewhat poorer routing.
> - REACTIVITY: React quickly to variation in available resources (e.g. some 
> requests completing quickly), but not overshoot.
> - ACCURACY: Route accurately, severely limiting misrouting, but avoid 
> excessively slow nodes.
> - INCENTIVES: Be incentives-compatible and secure: Ideally, the originator 
> should not be special.
> - DOS: Make DoS attacks hard, dependent on the number of connections you have 
> into the network (hence very hard on darknet and hopefully similar difficulty 
> to global surveillance on opennet).
> 
> NEW LOAD MANAGEMENT:
> 
> So far, NLM seems to have problems. Ian argues these are largely due to 
> queueing making a bad situation worse - queueing causes requests to take 
> longer, the only way to counteract this is to run more requests in parallel, 
> but that ALSO causes queueing to take longer...
> 
> Performance: So far poor
> Latency: Very poor
> Reactivity: Should be reasonable, not clear
> Accuracy: Very good
> Incentives: Good, the originator is not special
> DoS: Good, thanks mainly to fair sharing (see below)
> 
> OLD LOAD MANAGEMENT:
> 
> AIMDs on the originator based on RejectedOverload's, which are generated when 
> a request is rejected, and passed all the way back to the originator. When 
> the RejectedOverload is originally generated, the peer in question gets 
> backed off.
> 
> Performance: Moderate
> Latency: Good
> Reactivity: Poor
> Accuracy: Poor (all the backoffs)
> Incentives: Poor (the originator is special; sending loads of requests can 
> improve performance at the cost of the network)
> DoS: Poor (nothing to stop you ignoring AIMDs, sending loads of requests and 
> causing lots of backoffs)
> 
> OLD LOAD MANAGEMENT + FAIR SHARING:
> 
> Fair sharing between peers greatly reduces our vulnerability to DoS, and 
> improves performance on nodes with relatively few peers. However, current 
> fair sharing includes an abrupt transition which can cause backoffs, and 
> makes the next item harder. This should be fixed soon.
> 
> Incentives: Better but the originator is still special
> DoS: Moderate
> 
> IMPROVED AIMD'S:
> 
> We can make the AIMD's be a true request count window rather than a rate 
> estimator. This should respond faster to variations in retrieval times (e.g. 
> getting a bunch of offered keys). We should probably not multiply it by the 
> number of peers as we do now, and we should probably consider how sensitive 
> we want the AIMD's to be (I'm not sure how we would easily calibrate that).
> 
> Performance: Should be improved a bit.
> Reactivity: Definitely improved a bit.
> 
> EARLY "SLOW DOWN" MESSAGES:
> 
> Ian has proposed that we send some sort of "slow down" message when we are 
> over some load threshold but still able to accept requests. This could be 
> implemented by sending a non-local RejectedOverload (i.e. pretending we are 
> relaying it), but ideally we'd like to have two distinct messages so we can 
> tell what proportion of requests receive each. One problem is that, given there 
> is a time lag involved, we could get oscillations and still see too many 
> rejections.

One difficulty here is whether the threshold should be on the per-peer limit (fair 
sharing) or just on the total load. If it's on the per-peer limit the limit is 
often going to be rather small...
> 
> Performance: Should be improved
> Accuracy: Significantly better
> DoS: Unaffected but need to figure out how to deal with peers that 
> consistently send slow-down messages.
> 
> AIMD'S ON EACH NODE INCLUDING REMOTE REQUESTS:
> 
> I propose that we could keep a rate estimation on each node, and use it not 
> only for our requests but for all requests, based on whether requests 
> complete with or without a slow-down message. This would determine an upper 
> and lower bound. Above the upper bound, we would reject requests. Above the 
> lower bound, we would send slow-down messages, but still accept requests; 
> below the lower bound, we would accept requests. This would also determine 
> when we start local requests. The main risk here is that we could 

[freenet-dev] Beyond New Load Management

2011-08-27 Thread Matthew Toseland
After trying out New Load Management on the network and seeing rather bad 
results, we need to reconsider load management. IMHO Old Load Management (the 
current system) is still not an acceptable answer.

Ideal load management would:
- PERFORMANCE: Performance is the number of requests running in parallel, 
divided by the average time taken. This means latency should only be increased 
if we increase the number of requests running in parallel by a similar factor. 
We want to achieve close to the ideal for *both* successful and unsuccessful 
requests. Our capacity for unsuccessful requests is huge, but we don't know 
whether they are going to succeed when we send them, which creates some 
problems.
- LATENCY: Not increase latency significantly for realtime requests. These must 
respond quickly, even at the expense of somewhat poorer routing.
- REACTIVITY: React quickly to variation in available resources (e.g. some 
requests completing quickly), but not overshoot.
- ACCURACY: Route accurately, severely limiting misrouting, but avoid 
excessively slow nodes.
- INCENTIVES: Be incentives-compatible and secure: Ideally, the originator 
should not be special.
- DOS: Make DoS attacks hard, dependent on the number of connections you have 
into the network (hence very hard on darknet and hopefully similar difficulty 
to global surveillance on opennet).

NEW LOAD MANAGEMENT:

So far, NLM seems to have problems. Ian argues these are largely due to 
queueing making a bad situation worse - queueing causes requests to take 
longer, the only way to counteract this is to run more requests in parallel, 
but that ALSO causes queueing to take longer...

Performance: So far poor
Latency: Very poor
Reactivity: Should be reasonable, not clear
Accuracy: Very good
Incentives: Good, the originator is not special
DoS: Good, thanks mainly to fair sharing (see below)

OLD LOAD MANAGEMENT:

AIMDs on the originator based on RejectedOverload's, which are generated when a 
request is rejected, and passed all the way back to the originator. When the 
RejectedOverload is originally generated, the peer in question gets backed off.

Performance: Moderate
Latency: Good
Reactivity: Poor
Accuracy: Poor (all the backoffs)
Incentives: Poor (the originator is special; sending loads of requests can 
improve performance at the cost of the network)
DoS: Poor (nothing to stop you ignoring AIMDs, sending loads of requests and 
causing lots of backoffs)

OLD LOAD MANAGEMENT + FAIR SHARING:

Fair sharing between peers greatly reduces our vulnerability to DoS, and 
improves performance on nodes with relatively few peers. However, current fair 
sharing includes an abrupt transition which can cause backoffs, and makes the 
next item harder. This should be fixed soon.

Incentives: Better but the originator is still special
DoS: Moderate

IMPROVED AIMD'S:

We can make the AIMD's be a true request count window rather than a rate 
estimator. This should respond faster to variations in retrieval times (e.g. 
getting a bunch of offered keys). We should probably not multiply it by the 
number of peers as we do now, and we should probably consider how sensitive we 
want the AIMD's to be (I'm not sure how we would easily calibrate that).

Performance: Should be improved a bit.
Reactivity: Definitely improved a bit.
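
A minimal sketch of such a request-count window (the constants and names are 
assumptions for illustration, not the existing AIMD code):

    // Illustrative sketch: a true request-count window with AIMD adjustment,
    // instead of a rate estimator. The window is how many requests may be in
    // flight at once; completions without a congestion signal grow it
    // additively, rejections / slow-down messages shrink it multiplicatively.
    class RequestCountWindow {
        private double window = 8;   // assumed starting window
        private int inFlight = 0;

        synchronized boolean tryStartRequest() {
            if (inFlight >= (int) window) return false;
            inFlight++;
            return true;
        }

        synchronized void onCompleted(boolean congestionSignalled) {
            inFlight--;
            if (congestionSignalled) {
                window = Math.max(1.0, window / 2);   // multiplicative decrease
            } else {
                window += 1.0 / window;               // additive increase per window
            }
        }
    }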

EARLY "SLOW DOWN" MESSAGES:

Ian has proposed that we send some sort of "slow down" message when we are over 
some load threshold but still able to accept requests. This could be 
implemented by sending a non-local RejectedOverload (i.e. pretending we are 
relaying it), but ideally we'd like to have two distinct messages so we can 
tell what proportion of requests receive each. One problem is that, given there is 
a time lag involved, we could get oscillations and still see too many rejections.

Performance: Should be improved
Accuracy: Significantly better
DoS: Unaffected but need to figure out how to deal with peers that consistently 
send slow-down messages.

AIMD'S ON EACH NODE INCLUDING REMOTE REQUESTS:

I propose that we could keep a rate estimation on each node, and use it not 
only for our requests but for all requests, based on whether requests complete 
with or without a slow-down message. This would determine an upper and lower 
bound. Above the upper bound, we would reject requests. Above the lower bound, 
we would send slow-down messages, but still accept requests; below the lower 
bound, we would accept requests. This would also determine when we start local 
requests. The main risk here is that we could get some sort of feedback loop 
situation. It probably would need to be simulated and we might need to find a 
better algorithm, or tune the existing one, for calculating the rate. Also, 
like with fair sharing, there is a time lag involved in telling peers to slow 
down.

Performance: Unclear, should be similar if not better
Reactivity: Should be reasonable, possibly better than NLM as it should be able 
to deal with bottlenecks better
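
A hedged sketch of the upper/lower bound idea above (the thresholds, names and 
decision values are illustrative assumptions):

    // Illustrative sketch of the two-threshold scheme: below the lower bound
    // accept quietly, between the bounds accept but send a slow-down message,
    // above the upper bound reject the request.
    class TwoThresholdLoadGate {
        enum Decision { ACCEPT, ACCEPT_WITH_SLOW_DOWN, REJECT }

        private final double lowerBound;  // e.g. 70% of the estimated rate
        private final double upperBound;  // e.g. 90% of the estimated rate

        TwoThresholdLoadGate(double lowerBound, double upperBound) {
            this.lowerBound = lowerBound;
            this.upperBound = upperBound;
        }

        Decision decide(double currentLoad) {
            if (currentLoad >= upperBound) return Decision.REJECT;
            if (currentLoad >= lowerBound) return Decision.ACCEPT_WITH_SLOW_DOWN;
            return Decision.ACCEPT;
        }
    }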

[freenet-dev] Beyond New Load Management

2011-08-27 Thread Ian Clarke
Matthew,

This makes sense from the perspective of making incremental changes, but I
think we need to be more drastic than that.  I think we need to go back to
the drawing board with load management.  We need to find a solution that is
simple enough to reason about, and to debug if we have problems with it.

The entire approach of coming up with hypotheses about what is wrong,
building a solution based on these hypotheses (without actually confirming
that the hypotheses are accurate) and deploying it is deja vu, we've been
doing it for a decade, and we still haven't got load management right.
 We're just layering more complexity onto a system that we already don't
understand, based on guesses as to what the problems were with the previous
iteration that we can't test because the system is too complicated with too
many interactions for anyone to get their heads around it.

If something isn't working, and we don't understand for definite what is
wrong with it, then we shouldn't build more on top of it in the hope that
we'll accidentally solve the problem, we should replace it with something
simple enough that we do understand it, right?

The purpose of load management is relatively simple: *Don't allow clients to
pump more requests into the network than the network can handle, while
ensuring that this workload is distributed across the network in a
reasonably efficient manner.  This must be done in a decentralized way that
is not vulnerable to abuse.*
*
*
The current load management system includes many different interacting
components that make it nearly impossible to understand or debug.  I think
we need to go back to the drawing board with load management, starting from
the goal I cite above.

I would invite people to suggest the simplest possible load management
schemes that might work, let's then discuss them, and figure out which is
most likely to work, and if it doesn't work, which will be easiest to debug.

We can bear in mind a few lessons we've learned though.  What
characteristics should our load balancing system have?

   - Dropping requests should be extremely rare because this
   just exacerbates overloading
   - Delaying requests should also be extremely rare for the same reason
   - Misrouting requests should be limited, but perhaps acceptable
   occasionally, for the same reason

It therefore really comes down to nodes anticipating whether the network is
in danger of having to drop, delay, or misroute requests, and reducing the
rate at which they are pumping requests into the network if this danger
point is reached.

So how do nodes inform each other that they are at risk of having to drop,
delay, or misroute requests?  There are two reasons this might happen.
 Either a node itself is close to its own capacity for relaying requests, or
the other nodes it relays to are themselves at or close to capacity.

One problem I'm concerned about when nodes share information about how
overloaded their peers are is a "gridlock":

What if all nodes told other nodes that they were overloaded because all
their peers are overloaded?  Such a situation would basically cause the
entire network to think it was overloaded, even though nobody actually was!
 It becomes a bit like this cartoon: http://flic.kr/p/5npfm2  How can this
be avoided?

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: ian at freenetproject.org

[freenet-dev] Beyond New Load Management

2011-08-27 Thread freenet.10.technomat...@recursor.net
I'm glad to see that the subject of complexity has come up, and if I
can speak to it in a more general way...

Complexity is insidious - you start with a simple idea, the creativity
flows and over time, you are wedded to a highly coupled, inflexible
and obscure beast that is hard to distance yourself from and see for
what it is - due for a rewrite and an infusion of new and current
tools and techniques.

With current tools and understanding way ahead of where they were a
decade ago, perhaps it is time to distill the knowledge gained so far
and pour it into a more modular build based on test driven
development, where the potentially highly valuable learning that
Freenet can offer could also potentially be reused by other projects,
if the want exists, for example to release Freenet from the server/pc
realm and out to be used in the current decade of mobile and tablet
devices.

With a robust framework for modeling real-world activity that can be
directed at portions of the Freenet app, such as load management, used
to drive demonstrable unit, functional and performance testing, the
volatility of the app can be reduced, as would the need for the
incessant, off-schedule mode of releasing updates. I am sure that a
more measured and planned mode of building out the app, as is seen in
the well-regarded projects from the Apache Foundation, for example,
would reduce the pain and stress for developer and user alike, and
extend the life and impact of the project.

With the utmost regard for everyone who contributes to the project,

Sebastian Weetabix.



