[freenet-dev] Performance with almost disconnected darknets

2009-01-05 Thread vive
On Mon, Jan 05, 2009 at 07:19:35PM +, Matthew Toseland wrote:
> On Monday 05 January 2009 14:27, vive wrote:

> There are quite a few places on the wiki where parts of the design are 
> documented ... these are usually way out of date.
> > 
> > One could of course read the code ... but it may also be hard to get the high-level description right away from the many details of lower-level code. :-)
> > Is it possible that someone who has insight into the details of the heuristics could write a short description?
> 
> The sinking strategy is simply that we cache in the store only when an insert 
> is on a node which is closer to the target than any of its peers. Do you want 
> more details on the routing heuristics? Or to start a more general debate on 
> how much detailed internal documentation we should make, where we should put 
> it etc?

I have understood how sinking works, but it would be really great to see the other major routing and storage heuristics documented! I am not the guy to analyze what level of (overall) detail is best for the whole project. But these things have to go into the simulation sooner or later to make it more realistic.

/v.



[freenet-dev] Performance with almost disconnected darknets

2009-01-05 Thread Matthew Toseland
On Monday 05 January 2009 14:27, vive wrote:
> On Tue, Dec 23, 2008 at 12:48:04PM +, Matthew Toseland wrote:
> > On Tuesday 23 December 2008 03:14, vive wrote:
> > > Simulation:
> > > 0. Load the two networks, each of size 4000. All nodes are set to be pure darknet.
> > > 1. Connect the networks by joining pairs of random nodes, where one node is picked from each network. I tried this with a cut of 10,100,1000,1 links.
> > > 2. Nodes randomize their positions and swap for a while to stabilize the locations (successful swaps get more uncommon after this) to start from a "stable" state
> > > 3. 1000 keys are inserted by random nodes into the network (both sink and cache). Nodes have a store of 50 keys and a cache of 50 keys. Both are LRU as normal.
> > 
> > Actually we use pseudo-random replacement now. This probably doesn't strongly affect the results.
> 
> Is this documented? I was unable to find it. It would be great to have a place on the wiki where the heuristics related to (1) routing and (2) caching/storing are documented. Maybe at least a short description of each. This way, simulations can be made more realistic and design decisions clearer. Right now the simulations do routing with backtracking as well as storing/caching according to the sinking strategy (store) and LRU (both store and cache).

There are quite a few places on the wiki where parts of the design are 
documented ... these are usually way out of date.
> 
> One could of course read the code ... but it may also be hard to get the high-level description right away from the many details of lower-level code. :-)
> Is it possible that someone who has insight into the details of the heuristics could write a short description?

The sinking strategy is simply that we cache in the store only when an insert 
is on a node which is closer to the target than any of its peers. Do you want 
more details on the routing heuristics? Or to start a more general debate on 
how much detailed internal documentation we should make, where we should put 
it etc?
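
To make that concrete, here is a minimal sketch in Python of the sink check as just described (hypothetical names, not the actual fred code; locations are assumed to live on a circular [0,1) keyspace):

    def circular_distance(a, b):
        # Locations live on a circle [0, 1); distance wraps around.
        d = abs(a - b)
        return min(d, 1.0 - d)

    def should_sink(node, key_location):
        # Store-on-insert ("sink") check: keep the key in the store only if
        # this node is closer to the key's location than every one of its
        # peers. Caching along the path is handled separately.
        my_dist = circular_distance(node.location, key_location)
        return all(circular_distance(peer.location, key_location) > my_dist
                   for peer in node.peers)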
> 
> Cheers,
> /vive



[freenet-dev] Performance with almost disconnected darknets

2009-01-05 Thread Matthew Toseland
On Monday 05 January 2009 14:17, vive wrote:
> On Tue, Dec 23, 2008 at 12:48:04PM +, Matthew Toseland wrote:
> > > 2. Data availability: the success rate and the number of steps in successful routes (here: the roundtrip steps) that is seen in simulation.
> > > 
> > > The simulation: this was made by joining both ideal (generated) social networks as well as networks from actual datasets on social networks (the pgp WoT). Both are navigable networks, although slightly differing. It is hoped that this can correspond roughly to the scenario above, by joining these networks with different numbers of random connections.
> > > 
> > > Simulation:
> > > 0. Load the two networks, each of size 4000. All nodes are set to be pure darknet.
> > > 1. Connect the networks by joining pairs of random nodes, where one node is picked from each network. I tried this with a cut of 10,100,1000,1 links.
> > > 2. Nodes randomize their positions and swap for a while to stabilize the locations (successful swaps get more uncommon after this) to start from a "stable" state
> > > 3. 1000 keys are inserted by random nodes into the network (both sink and cache). Nodes have a store of 50 keys and a cache of 50 keys. Both are LRU as normal.
> > 
> > Actually we use pseudo-random replacement now. This probably doesn't strongly affect the results.
> > 
> > > 4. The main phase: nodes swap positions when they can, and request data for a longer interval. (At the moment no new inserts nor churn; this can be the next thing to simulate, but it needs data input, see below.) The goal is to see what the data availability is, and how the positions have been affected after some time. Nodes request data via keys that have different popularity: a standard Zipf distribution is used over the keys. The parameter is also varied in the simulations. I have typically let the 4000 nodes do in total 10 requests and evaluated how that went. It could clearly be evaluated over a longer time, but how long is reasonable?
> > > 5. Evaluate the two measures above (variation distance always starts close to 0 when one initially randomizes)
> > > 
> > The real world case most interesting to me is that of two darknets slowly merging. Does this result in a severe drop in performance in the short run? Hence, an important measurement is the success rate for requests made on the same side of the border as the data was inserted on. Is this dramatically lower with a smaller cut? How does it compare to that darknet running on its own?
> 
> Hi!
> 
> I did some simulations of this. See the attached plots. The model here is as described previously, but it also tracks which part of the network data and keys were originally inserted in. Routes where nodes query for keys originally inserted in their own part of the network are labelled "short routes"; routes for keys inserted in the other part are labelled "long routes".

This is great: it demonstrates with a reasonable degree of confidence that two darknets approaching each other and slowly merging will not have their local routing trashed, so we do not need any special handling for separate subnetworks. And, provided we assume the current model of just doing lots of requests, the most popular data from the other side will generally become available eventually, and this improves dramatically as more links are established.
> 
> The impact of this varies depending on the size of the cut and how popular
> certain keys are. The plots illustrate both these dependencies.
> 
> The simulations show at least two things regarding the previous fear. First, connecting two darknets loosely does not seem to break routing that much within a darknet, but it does shield the nodes from data inserted in the other network. (That data is at least slowly made available again. This is not very surprising, but here the simulations show some impact.) Second, real-world networks seem to be slightly better (than fair models where all nodes get the same number of connections) at making data available via long routes. This is seen by comparing the two plots where the zipf parameter is varied.
> The last effect is probably due to a different degree distribution than in the ideal model (with such a distribution, a few high-degree nodes naturally see more queries passing through, quickly making remote data available for more nodes in their own network).

It seems to be a lot more than slightly if I'm reading the plots right. Small 
world with fixed degree seems to be a lot slower than the somewhat-scale-free 
small world networks that real-world data tends to produce. But it's a 
tradeoff between performance and vulnerability ... ignoring the question of 
the capacity of well-connected nodes...
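
For readers who want to reproduce the "ideal" fixed-degree case: the plot name Kleinberg_cut100_varzipf.png suggests a Kleinberg-style navigable small world. A rough Python sketch of one common construction follows; vive's actual generator is not described in the thread, so the parameters and details here are assumptions.

    import random

    def kleinberg_ring(n, local=2, long_links=3):
        # Nodes 0..n-1 on a ring, each with `local` neighbours on either side
        # plus `long_links` shortcuts drawn with probability ~ 1/distance
        # (Kleinberg's harmonic distribution), giving a roughly fixed degree.
        edges = set()
        for i in range(n):
            for k in range(1, local + 1):
                edges.add(frozenset((i, (i + k) % n)))
        distances = list(range(1, n // 2 + 1))
        weights = [1.0 / d for d in distances]
        for i in range(n):
            for _ in range(long_links):
                d = random.choices(distances, weights=weights, k=1)[0]
                j = (i + d * random.choice((-1, 1))) % n
                if j != i:
                    edges.add(frozenset((i, j)))
        return edges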

[freenet-dev] Performance with almost disconnected darknets

2009-01-05 Thread vive
On Tue, Dec 23, 2008 at 12:48:04PM +, Matthew Toseland wrote:
> On Tuesday 23 December 2008 03:14, vive wrote:
> > Simulation:
> > 0. Load the two networks, each of size 4000. All nodes are set to be pure darknet.
> > 1. Connect the networks by joining pairs of random nodes, where one node is picked from each network. I tried this with a cut of 10,100,1000,1 links.
> > 2. Nodes randomize their positions and swap for a while to stabilize the locations (successful swaps get more uncommon after this) to start from a "stable" state
> > 3. 1000 keys are inserted by random nodes into the network (both sink and cache). Nodes have a store of 50 keys and a cache of 50 keys. Both are LRU as normal.
> 
> Actually we use pseudo-random replacement now. This probably doesn't strongly 
> affect the results.

Is this documented? I was unable to find it. It would be great to have a place on the wiki where the heuristics related to (1) routing and (2) caching/storing are documented. Maybe at least a short description of each. This way, simulations can be made more realistic and design decisions clearer. Right now the simulations do routing with backtracking as well as storing/caching according to the sinking strategy (store) and LRU (both store and cache).
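
For reference, the routing referred to here can be sketched roughly as follows in Python (hypothetical names; greedy forwarding towards the key with backtracking at dead ends, bounded by a hop limit, not necessarily identical to the simulator's code):

    def circular_distance(a, b):
        d = abs(a - b)
        return min(d, 1.0 - d)

    def route(source, key_location, max_hops=20):
        # Greedy routing with backtracking: always forward to the unvisited
        # peer closest to the key's location; step back when a node has no
        # unvisited peers left. Returns the node that answered, or None.
        visited = {source}
        path = [source]
        hops = 0
        while path and hops < max_hops:
            current = path[-1]
            if current.has_key(key_location):
                return current
            candidates = [p for p in current.peers if p not in visited]
            if not candidates:
                path.pop()      # dead end: backtrack
                continue
            nxt = min(candidates,
                      key=lambda p: circular_distance(p.location, key_location))
            visited.add(nxt)
            path.append(nxt)
            hops += 1
        return None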

One could of course read the code ... but it may also be hard to get the high-level description right away from the many details of lower-level code. :-)
Is it possible that someone who has insight into the details of the heuristics could write a short description?

Cheers,
/vive



[freenet-dev] Performance with almost disconnected darknets

2009-01-05 Thread vive
On Tue, Dec 23, 2008 at 12:48:04PM +, Matthew Toseland wrote:
> > 2. Data availability: the success rate and the number of steps in successful routes (here: the roundtrip steps) that is seen in simulation.
> > 
> > The simulation: this was made by joining both ideal (generated) social networks as well as networks from actual datasets on social networks (the pgp WoT). Both are navigable networks, although slightly differing. It is hoped that this can correspond roughly to the scenario above, by joining these networks with different numbers of random connections.
> > 
> > Simulation:
> > 0. Load the two networks, each of size 4000. All nodes are set to be pure darknet.
> > 1. Connect the networks by joining pairs of random nodes, where one node is picked from each network. I tried this with a cut of 10,100,1000,1 links.
> > 2. Nodes randomize their positions and swap for a while to stabilize the locations (successful swaps get more uncommon after this) to start from a "stable" state
> > 3. 1000 keys are inserted by random nodes into the network (both sink and cache). Nodes have a store of 50 keys and a cache of 50 keys. Both are LRU as normal.
> 
> Actually we use pseudo-random replacement now. This probably doesn't strongly affect the results.
> 
> > 4. The main phase: nodes swap positions when they can, and request data for a longer interval. (At the moment no new inserts nor churn; this can be the next thing to simulate, but it needs data input, see below.) The goal is to see what the data availability is, and how the positions have been affected after some time. Nodes request data via keys that have different popularity: a standard Zipf distribution is used over the keys. The parameter is also varied in the simulations. I have typically let the 4000 nodes do in total 10 requests and evaluated how that went. It could clearly be evaluated over a longer time, but how long is reasonable?
> > 5. Evaluate the two measures above (variation distance always starts close to 0 when one initially randomizes)
> > 
> The real world case most interesting to me is that of two darknets slowly 
> merging. Does this result in a severe drop in performance in the short run? 
> Hence, an important measurement is the success rate for requests made on the 
> same side of the border as the data was inserted on. Is this dramatically 
> lower with a smaller cut? How does it compare to that darknet running on its 
> own?

Hi!

I did some simulations of this. See the attached plots. The model here is as described previously, but it also tracks which part of the network data and keys were originally inserted in. Routes where nodes query for keys originally inserted in their own part of the network are labelled "short routes"; routes for keys inserted in the other part are labelled "long routes".

The impact of this varies depending on the size of the cut and how popular
certain keys are. The plots illustrate both these dependencies.

The simulations show at least two things regarding the previous fear. First, connecting two darknets loosely does not seem to break routing that much within a darknet, but it does shield the nodes from data inserted in the other network. (That data is at least slowly made available again. This is not very surprising, but here the simulations show some impact.) Second, real-world networks seem to be slightly better (than fair models where all nodes get the same number of connections) at making data available via long routes. This is seen by comparing the two plots where the zipf parameter is varied.
The last effect is probably due to a different degree distribution than in the ideal model (with such a distribution, a few high-degree nodes naturally see more queries passing through, quickly making remote data available for more nodes in their own network).

Note: the simulations are still without churn and opennet. The plan is to involve these things next, but adding them is likely just to slow things down rather than change any of the patterns qualitatively. This is why these simple simulations are justified in the first place.

Note 2: the simulations were not averaged over many runs, but the patterns are clear. The result for zipf parameter 1.5 in the first plot is certainly going to be averaged out (there is a good theoretical reason why this may happen in some of the simulations).

Cheers,
/vive

[Attachment: Kleinberg_cut100_varzipf.png, image/png, 58357 bytes]
[Attachment: PGPWoT_cut100_varzipf.png, image/png, 63553 bytes]


[freenet-dev] Performance with almost disconnected darknets

2008-12-23 Thread Matthew Toseland
On Tuesday 23 December 2008 03:14, vive wrote:
> Hi list,
> 
> I have made some simulations to study what happens when two social networks are connected only by some small number of links (from now on the "cut"), while freenet routing is attempted between users running nodes in those networks. This scenario may appear with a powerful attacker who can e.g. implement a filtering mechanism between two social networks.

Or it may occur naturally as two growing darknets meet each other, in which 
case hopefully it will be a transitional phase.

> The question is how hard this hits the availability of data when nodes get increasingly separated.
> 
> It is obvious that anyone who can split up the network based on the keyspace will be able to separate nodes from data to some degree (in reality this also includes the caching mechanism). But this simulation examines something else: the impact of two underlying social networks (with users wanting to run full darknet) potentially connected by only a small number of links. The concern (discussed previously e.g. over irc) has been that different parts of the network may still become highly specialized in locations as a consequence of the constant swapping process that is going on. If that happened, nodes in one network would sooner or later forget data that becomes harder to reach in the other network.
> 
> Short description of the result (simulation details after that): the simulations indicate that specialization of the positions does not happen that much. It happens rather little when the cut is small, since swap requests seldom move between the networks; when the cut between the networks is larger, it happens increasingly. But with a larger cut it is also easier to navigate between the networks and reach the data ... The main problem is that the availability of data gets lowered quite a bit when the cut is really small. For more, see the simulation and plots below.
> 
> There are at least two ways to measure the impact of this problem. First, how much the locations specialize in different networks and whether the locations in the different networks "diverge" just a little or a lot. Second, the availability of data under some kind of request/insert model (that we should have confidence in at least resembling something about reality), as described below.
> 
> 1. Measure of location specialization: I have considered the distribution of the positions in the two separate networks. What one wants in any "healthy" network is a uniform distribution of positions, so one should use a measure of the distance between the actual distribution and the uniform distribution. For this, the variation distance was computed. The measure captures the difference in how two distributions put their probability "mass" onto the same underlying space (here: the positions). In theory the variation distance can be anything between 0 and 1 (if the distributions completely diverge); here it stays a little bit less than 1 (there is no way here to completely "escape" the reference of a uniform distribution). We want this measure to be as close to 0 as possible (that is, the positions to be uniformly spread over the keyspace). It thus measures when many positions have "escaped" from one of the networks, and it grows faster when these positions disappear from specific parts of the keyspace.
> 
> 2. Data availability: the success rate and the number of steps in successful routes (here: the roundtrip steps) that is seen in simulation.
> 
> The simulation: this was made by joining both ideal (generated) social networks as well as networks from actual datasets on social networks (the pgp WoT). Both are navigable networks, although slightly differing. It is hoped that this can correspond roughly to the scenario above, by joining these networks with different numbers of random connections.
> 
> Simulation:
> 0. Load the two networks, each of size 4000. All nodes are set to be pure darknet.
> 1. Connect the networks by joining pairs of random nodes, where one node is picked from each network. I tried this with a cut of 10,100,1000,1 links.
> 2. Nodes randomize their positions and swap for a while to stabilize the locations (successful swaps get more uncommon after this) to start from a "stable" state
> 3. 1000 keys are inserted by random nodes into the network (both sink and cache). Nodes have a store of 50 keys and a cache of 50 keys. Both are LRU as normal.

Actually we use pseudo-random replacement now. This probably doesn't strongly 
affect the results.
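
(For readers of the archive, the difference can be sketched as below in Python; hypothetical classes, not fred's actual datastore. LRU evicts the least recently touched key, while pseudo-random replacement simply evicts an arbitrary entry.)

    import random
    from collections import OrderedDict

    class LRUStore:
        # Fixed-size store with least-recently-used eviction, as assumed in
        # step 3 of the simulation.
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = OrderedDict()
        def get(self, key):
            if key in self.data:
                self.data.move_to_end(key)   # touching a key refreshes it
                return self.data[key]
            return None
        def put(self, key, value):
            if key in self.data:
                self.data.move_to_end(key)
            self.data[key] = value
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)   # evict the oldest entry

    class RandomReplacementStore:
        # Fixed-size store that evicts a pseudo-randomly chosen entry instead.
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = {}
        def get(self, key):
            return self.data.get(key)
        def put(self, key, value):
            if key not in self.data and len(self.data) >= self.capacity:
                self.data.pop(random.choice(list(self.data)))   # random evict
            self.data[key] = value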

> 4. The main phase: nodes swap positions when they can, and request data for a longer interval. (At the moment no new inserts nor churn; this can be the next thing to simulate, but it needs data input, see below.) The goal is to see what the data availability is, and how the positions have been affected after some time. Nodes request data via keys that have different popularity: a standard Zipf distribution is used over the keys.

[freenet-dev] Performance with almost disconnected darknets

2008-12-23 Thread vive
Hi list,

I have made some simulations to study what happens when two social networks are connected only by some small number of links (from now on the "cut"), while freenet routing is attempted between users running nodes in those networks. This scenario may appear with a powerful attacker who can e.g. implement a filtering mechanism between two social networks. The question is how hard this hits the availability of data when nodes get increasingly separated.

It is obvious that anyone who can split up the network based on the keyspace will be able to separate nodes from data to some degree (in reality this also includes the caching mechanism). But this simulation examines something else: the impact of two underlying social networks (with users wanting to run full darknet) potentially connected by only a small number of links. The concern (discussed previously e.g. over irc) has been that different parts of the network may still become highly specialized in locations as a consequence of the constant swapping process that is going on. If that happened, nodes in one network would sooner or later forget data that becomes harder to reach in the other network.
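
(For context, the swapping process referred to is, as far as I understand it, the Metropolis-style location swap from Sandberg's work. A rough Python sketch under that assumption, with hypothetical names, not the simulator's code:)

    import random

    def circular_distance(a, b):
        d = abs(a - b)
        return min(d, 1.0 - d)

    def link_product(node, location):
        # Product of distances from `location` to all of the node's peers.
        p = 1.0
        for peer in node.peers:
            p *= max(circular_distance(location, peer.location), 1e-12)
        return p

    def maybe_swap(a, b):
        # Accept a swap of locations if it shrinks the product of link
        # distances, otherwise accept with probability before/after
        # (the usual Metropolis-style rule; assumed, not taken from fred).
        before = link_product(a, a.location) * link_product(b, b.location)
        after = link_product(a, b.location) * link_product(b, a.location)
        if after <= before or random.random() < before / after:
            a.location, b.location = b.location, a.location
            return True
        return False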

Short description of the result (simulation details after that): the simulations indicate that specialization of the positions does not happen that much. It happens rather little when the cut is small, since swap requests seldom move between the networks; when the cut between the networks is larger, it happens increasingly. But with a larger cut it is also easier to navigate between the networks and reach the data ... The main problem is that the availability of data gets lowered quite a bit when the cut is really small. For more, see the simulation and plots below.

There are at least two ways to measure the impact of this problem. First, how 
much the locations specialize in different networks and whether the locations 
in the different networks "diverge" just a little or a lot. Second, the 
availability of data under some kind of request/insert model (that we should 
have confidence in at least resembling something about reality), as described 
below.

1. Measure of location specialization: I have considered the distribution of the positions in the two separate networks. What one wants in any "healthy" network is a uniform distribution of positions, so one should use a measure of the distance between the actual distribution and the uniform distribution. For this, the variation distance was computed. The measure captures the difference in how two distributions put their probability "mass" onto the same underlying space (here: the positions). In theory the variation distance can be anything between 0 and 1 (if the distributions completely diverge); here it stays a little bit less than 1 (there is no way here to completely "escape" the reference of a uniform distribution). We want this measure to be as close to 0 as possible (that is, the positions to be uniformly spread over the keyspace). It thus measures when many positions have "escaped" from one of the networks, and it grows faster when these positions disappear from specific parts of the keyspace. (A short sketch of this computation is given below, after point 2.)

2. Data availability: the success rate and the number of steps in successful 
routes (here: the roundtrip steps) that is seen in simulation.
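
As promised under point 1, a small Python sketch of the variation-distance computation (the binning of the keyspace into equal slices is my assumption; the thread does not say how the empirical distribution was discretized):

    def variation_distance(locations, bins=100):
        # Total variation distance between the empirical distribution of node
        # locations (histogrammed into `bins` equal slices of [0, 1)) and the
        # uniform distribution: 0.5 * sum(|p_i - 1/bins|).
        counts = [0] * bins
        for loc in locations:
            counts[int(loc * bins) % bins] += 1
        n = float(len(locations))
        return 0.5 * sum(abs(c / n - 1.0 / bins) for c in counts)

    # Perfectly uniform locations give a value close to 0; all nodes crowded
    # into one slice give close to 1 - 1/bins, i.e. "a little bit less than 1".
    print(variation_distance([i / 4000.0 for i in range(4000)]))   # ~0.0
    print(variation_distance([0.5] * 4000))                        # 0.99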

The simulation: this was made by joining both ideal (generated) social networks as well as networks from actual datasets on social networks (the pgp WoT). Both are navigable networks, although slightly differing. It is hoped that this can correspond roughly to the scenario above, by joining these networks with different numbers of random connections.

Simulation:
0. Load the two networks, each of size 4000. All nodes are set to be pure 
darknet.
1. Connect the networks by joining pairs of random nodes, where one node is 
picked from each network. I tried this with a cut of 10,100,1000,1 links.
2. Nodes randomize their positions and swap for a while to stabilize the 
locations (successful swaps get more uncommon after this) to start from a 
"stable" state
3. 1000 keys are inserted by random nodes into the network (both sink and 
cache). Nodes have a store of 50 keys and a cache of 50 keys. Both are LRU as 
normal.
4. The main phase: nodes swap positions when they can, and request data for a longer interval. (At the moment no new inserts nor churn; this can be the next thing to simulate, but it needs data input, see below.) The goal is to see what the data availability is, and how the positions have been affected after some time. Nodes request data via keys that have different popularity: a standard Zipf distribution is used over the keys. The parameter is also varied in the simulations. I have typically let the 4000 nodes do in total 10 requests and evaluated how that went. It could clearly be evaluated over a longer time, but how long is reasonable?
5. Evaluate the two measures above (variation distance always starts close to 0 when one initially randomizes)
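
For completeness, a small Python sketch of how the request phase in step 4 can be driven (the routing function is passed in, e.g. the greedy routing with backtracking sketched earlier in the thread; the Zipf handling is the standard 1/rank^s weighting, and the names and parameters here are illustrative, not vive's code):

    import random

    def zipf_weights(num_keys, s):
        # Unnormalised Zipf weights: the key with popularity rank r gets 1/r**s.
        return [1.0 / (r ** s) for r in range(1, num_keys + 1)]

    def request_phase(nodes, keys, zipf_param, num_requests, route_fn):
        # Random nodes request keys drawn from a Zipf popularity distribution;
        # route_fn(node, key) returns the node that answered, or None.
        # Swapping interleaved between requests (as in step 4) is omitted here.
        weights = zipf_weights(len(keys), zipf_param)
        successes = 0
        for _ in range(num_requests):
            requester = random.choice(nodes)
            key = random.choices(keys, weights=weights, k=1)[0]
            if route_fn(requester, key) is not None:
                successes += 1
        return successes / float(num_requests)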
