Questions about gathering information and statistics about the tor-network

Sebastian Schmidt Wed, 14 Jan 2009 12:41:24 -0800

Hi,
I'm currently writing a tool to gather some long-term statistics about the 
tor-network. I want to plot this information, collected hourly (e.g. with 
gnuplot), and offer plots on a daily/weekly/monthly/yearly basis.


I think this is useful (for tor development and for interested users) to 
observe how the tor-network develops over time, for example: is the number of 
nodes growing or shrinking; are the routers' locations spreading around the 
world over time or concentrating even more in a few countries like the US and 
Germany; the number and ratio of exit-nodes to entry-/middle-nodes; the average 
uptime of the nodes; how the set of ports blocked by the nodes develops; whether 
the average bandwidth of the network is growing or shrinking; and so on.

Some information can easily be collected from the individual 
server-descriptors by simply asking the control-port: the number of nodes, 
their country (using geoiplookup on their IPs), the uptime, the blocked ports 
and so on.
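
As a rough illustration of the descriptor-based collection described above, 
here is a minimal Python sketch. It assumes the descriptor text has already 
been fetched somehow (e.g. through the control-port) and only relies on the 
"router", "bandwidth", "uptime" and "reject" lines of dir-spec section 2.1. The 
sample descriptor, the function name parse_descriptor and the result keys are 
made up for this example.

```python
# Sketch: pull a few fields out of one router descriptor.
# SAMPLE_DESCRIPTOR is invented illustration data, not a real router.
SAMPLE_DESCRIPTOR = """\
router ExampleNode 192.0.2.10 9001 0 9030
bandwidth 51200 102400 20480
uptime 86400
reject *:25
reject *:119
accept *:*
"""


def parse_descriptor(text):
    """Return nickname, address, uptime, bandwidth values and rejected ports."""
    info = {"rejected_ports": []}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        keyword, args = parts[0], parts[1:]
        if keyword == "router" and len(args) >= 2:
            info["nickname"], info["address"] = args[0], args[1]
        elif keyword == "uptime" and args:
            info["uptime_seconds"] = int(args[0])
        elif keyword == "bandwidth" and len(args) >= 3:
            # Field order: bandwidth-avg, bandwidth-burst, bandwidth-observed
            info["bw_avg"], info["bw_burst"], info["bw_observed"] = map(int, args[:3])
        elif keyword == "reject" and args:
            # Exit policy patterns look like "addr:port"; keep the port part.
            info["rejected_ports"].append(args[0].rsplit(":", 1)[-1])
    return info


info = parse_descriptor(SAMPLE_DESCRIPTOR)
```

The country would then come from a separate lookup on info["address"] (for 
example with geoiplookup), which is left out here.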

But some other interesting information is not as easy to gather:

1.) the number of users: this would be nice to know, but I don't know if there 
is currently any way to even roughly estimate the number of users. In my 
opinion there are just two places where such information could be gathered 
somewhat reliably, but both are ruled out because the current implementation is 
designed to offer good security. There is also one way (place) to get a rough 
estimate, not of the number of users, but of whether this number is growing or 
shrinking.

a.) the entry-nodes: every entry-node knows (or can know) how many individual 
users (at least individual IPs) are connected to it right now. But because we 
don't know how many different circuits a user has open at any one moment, we 
can't say how many users there are in total, even if every entry-node reported 
the number of individual connections it currently has. The only workaround 
would be to throw the information from all entry-nodes, with all IPs of all 
users, into one pot. But this would be a very, very bad idea. So gathering the 
number of users based on entry-nodes is not going to work (at least not if we 
want the network to be as safe as it is at the moment).

b.) the directory-servers: if all clients asked the directory-servers for new 
information at a constant interval, we could count the number of requests per 
dir-server per 24 hours and divide it by the number of requests a single client 
makes in that time (24 hours divided by the interval length). But this has two 
problems: one is that not every client is online 24 hours a day, so the result 
would be pretty unreliable even if we guessed an average time a client is 
online within 24 hours. The other is that the implementation 
(https://svn.torproject.org/svn/tor/trunk/doc/spec/dir-spec.txt under 5.1) 
doesn't use a static interval for all clients but a more randomly chosen one. 
So this is not an option either, simply because we know neither how long each 
client is up nor the random interval.
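
To make the arithmetic of this naive directory-based estimate concrete, here is 
a small Python sketch. The function name estimate_clients, the fixed fetch 
interval and all numbers are assumptions for illustration only; as said above, 
real clients use neither a fixed interval nor a known online time.

```python
def estimate_clients(requests_per_day, fetch_interval_hours, avg_online_hours=24.0):
    """Naive estimate: total requests divided by the requests one client makes per day."""
    if fetch_interval_hours <= 0 or avg_online_hours <= 0:
        raise ValueError("interval and online time must be positive")
    # A client that is online for avg_online_hours and fetches every
    # fetch_interval_hours produces this many requests per day.
    requests_per_client = avg_online_hours / fetch_interval_hours
    return requests_per_day / requests_per_client


# Same (invented) request count, two different guesses for the online time:
always_on = estimate_clients(120_000, fetch_interval_hours=2.0)
part_time = estimate_clients(120_000, fetch_interval_hours=2.0, avg_online_hours=4.0)
```

With these made-up inputs the two guesses differ by a factor of six, which 
shows how strongly the result depends on values we cannot observe.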

c.) the number of downloads of a newly released tor version: the number of 
downloads of a new stable release of the tor-client could give a hint whether 
the number of users is growing or shrinking. Of course this could only be 
collected on the tor-project page and thus would only be a fraction of all 
downloads/users, because there are e.g. many users of modern operating systems 
( yes, a small dig at MS/MacOS/Sun ;) ) which offer a package management system, 
so those users don't compile by hand. Those downloads and updates can't be 
counted, but even this fraction of downloads of a new stable version (maybe 
within one week after its release) could give some impression, if we compare 
this number to prior releases, of whether the average number of users is 
growing or shrinking.

2.) the network health: network health can be understood in many different 
ways. One aspect I thought of would be comparing the bandwidth all nodes are 
offering with the bandwidth which is actually used, under the premise that we 
have enough users to consume all the bandwidth the nodes offer (and I think we 
can safely assume this). Under this condition, good network health would mean 
that the bandwidth which is actually used is nearly the same as the bandwidth 
the nodes offer. This gives an estimate of how good tor is at building 
circuits. If some nodes don't use all the bandwidth they have to offer while 
other nodes are nearly breaking under the bandwidth they are asked for, it 
means tor isn't doing well at assembling circuits. Also interesting here would 
be the number of connections each node has compared with the bandwidth it 
offers, but the number of connections isn't exported at all. At least I 
couldn't find it in the server-descriptor. I came to think about this through 
simple tests. Building a circuit with three really fast nodes gives you more 
bandwidth than building a circuit with three really slow nodes. But on a 
healthy network you would get the same bandwidth in either case, because the 
number of connections through the slow nodes would be lowered and the number 
through the fast nodes increased until they offer the same bandwidth to their 
users.
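
The health idea above can be sketched as a per-node utilization ratio 
(used / offered) plus a measure of how far the nodes differ from each other. 
This is only one possible reading of "health"; the function name 
utilization_report, the input format and the numbers are assumptions, and where 
the "used" value would come from is exactly the open question.

```python
from statistics import mean, pstdev


def utilization_report(nodes):
    """nodes: list of (offered_bytes_per_s, used_bytes_per_s) tuples."""
    ratios = [used / offered for offered, used in nodes if offered > 0]
    if not ratios:
        raise ValueError("no node with positive offered bandwidth")
    total_offered = sum(offered for offered, _ in nodes)
    total_used = sum(used for _, used in nodes)
    return {
        # Share of the offered bandwidth that is used across the whole network.
        "network_utilization": total_used / total_offered,
        "mean_node_utilization": mean(ratios),
        # 0 means every node is loaded equally relative to what it offers.
        "utilization_spread": pstdev(ratios),
    }


# Invented data: a slow and a fast node, evenly and unevenly loaded.
balanced = utilization_report([(100, 80), (1000, 800)])
skewed = utilization_report([(100, 95), (1000, 200)])
```

In this sketch a small spread would correspond to the healthy case, while a 
large spread would correspond to some nodes being overloaded and others idle.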

But even simply checking the bandwidth has some limitations (at least as I 
understand the specs: 
https://svn.torproject.org/svn/tor/trunk/doc/spec/dir-spec.txt under 2.1). We 
have bandwidth-avg and bandwidth-observed (burst is kind of useless for us 
here, I think). I don't know how these values are gathered; the specs are a bit 
imprecise here, but the values are pretty different when I look at them. 
Sometimes the observed value is less than 10% of the avg, so I don't know if 
this value is useful/accurate. It would be nice if a router told us how much it 
is willing to share and how much it is actually sharing, but afaik we don't 
have the bandwidth a router is willing to share, just how much it is sharing, 
which is bandwidth-avg, right? Am I interpreting this correctly?
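
A quick way to look at the avg/observed gap mentioned above is to compute the 
ratio per descriptor. The Python sketch below only assumes the field order 
"bandwidth <avg> <burst> <observed>" from dir-spec section 2.1; the function 
name bandwidth_ratio, the sample lines and the 10% threshold are illustrative. 
Note that the dir-spec text appears to describe bandwidth-avg as the rate a 
router is willing to sustain and bandwidth-observed as a measured estimate; 
treat that reading as an assumption to check against the spec.

```python
def bandwidth_ratio(bandwidth_line):
    """Return observed/avg for a descriptor 'bandwidth' line, or None if avg is 0."""
    parts = bandwidth_line.split()
    if len(parts) != 4 or parts[0] != "bandwidth":
        raise ValueError("expected: bandwidth <avg> <burst> <observed>")
    avg, _burst, observed = (int(p) for p in parts[1:])
    return observed / avg if avg > 0 else None


# Invented sample lines; real ones would come from the server-descriptors.
sample_lines = [
    "bandwidth 100000 200000 5000",
    "bandwidth 51200 102400 48000",
]

# Collect the lines where observed is below 10% of avg.
low_ratio = []
for line in sample_lines:
    ratio = bandwidth_ratio(line)
    if ratio is not None and ratio < 0.10:
        low_ratio.append(line)
```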


I wanted to ask what you think about the idea of creating such statistics at 
all. And do you have any better ideas or thoughts about the number of users and 
the network health?

greetings
          Sebastian
