I have been watching the network and the diagnostics recorded by hawk.
Here is a typical sample of the hourly aggregates of inbound requests
received vs. those actually handled:
hour (Unix epoch)   received   handled
1013486400 547 525
1013490000 595 418
1013493600 638 148
1013497200 826 25
1013500800 1039 3
1013504400 1226 0
1013508000 799 0
1013511600 991 0
1013515200 959 0
1013518800 1313 0
1013522400 1520 0
1013526000 1053 0
1013529600 746 0
1013533200 430 0
1013536800 353 0
1013540400 179 0
1013544000 126 0
1013547600 124 0
1013551200 83 0
1013554800 69 0
1013558400 109 0
1013562000 132 0
1013565600 168 0
1013569200 136 0
1013572800 145 0
1013576400 154 0
(over the next 8 hours, the other nodes finally stop sending any
requests at all)
As you can see, the node eventually enters a state where it QueryRejects
100% of incoming requests, and the rest of the network doesn't adapt to
that very well. The same pattern recurs every time hawk is restarted.
I obtained a thread dump from hawk and found some nasty deadlocks that
are tying up all the threads and hence making the node think it's under
heavy load. The vast majority of the threads were stuck at one of the
following two points:
"PThread-104" prio=5 tid=0x80be000 nid=0x31d waiting on monitor
[0xb19ff000..0xb19ff874]
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:420)
at freenet.ConnectionHandler.sendMessage(ConnectionHandler.java:375)
"PThread-102" prio=5 tid=0x80bc800 nid=0x31b waiting on monitor
[0xb1dff000..0xb1dff874]
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:420)
at freenet.ConnectionHandler.run(ConnectionHandler.java:301)
A smaller number were stuck here:
"PThread-105" prio=5 tid=0x80bec00 nid=0x31e waiting on monitor
[0xb17ff000..0xb17ff874]
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:420)
at
freenet.node.ds.FSDataStoreElement.getFailureCode(FSDataStoreElement.java:56)
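All of those frames are parked in Object.wait(). As a rough illustration
(a hypothetical sketch, not the actual ConnectionHandler or
FSDataStoreElement code), the failure mode looks like the first method
below: an unbounded wait() never returns if the matching notify() is
lost, while a bounded wait lets the thread give up and be reclaimed:

// Hypothetical sketch of the pattern the stack traces suggest;
// NOT the actual Freenet code.
public class SendQueue {

    private final Object lock = new Object();
    private boolean ready = false;

    // Unbounded wait: the caller is stuck until someone calls notify().
    // If the connection dies before the reply arrives, the thread is
    // stranded forever -- which is where the dumped threads are sitting.
    public void sendMessageUnbounded() throws InterruptedException {
        synchronized (lock) {
            while (!ready) {
                lock.wait();
            }
            ready = false;
        }
    }

    // Bounded wait: give up after a deadline so the thread can be
    // reclaimed and the caller can report a failure instead of tying
    // up the pool.
    public boolean sendMessageWithTimeout(long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (lock) {
            while (!ready) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;      // timed out; let the caller clean up
                }
                lock.wait(remaining);
            }
            ready = false;
            return true;
        }
    }

    // Called by the connection thread when the message slot frees up.
    public void signalReady() {
        synchronized (lock) {
            ready = true;
            lock.notifyAll();
        }
    }
}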
No doubt this is happening to any node left running for a while, which
is killing the network and producing a lot of DataNotFounds as nodes
find themselves unable to route to their first choices for a key.
Recent network performance would seem to confirm this.
Clearly, this isn't good enough for a stable release. Not only does the
node need to be able to hold up under load, but the other nodes' routing
should adapt more quickly to a node that always rejects requests.
The first point requires more work on the threading and connection
management code; I have some plans for this. The second point requires
us to think harder about how to feed application-level feedback back
into routing.
Oskar's idea to use QueryRejecteds for load regulation at the
application level was the first step. It introduced some negative
feedback. Now we need to apply that feedback to the routing logic.
Right now all we do is hope that hawk will drop out of the routing table
as we accumulate references to other nodes. This is just too slow. The
options I can see are:
1. Factor QueryRejecteds into the CP. (Ugly; mixes layers.)
2. Introduce an application-layer probabilistic factor like the CP.
   (Might as well just do #1.)
3. Only send N requests at a time, where N is some small integer,
   waiting for the Accepteds before sending more; bail out on all
   queued requests if we get a QueryRejected instead of an Accepted.
   (Arbitrary, and bad for performance.)
4. Reintroduce ref deletion: when we receive a QueryRejected, delete
   the ref with a probability of 1 - 1/(no. of refs to that node).
   The probability term prevents us from removing the last ref to a node.
I am favoring #4. I think we should use it for timeouts as well as
rejected requests.
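To make #4 concrete, here is a minimal sketch of the deletion decision
(the class and method names are hypothetical; this isn't a patch against
the routing code). With 4 refs to a node, the ref that drew the
QueryRejected is deleted with probability 0.75; with a single ref the
probability is 0, so we never lose the node entirely:

import java.util.Random;

// Hypothetical sketch of option 4; not actual Freenet routing code.
public class RefPruner {

    private final Random random = new Random();

    // Decide whether the ref that drew a QueryRejected (or a timeout)
    // should be deleted, given how many refs we hold for that node.
    // Deletion probability is 1 - 1/refCount, so the last ref survives.
    public boolean shouldDeleteRef(int refCount) {
        if (refCount <= 1) {
            return false;          // never remove the last ref to a node
        }
        double deleteProbability = 1.0 - 1.0 / refCount;
        return random.nextDouble() < deleteProbability;
    }
}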
-tc