I have been watching the network and the diagnostics recorded by hawk.
Here is a typical sample of aggregate inbound requests received vs. those
actually handled (hours are unix timestamps):

hour        received    handled

1013486400  547         525
1013490000  595         418
1013493600  638         148
1013497200  826         25
1013500800  1039        3
1013504400  1226        0
1013508000  799         0
1013511600  991         0
1013515200  959         0
1013518800  1313        0
1013522400  1520        0
1013526000  1053        0
1013529600  746         0
1013533200  430         0
1013536800  353         0
1013540400  179         0
1013544000  126         0
1013547600  124         0
1013551200  83          0
1013554800  69          0
1013558400  109         0
1013562000  132         0
1013565600  168         0
1013569200  136         0
1013572800  145         0
1013576400  154         0

(over the next 8 hours, the other nodes finally stop sending any
 requests at all)

As you can see, the node eventually enters a state where it QueryRejects
100% of incoming requests, and the network doesn't adapt to that very well.
The pattern is reproducible: it happens every time hawk is restarted.

I obtained a thread dump from hawk and found some nasty deadlocks that
eat up all the available threads and hence make the node think it is
under heavy load.  The vast majority of the threads were stuck at one of
the following two points:

"PThread-104" prio=5 tid=0x80be000 nid=0x31d waiting on monitor
[0xb19ff000..0xb19ff874]
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:420)
        at freenet.ConnectionHandler.sendMessage(ConnectionHandler.java:375)

"PThread-102" prio=5 tid=0x80bc800 nid=0x31b waiting on monitor
[0xb1dff000..0xb1dff874]
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:420)
        at freenet.ConnectionHandler.run(ConnectionHandler.java:301)

A smaller number were stuck here:

"PThread-105" prio=5 tid=0x80bec00 nid=0x31e waiting on monitor
[0xb17ff000..0xb17ff874]
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:420)
        at freenet.node.ds.FSDataStoreElement.getFailureCode(FSDataStoreElement.java:56)
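
To make the failure mode concrete, the shape of the problem is roughly
the following (hypothetical names, not the actual ConnectionHandler or
FSDataStoreElement code): a pooled thread sits in an untimed wait() for a
condition that whoever was supposed to signal it never does, so that
thread is lost for good and the pool fills up until the node thinks it is
saturated and rejects everything.

public class StuckSender {

    private final Object sendLock = new Object();
    private boolean channelFree = false;   // only ever set by the peer thread

    // Called by a pooled request-handling thread.
    public void sendMessage(byte[] msg) throws InterruptedException {
        synchronized (sendLock) {
            while (!channelFree) {
                // No timeout: if the thread that should call channelFreed()
                // is itself wedged, this thread waits forever and the pool
                // is down one thread permanently.
                sendLock.wait();
            }
            channelFree = false;
            // ... actually write msg to the connection ...
        }
    }

    // Supposed to be called when the connection is ready to send again.
    public void channelFreed() {
        synchronized (sendLock) {
            channelFree = true;
            sendLock.notifyAll();
        }
    }
}

A wait(timeout) loop with an error path would at least let a stuck
thread give up and return to the pool, but that is part of the larger
threading/connection cleanup discussed below.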


No doubt this is happening to any node that has been running for a
while, which is killing the network and producing a lot of DataNotFounds,
since nodes can't route to their first choices for a key.  Recent network
performance would seem to confirm this.

Clearly, this isn't good enough for a stable release.  Not only does the
node need to be able to hold up under load, but the other nodes' routing
should adapt more quickly to a node that always rejects requests.

The first point requires more work on the threading and connection
management code.  I have some plans for this.  The second point requires
us to think more about how to feed application-level feedback into the
routing logic.

Oskar's idea to use QueryRejecteds for load regulation at the
application level was the first step.  It introduced some negative
feedback.  Now we need to apply that feedback to the routing logic.
Right now all we do is hope that hawk will drop out of the routing table
as we accumulate references to other nodes.  This is just too slow.  The
options I can see are:

1. factor QueryRejecteds into the CP (ugly, mixes layers)
2. introduce an application-layer probabilistic factor like CP
   (might as well just do #1)
3. only send N requests at a time, where N is some small integer,
   waiting for the Accepteds before sending more.  bail out on all
   queued requests if we get a QueryRejected instead of Accepted.
   (arbitrary and bad for performance)
4. reintroduce ref deletion.  when we receive a QueryRejected, delete
   the ref with a probability of 1 - 1/(no. of refs to that node).
   that probability falls to zero when only one ref remains, so we
   never remove the last ref to that node.

I am favoring #4.  I think we should use it for timeouts as well as
rejected requests.
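
Just to pin down the arithmetic for #4, something along these lines
(RoutingTable and NodeId are made-up stand-ins here, not the real routing
classes; this is a sketch, not a patch):

import java.util.Random;

public class RefDeletion {

    private static final Random rand = new Random();

    // Called when a request routed via `node` comes back QueryRejected
    // (or, as suggested above, times out).
    public static void onQueryRejected(RoutingTable table, NodeId node) {
        int refCount = table.countRefs(node);
        if (refCount <= 1) {
            return;                    // never delete the last ref
        }
        // Delete with probability 1 - 1/(no. of refs to that node).
        double pDelete = 1.0 - 1.0 / refCount;
        if (rand.nextDouble() < pDelete) {
            table.removeOneRef(node);  // e.g. the ref we just routed by
        }
    }

    // Made-up minimal interfaces so the sketch stands on its own.
    interface RoutingTable {
        int countRefs(NodeId node);
        void removeOneRef(NodeId node);
    }

    interface NodeId {}
}

With, say, 5 refs to hawk, each rejection removes one with probability
0.8, so a node that rejects everything fades out of the table within a
handful of requests, while a node we only know one way is never lost
entirely.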

-tc
