Tavin,

I'm THRILLED to see you've chased this down to this point.  I've been adding
toString() methods to all kinds of classes trying to get a list of what all
those damned threads were up to, suspecting something like this was causing
my node to degrade similarly.

For what it's worth, my node has been a happy camper (9k requests
handled/hour) since I stopped announcing and edited my noderefs to only
include nodes with the latest build.  The latest build is the best I've seen
to date.

In terms of defensive coding, would it be possible to have the threads time
out and just kill 'em if they've been running too long?
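
Something like a watchdog thread, maybe?  Purely a sketch -- none of these
class names exist in the node, and interrupt() only rescues threads parked
in wait() or sleep(), which at least matches what Tavin's dump shows:

import java.util.Iterator;
import java.util.LinkedList;

// Hypothetical watchdog: worker threads register when they pick up a job
// and deregister when they finish; anything still registered past the
// deadline gets interrupt()ed so its wait() throws and it can bail out.
public class ThreadWatchdog extends Thread {

    private static final long MAX_JOB_MILLIS = 10L * 60L * 1000L;  // 10 minutes
    private static final long SWEEP_MILLIS   = 30L * 1000L;        // check twice a minute

    private final LinkedList jobs = new LinkedList();   // of Entry

    private static final class Entry {
        final Thread thread;
        final long started = System.currentTimeMillis();
        Entry(Thread t) { thread = t; }
    }

    public synchronized void register(Thread t) {
        jobs.add(new Entry(t));
    }

    public synchronized void deregister(Thread t) {
        for (Iterator i = jobs.iterator(); i.hasNext(); ) {
            if (((Entry) i.next()).thread == t) { i.remove(); return; }
        }
    }

    public void run() {
        while (true) {
            try { Thread.sleep(SWEEP_MILLIS); }
            catch (InterruptedException e) { return; }
            long now = System.currentTimeMillis();
            synchronized (this) {
                for (Iterator i = jobs.iterator(); i.hasNext(); ) {
                    Entry e = (Entry) i.next();
                    if (now - e.started > MAX_JOB_MILLIS) {
                        e.thread.interrupt();   // kick the stuck thread loose
                        i.remove();
                    }
                }
            }
        }
    }
}

A thread that's wedged because it can't get a monitor (rather than sitting
in wait()) won't notice the interrupt, though, so this is only half a fix.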

Also, in the spirit of nodes defending themselves, the node could restart
itself every N hours, with N chosen randomly.  This would also clean out
whatever's leaking (my javaw grows to 83 megs over 30 hours).  Actually,
this would be a way to kill those locked threads, too.
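
Roughly what I have in mind is below -- it assumes the node is launched
from a wrapper script that relaunches the JVM whenever it exits with a
particular code, and none of this is existing node code:

import java.util.Random;
import java.util.Timer;
import java.util.TimerTask;

// Sketch of a self-restart timer.  Pick a lifetime somewhere between 12
// and 36 hours at startup, then exit with a special code so a wrapper
// script around the java invocation knows to relaunch the node.
public class RestartTimer {

    public static final int RESTART_EXIT_CODE = 27;  // arbitrary

    public static void schedule() {
        long hour = 60L * 60L * 1000L;
        long lifetime = 12 * hour + (long) (new Random().nextDouble() * 24 * hour);
        Timer timer = new Timer(true);  // daemon, won't keep the JVM alive
        timer.schedule(new TimerTask() {
            public void run() {
                // a real version would flush the datastore and close
                // connections cleanly before exiting
                System.exit(RESTART_EXIT_CODE);
            }
        }, lifetime);
    }
}

The wrapper side is then just a shell loop that keeps starting java again
as long as it exits with 27.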

-glenn, being way too pragmatic for this bunch, but what the hell



> -----Original Message-----
> From: devl-admin at freenetproject.org
> [mailto:devl-admin at freenetproject.org]On Behalf Of Tavin Cole
> Sent: Wednesday, February 13, 2002 9:50 AM
> To: devl at freenetproject.org
> Subject: [freenet-devl] roadmap to 0.5, regulating and adapting to load,
> thread-lock bugs
>
>
> I have been observing the functioning of the network and the diagnostics
> recorded by hawk.  Here is a typical sampling of inbound aggregate
> requests received vs. those actually handled (the "hour" column is a
> unix timestamp):
>
> hour        received    handled
>
> 1013486400  547         525
> 1013490000  595         418
> 1013493600  638         148
> 1013497200  826         25
> 1013500800  1039        3
> 1013504400  1226        0
> 1013508000  799         0
> 1013511600  991         0
> 1013515200  959         0
> 1013518800  1313        0
> 1013522400  1520        0
> 1013526000  1053        0
> 1013529600  746         0
> 1013533200  430         0
> 1013536800  353         0
> 1013540400  179         0
> 1013544000  126         0
> 1013547600  124         0
> 1013551200  83          0
> 1013554800  69          0
> 1013558400  109         0
> 1013562000  132         0
> 1013565600  168         0
> 1013569200  136         0
> 1013572800  145         0
> 1013576400  154         0
>
> (for the next 8 hours, other nodes finally stop trying to send any
>  requests at all)
>
> As you can see, the node eventually enters a state where it QueryRejects
> 100% of incoming requests, and the network doesn't adapt to that
> very well.
> The pattern occurs repeatably every time hawk is restarted.
>
> I obtained a thread dump from hawk and found some nasty thread deadlocks
> that are eating up all the threads and hence making the node think it's
> under a lot of load.  The vast majority of the threads were stuck at one
> of the following two points:
>
> "PThread-104" prio=5 tid=0x80be000 nid=0x31d waiting on monitor
> [0xb19ff000..0xb19ff874]
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:420)
>         at
> freenet.ConnectionHandler.sendMessage(ConnectionHandler.java:375)
>
> "PThread-102" prio=5 tid=0x80bc800 nid=0x31b waiting on monitor
> [0xb1dff000..0xb1dff874]
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:420)
>         at freenet.ConnectionHandler.run(ConnectionHandler.java:301)
>
> A smaller number were stuck here:
>
> "PThread-105" prio=5 tid=0x80bec00 nid=0x31e waiting on monitor
> [0xb17ff000..0xb17ff874]
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:420)
>         at
> freenet.node.ds.FSDataStoreElement.getFailureCode(FSDataStoreEleme
> nt.java:56)
>
>
> No doubt this is happening to any node left running for a while, which
> is killing the network and producing a lot of DataNotFounds as nodes
> become unable to route to their first choices for a key.  Recent
> network performance would seem to confirm this.
>
> Clearly, this isn't good enough for a stable release.  Not only does the
> node need to be able to hold up under load, but the other nodes' routing
> should adapt more quickly to a node that always rejects requests.
>
> The first point requires more work on the threading and connection
> management code.  I have some plans for this.  The second point requires
> us to think more about how to deal with application-level feedback from
> routing.
>
> Oskar's idea to use QueryRejecteds for load regulation at the
> application level was the first step.  It introduced some negative
> feedback.  Now we need to apply that feedback to the routing logic.
> Right now all we do is hope that hawk will drop out of the routing table
> as we accumulate references to other nodes.  This is just too slow.  The
> options I can see are:
>
> 1. factor QueryRejecteds into the CP (ugly, mixes layers)
> 2. introduce an application-layer probabilistic factor like CP
>    (might as well just do #1)
> 3. only send N requests at a time, where N is some small integer,
>    waiting for the Accepteds before sending more.  bail out on all
>    queued requests if we get a QueryRejected instead of Accepted.
>    (arbitrary and bad for performance)
> 4. reintroduce ref deletion.  when we receive a QueryRejected, delete
>    the ref with a probability of 1 - 1/(no. of refs to that node).
>    the probability is to prevent removing the last ref to that node.
>
> I am favoring #4.  I think we should use it for timeouts as well as
> rejected requests.
>
> -tc
>
>
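
And just to make sure I'm reading option #4 right, here it is in code
form.  This is only a sketch of the idea -- the RoutingTable interface
below is a stand-in, not the real routing classes:

import java.util.Random;

// Option #4 as I understand it: on a QueryRejected (or a timeout), delete
// the ref that was routed to, with probability 1 - 1/n where n is the
// number of refs held for that node.  With n == 1 the probability is 0,
// so the last ref to a node is never removed.
public class RefPenalizer {

    /** Stand-in for whatever the real routing table looks like. */
    public interface RoutingTable {
        int countRefs(Object nodeId);              // refs currently held for this node
        void removeRef(Object nodeId, Object key); // drop the ref used for this key
    }

    private final Random random = new Random();

    public void onQueryRejected(RoutingTable table, Object nodeId, Object key) {
        int n = table.countRefs(nodeId);
        if (n <= 1) return;                        // never delete the last ref
        double p = 1.0 - 1.0 / n;                  // e.g. 3 refs -> p = 2/3
        if (random.nextDouble() < p) {
            table.removeRef(nodeId, key);
        }
    }
}

Wiring the same thing into timeouts, as Tavin suggests, would just mean
calling it from the timeout path as well.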


