Tavin, I'm THRILLED to see you've chased this down to this point. I've been adding toString() methods to all kinds of classes trying to get a list of what all those damned threads were up to, suspecting something like this was causing my node to degrade similarly.
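In case it's useful, here's roughly the thread listing I've been after -- just a sketch with made-up names, walking up to the root ThreadGroup. Nothing here needs anything newer than Java 1.1, but it only tells you which threads exist, not where they're stuck; the actual stack traces still have to come from a kill -QUIT dump like yours:

    // Sketch only: list every live thread by climbing to the root ThreadGroup.
    // Per-thread stack traces aren't available programmatically on these VMs,
    // so this shows *which* threads exist, not where they're blocked.
    public class ThreadLister {
        public static void dump(java.io.PrintStream out) {
            ThreadGroup root = Thread.currentThread().getThreadGroup();
            while (root.getParent() != null) {
                root = root.getParent();            // climb to the root group
            }
            Thread[] threads = new Thread[root.activeCount() * 2];
            int n = root.enumerate(threads, true);  // true = include subgroups
            out.println(n + " live threads:");
            for (int i = 0; i < n; i++) {
                out.println("  " + threads[i].getName()
                            + " prio=" + threads[i].getPriority()
                            + (threads[i].isDaemon() ? " (daemon)" : ""));
            }
        }
    }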
For what it's worth, my node has been a happy camper (9k requests handled/hour) since I stopped announcing and edited my noderefs to include only nodes running the latest build. The latest build is the best I've seen to date.

In terms of defensive coding, would it be possible to have the threads time out and just kill 'em if they've been running too long? (A sketch of what I mean is inlined below, after the quoted stack traces.) Also, in the spirit of nodes behaving defensively, the node could restart itself every N hours, with N chosen randomly. That would also clean out whatever's leaking (my javaw grows to 83 megs over 30 hours), and it would be another way to kill those locked threads.

-glenn, being way too pragmatic for this bunch, but what the hell

> -----Original Message-----
> From: devl-admin at freenetproject.org
> [mailto:devl-admin at freenetproject.org] On Behalf Of Tavin Cole
> Sent: Wednesday, February 13, 2002 9:50 AM
> To: devl at freenetproject.org
> Subject: [freenet-devl] roadmap to 0.5, regulating and adapting to load,
> thread-lock bugs
>
>
> I have been observing the functioning of the network and the diagnostics
> recorded by hawk.  Here is a typical sampling of inbound aggregate
> requests received vs. those actually handled:
>
>    hour          received   handled
>
>    1013486400         547       525
>    1013490000         595       418
>    1013493600         638       148
>    1013497200         826        25
>    1013500800        1039         3
>    1013504400        1226         0
>    1013508000         799         0
>    1013511600         991         0
>    1013515200         959         0
>    1013518800        1313         0
>    1013522400        1520         0
>    1013526000        1053         0
>    1013529600         746         0
>    1013533200         430         0
>    1013536800         353         0
>    1013540400         179         0
>    1013544000         126         0
>    1013547600         124         0
>    1013551200          83         0
>    1013554800          69         0
>    1013558400         109         0
>    1013562000         132         0
>    1013565600         168         0
>    1013569200         136         0
>    1013572800         145         0
>    1013576400         154         0
>
> (for the next 8 hours, other nodes finally stop trying to send any
> requests at all)
>
> As you can see, the node eventually enters a state where it QueryRejects
> 100% of incoming requests, and the network doesn't adapt to that very
> well.  The pattern recurs every time hawk is restarted.
>
> I obtained a thread dump from hawk and found some nasty thread deadlocks
> that are eating up all the threads and hence making the node think it's
> under a lot of load.  The vast majority of the threads were stuck at one
> of the following two points:
>
> "PThread-104" prio=5 tid=0x80be000 nid=0x31d waiting on monitor [0xb19ff000..0xb19ff874]
>     at java.lang.Object.wait(Native Method)
>     at java.lang.Object.wait(Object.java:420)
>     at freenet.ConnectionHandler.sendMessage(ConnectionHandler.java:375)
>
> "PThread-102" prio=5 tid=0x80bc800 nid=0x31b waiting on monitor [0xb1dff000..0xb1dff874]
>     at java.lang.Object.wait(Native Method)
>     at java.lang.Object.wait(Object.java:420)
>     at freenet.ConnectionHandler.run(ConnectionHandler.java:301)
>
> A smaller number were stuck here:
>
> "PThread-105" prio=5 tid=0x80bec00 nid=0x31e waiting on monitor [0xb17ff000..0xb17ff874]
>     at java.lang.Object.wait(Native Method)
>     at java.lang.Object.wait(Object.java:420)
>     at freenet.node.ds.FSDataStoreElement.getFailureCode(FSDataStoreElement.java:56)
>
> No doubt this is happening to any node left running for a while, thereby
> killing the network and resulting in a lot of DataNotFounds as nodes are
> unable to route to their first choices for a key.  Recent network
> performance would seem to confirm this.
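This is the sort of place I had in mind for the timeouts above: anywhere we block in Object.wait() with no limit. A rough sketch of the bounded-wait idiom -- hypothetical names, not the actual ConnectionHandler code:

    // Sketch only: wait against a deadline instead of forever, so a peer that
    // never answers can't pin the thread permanently.
    class SendWindow {
        private boolean clearToSend = false;

        synchronized void notifyClearToSend() {
            clearToSend = true;
            notifyAll();                     // wake any thread stuck in waitClearToSend
        }

        // Returns false if the window never opened within timeoutMillis.
        synchronized boolean waitClearToSend(long timeoutMillis)
                throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (!clearToSend) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;            // timed out: caller cleans up / kills the job
                }
                wait(remaining);             // re-check the condition after every wakeup
            }
            return true;
        }
    }

If the wait()s in the traces above can be bounded like this, a wedged connection at least degrades into a timed-out request instead of a thread that's gone for good.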
> Clearly, this isn't good enough for a stable release.  Not only does the
> node need to be able to hold up under load, but the other nodes' routing
> should adapt more quickly to a node that always rejects requests.
>
> The first point requires more work on the threading and connection
> management code.  I have some plans for this.  The second point requires
> us to think more about how to deal with application-level feedback from
> routing.
>
> Oskar's idea to use QueryRejecteds for load regulation at the application
> level was the first step.  It introduced some negative feedback.  Now we
> need to apply that feedback to the routing logic.  Right now all we do is
> hope that hawk will drop out of the routing table as we accumulate
> references to other nodes.  This is just too slow.  The options I can
> see are:
>
>  1. factor QueryRejecteds into the CP (ugly, mixes layers)
>
>  2. introduce an application-layer probabilistic factor like CP
>     (might as well just do #1)
>
>  3. only send N requests at a time, where N is some small integer,
>     waiting for the Accepteds before sending more.  bail out on all
>     queued requests if we get a QueryRejected instead of an Accepted.
>     (arbitrary and bad for performance)
>
>  4. reintroduce ref deletion.  when we receive a QueryRejected, delete
>     the ref with a probability of 1 - 1/(no. of refs to that node).
>     the probability is there to prevent removing the last ref to that
>     node.
>
> I am favoring #4.  I think we should use it for timeouts as well as
> rejected requests.
>
> -tc
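Coming back to Tavin's option #4: as a worked example, with 4 refs to a node a rejection would drop one 75% of the time, with 2 refs 50% of the time, and with only 1 ref never, since 1 - 1/1 = 0. A sketch of that check -- the names are invented, not the real routing-table code:

    import java.util.Random;

    // Sketch of option #4 above: on a QueryRejected (or timeout) from a node,
    // delete the offending ref with probability 1 - 1/n, where n is the number
    // of refs we currently hold for that node.  n == 1 gives probability 0,
    // so the last ref to a node always survives.
    class RefPruner {
        private final Random random = new Random();

        boolean shouldDeleteRef(int refsToThatNode) {
            if (refsToThatNode <= 1) {
                return false;                        // never drop the only ref
            }
            double p = 1.0 - 1.0 / refsToThatNode;   // e.g. n = 4  ->  p = 0.75
            return random.nextDouble() < p;
        }
    }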
