> > -----Original Message-----
> > From: devl-admin at freenetproject.org
> > [mailto:devl-admin at freenetproject.org]On Behalf Of Tavin Cole
> > Sent: Wednesday, February 13, 2002 9:50 AM
> > To: devl at freenetproject.org
> > Subject: [freenet-devl] roadmap to 0.5, regulating and adapting to load,
> > thread-lock bugs
> >
> >
> > I have been observing the functioning of the network and the diagnostics
> > recorded by hawk.  Here is a typical sampling of inbound aggregate
> > requests received vs. those actually handled:
> >
> >     hour         received  handled
> >
> >     1013486400      547      525
> >     1013490000      595      418
> >     1013493600      638      148
> >     1013497200      826       25
> >     1013500800     1039        3
> >     1013504400     1226        0
> >     1013508000      799        0
> >     1013511600      991        0
> >     1013515200      959        0
> >     1013518800     1313        0
> >     1013522400     1520        0
> >     1013526000     1053        0
> >     1013529600      746        0
> >     1013533200      430        0
> >     1013536800      353        0
> >     1013540400      179        0
> >     1013544000      126        0
> >     1013547600      124        0
> >     1013551200       83        0
> >     1013554800       69        0
> >     1013558400      109        0
> >     1013562000      132        0
> >     1013565600      168        0
> >     1013569200      136        0
> >     1013572800      145        0
> >     1013576400      154        0
> >
> > (for the next 8 hours, other nodes finally stop trying to send any
> > requests at all)
> >
> > As you can see, the node eventually enters a state where it QueryRejects
> > 100% of incoming requests, and the network doesn't adapt to that
> > very well.  The pattern occurs repeatably every time hawk is restarted.
> >
> > I obtained a thread dump from hawk and found some nasty thread deadlocks

Tavin, are you sure these are deadlocks and not stalls?  What objects are
we deadlocking on?
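(Aside, to make that distinction concrete -- the following is a hypothetical
illustration, not anything from the freenet.* tree.  A thread parked in
Object.wait() has released the monitor it is waiting on, so on its own it
can only be stalled waiting for a notify that never arrives; a true
deadlock needs threads each holding a lock that another is trying to
enter.)

    // Hypothetical illustration, not Freenet code.
    public class StallVsDeadlock {
        static final Object lock = new Object();
        static final Object a = new Object(), b = new Object();
        static boolean ready = false;

        public static void main(String[] args) throws Exception {
            // Stall: parked in Object.wait(), not holding 'lock' --
            // the same shape as the ConnectionHandler frames quoted
            // in the thread dumps below.
            new Thread("staller") {
                public void run() {
                    synchronized (lock) {
                        while (!ready) {           // guard the condition
                            try { lock.wait(); }   // releases 'lock' here
                            catch (InterruptedException e) { return; }
                        }
                    }
                }
            }.start();

            // Deadlock: two threads take the same locks in opposite order.
            new Thread("t1") {
                public void run() {
                    synchronized (a) { pause(); synchronized (b) {} }
                }
            }.start();
            new Thread("t2") {
                public void run() {
                    synchronized (b) { pause(); synchronized (a) {} }
                }
            }.start();

            // kill -3 this JVM: the staller is merely waiting, while t1
            // and t2 are blocked trying to enter each other's synchronized
            // blocks (typically reported as "waiting for monitor entry") --
            // that is the signature of a real deadlock.
            Thread.sleep(60000);
        }

        static void pause() {
            try { Thread.sleep(500); } catch (InterruptedException e) {}
        }
    }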
> > that are eating up all the threads and hence making the node think it's
> > under a lot of load.  The vast majority of the threads were stuck at one
> > of the following two points:
> >
> > "PThread-104" prio=5 tid=0x80be000 nid=0x31d waiting on monitor
> > [0xb19ff000..0xb19ff874]
> >         at java.lang.Object.wait(Native Method)
> >         at java.lang.Object.wait(Object.java:420)
> >         at freenet.ConnectionHandler.sendMessage(ConnectionHandler.java:375)

This is where the ConnectionHandler waits for the sendLock.  Isn't it
normal that many threads should be waiting here under heavy load?
There's also this known bug:
http://sourceforge.net/tracker/index.php?func=detail&aid=493158&group_id=978&atid=100978

> > "PThread-102" prio=5 tid=0x80bc800 nid=0x31b waiting on monitor
> > [0xb1dff000..0xb1dff874]
> >         at java.lang.Object.wait(Native Method)
> >         at java.lang.Object.wait(Object.java:420)
> >         at freenet.ConnectionHandler.run(ConnectionHandler.java:301)

Here we are waiting for the trailing field to be received.  But we aren't
holding any locks.  How can this be the symptom of a deadlock?  How is
receiveLock supposed to be notified?

> > A smaller number were stuck here:
> >
> > "PThread-105" prio=5 tid=0x80bec00 nid=0x31e waiting on monitor
> > [0xb17ff000..0xb17ff874]
> >         at java.lang.Object.wait(Native Method)
> >         at java.lang.Object.wait(Object.java:420)
> >         at freenet.node.ds.FSDataStoreElement.getFailureCode(FSDataStoreElement.java:56)
> >
> > No doubt this is happening to any node left running for a while and
> > thereby killing the network and resulting in a lot of DataNotFounds as
> > nodes are unable to route to their first choices for a key.  Recent
> > network performance would seem to confirm this.
> >
> > Clearly, this isn't good enough for a stable release.  Not only does the
> > node need to be able to hold up under load, but the other nodes' routing
> > should adapt more quickly to a node that always rejects requests.
> >
> > The first point requires more work on the threading and connection
> > management code.

We need to fully diagnose what's going on before messing with the code.

> > I have some plans for this.  The second point requires
> > us to think more about how to deal with application-level feedback from
> > routing.
> >
> > Oskar's idea to use QueryRejecteds for load regulation at the
> > application level was the first step.  It introduced some negative
> > feedback.  Now we need to apply that feedback to the routing logic.
> > Right now all we do is hope that hawk will drop out of the routing table
> > as we accumulate references to other nodes.  This is just too slow.  The
> > options I can see are:
> >
> >   1.  factor QueryRejecteds into the CP  (ugly, mixes layers)
> >   2.  introduce an application-layer probabilistic factor like CP
> >       (might as well just do #1)
> >   3.  only send N requests at a time, where N is some small integer,
> >       waiting for the Accepteds before sending more.  bail out on all
> >       queued requests if we get a QueryRejected instead of Accepted.
> >       (arbitrary and bad for performance)
> >   4.  reintroduce ref deletion.  when we receive a QueryRejected, delete
> >       the ref with a probability of 1 - 1/(no. of refs to that node).
> >       the probability is to prevent removing the last ref to that node.
> >
> > I am favoring #4.  I think we should use it for timeouts as well as
> > rejected requests.

#4 looks good to me.
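To make #4 concrete, here is a rough sketch of the probability check.  The
class and method names here are hypothetical, not the actual routing-table
API:

    import java.util.Random;

    // Rough sketch of option #4 -- hypothetical names, not the real
    // freenet routing code.  On a QueryRejected (or a timeout), drop the
    // ref that produced it with probability 1 - 1/n, where n is how many
    // refs we currently hold for that node.  With n == 1 the probability
    // is 0, so the last ref to a node always survives.
    public class RefDeletionPolicy {
        private final Random random = new Random();

        /** Returns true if the caller should remove the offending ref. */
        public boolean shouldDeleteRef(int refsToThatNode) {
            if (refsToThatNode <= 1)
                return false;                       // never drop the last ref
            double p = 1.0 - 1.0 / refsToThatNode;  // e.g. 4 refs -> p = 0.75
            return random.nextDouble() < p;
        }
    }

Expressed this way the same check can be called from both the QueryRejected
handler and the timeout handler, which matches treating timeouts the same
as rejected requests.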
--gj

--
Freesite (0.4)  freenet:SSK at npfV5XQijFkF6sXZvuO0o~kG4wEPAgM/homepage//

_______________________________________________
Devl mailing list
Devl at freenetproject.org
http://lists.freenetproject.org/mailman/listinfo/devl
