> > -----Original Message-----
> > From: devl-admin at freenetproject.org
> > [mailto:devl-admin at freenetproject.org]On Behalf Of Tavin Cole
> > Sent: Wednesday, February 13, 2002 9:50 AM
> > To: devl at freenetproject.org
> > Subject: [freenet-devl] roadmap to 0.5, regulating and adapting to load,
> > thread-lock bugs
> >
> >
> > I have been observing the functioning of the network and the diagnostics
> > recorded by hawk.  Here is a typical sampling of inbound aggregate
> > requests received vs. those actually handled:
> >
> > hour (unix) received    handled
> >
> > 1013486400  547         525
> > 1013490000  595         418
> > 1013493600  638         148
> > 1013497200  826         25
> > 1013500800  1039        3
> > 1013504400  1226        0
> > 1013508000  799         0
> > 1013511600  991         0
> > 1013515200  959         0
> > 1013518800  1313        0
> > 1013522400  1520        0
> > 1013526000  1053        0
> > 1013529600  746         0
> > 1013533200  430         0
> > 1013536800  353         0
> > 1013540400  179         0
> > 1013544000  126         0
> > 1013547600  124         0
> > 1013551200  83          0
> > 1013554800  69          0
> > 1013558400  109         0
> > 1013562000  132         0
> > 1013565600  168         0
> > 1013569200  136         0
> > 1013572800  145         0
> > 1013576400  154         0
> >
> > (over the next 8 hours, other nodes finally stop trying to send any
> >  requests at all)
> >
> > As you can see, the node eventually enters a state where it QueryRejects
> > 100% of incoming requests, and the network adapts to that poorly.
> > The pattern repeats every time hawk is restarted.
> >
> > I obtained a thread dump from hawk and found some nasty thread deadlocks
Tavin, are you sure these are deadlocks and not stalls?  What objects are we 
deadlocking on?
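
For contrast, a genuine deadlock needs a cycle of lock acquisition, e.g.
two threads taking the same two monitors in opposite order (a contrived
sketch, obviously not Freenet code):

    // Contrived deadlock: each thread blocks *entering* a monitor the
    // other one holds.  In a thread dump that shows up as "waiting for
    // monitor entry", not the "in Object.wait()" we see in the traces below.
    public class DeadlockDemo {
        static final Object lockA = new Object();
        static final Object lockB = new Object();

        public static void main(String[] args) {
            new Thread(new Runnable() {
                public void run() {
                    synchronized (lockA) {
                        pause();                    // let main grab lockB
                        synchronized (lockB) {}     // blocks forever
                    }
                }
            }).start();
            synchronized (lockB) {
                pause();
                synchronized (lockA) {}             // blocks forever
            }
        }

        static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException e) {}
        }
    }

All the traces below are parked in Object.wait(), which means some other
thread is supposed to notify them; if nobody ever does, that's a missed
notification or a dead producer, not a lock cycle.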

> > that are eating up all the threads and hence making the node think it's
> > under a lot of load.  The vast majority of the threads were stuck at one
> > of the following two points:
> >
> > "PThread-104" prio=5 tid=0x80be000 nid=0x31d waiting on monitor
> > [0xb19ff000..0xb19ff874]
> >         at java.lang.Object.wait(Native Method)
> >         at java.lang.Object.wait(Object.java:420)
> >         at freenet.ConnectionHandler.sendMessage(ConnectionHandler.java:375)
This is where the ConnectionHandler waits for the sendLock.

Isn't it normal for many threads to be waiting here under heavy load?
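
To illustrate what I mean by a stall rather than a deadlock: a guarded
wait along these lines (hypothetical sketch, the real sendMessage may
differ) piles threads up whenever the connection can't drain fast enough,
and every one of them shows up in a dump exactly like PThread-104:

    // Hypothetical sketch, assuming a sendLock monitor like the one the
    // trace points at; not the actual ConnectionHandler code.
    class SendSketch {
        private final Object sendLock = new Object();
        private boolean sending = false;

        public void sendMessage(byte[] msg) throws InterruptedException {
            synchronized (sendLock) {
                while (sending)          // every queued sender parks here,
                    sendLock.wait();     // "waiting on monitor", like PThread-104
                sending = true;
            }
            try {
                writeToConnection(msg);  // stands in for the real socket write
            } finally {
                synchronized (sendLock) {
                    sending = false;
                    sendLock.notify();   // hand off to the next waiting sender
                }
            }
        }

        private void writeToConnection(byte[] msg) {
            // stub: the real code writes to the peer's socket
        }
    }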

There's also this known bug:
http://sourceforge.net/tracker/index.php?func=detail&aid=493158&group_id=978&atid=100978


> >
> > "PThread-102" prio=5 tid=0x80bc800 nid=0x31b waiting on monitor
> > [0xb1dff000..0xb1dff874]
> >         at java.lang.Object.wait(Native Method)
> >         at java.lang.Object.wait(Object.java:420)
> >         at freenet.ConnectionHandler.run(ConnectionHandler.java:301)
Here we are waiting for the trailing field to be received.

But we aren't holding any locks.  How can this be the symptom of a deadlock?

How is receiveLock supposed to be notified?
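
If the wait looks something like this (again hypothetical, not the real
code), a waiter hangs forever whenever the reader thread dies, or
notifies without setting a condition flag the waiter can check:

    // Hypothetical sketch of the receive side, assuming a receiveLock
    // monitor guarding "trailing field fully received".
    class ReceiveSketch {
        private final Object receiveLock = new Object();
        private boolean trailingDone = false;

        // What the parked thread (cf. PThread-102) would be doing:
        public void awaitTrailingField() throws InterruptedException {
            synchronized (receiveLock) {
                while (!trailingDone)    // guard against a missed notify
                    receiveLock.wait();
            }
        }

        // What the reader must do when the trailing field arrives.  If it
        // dies first, or never sets the flag, every waiter above sits in
        // Object.wait() indefinitely -- which looks just like our dumps.
        public void trailingFieldReceived() {
            synchronized (receiveLock) {
                trailingDone = true;     // set the flag inside the monitor,
                receiveLock.notifyAll(); // then wake all the waiters
            }
        }
    }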

> >
> > A smaller number were stuck here:
> >
> > "PThread-105" prio=5 tid=0x80bec00 nid=0x31e waiting on monitor
> > [0xb17ff000..0xb17ff874]
> >         at java.lang.Object.wait(Native Method)
> >         at java.lang.Object.wait(Object.java:420)
> >         at freenet.node.ds.FSDataStoreElement.getFailureCode(FSDataStoreElement.java:56)
> >
> >
> > No doubt this is happening to any node left running for a while,
> > thereby killing the network and producing a lot of DataNotFounds as
> > nodes become unable to route to their first choices for a key.  Recent
> > network performance would seem to confirm this.
> >
> > Clearly, this isn't good enough for a stable release.  Not only does the
> > node need to be able to hold up under load, but the other nodes' routing
> > should adapt more quickly to a node that always rejects requests.
> >
> > The first point requires more work on the threading and connection
> > management code.
We need to fully diagnose what's going on before messing with the code.

> > I have some plans for this.  The second point requires
> > us to think more about how to deal with application-level feedback from
> > routing.
> >
> > Oskar's idea to use QueryRejecteds for load regulation at the
> > application level was the first step.  It introduced some negative
> > feedback.  Now we need to apply that feedback to the routing logic.
> > Right now all we do is hope that hawk will drop out of the routing table
> > as we accumulate references to other nodes.  This is just too slow.  The
> > options I can see are:
> >
> > 1. factor QueryRejecteds into the CP (ugly, mixes layers)
> > 2. introduce an application-layer probabilistic factor like CP
> >    (might as well just do #1)
> > 3. only send N requests at a time, where N is some small integer,
> >    waiting for the Accepteds before sending more.  bail out on all
> >    queued requests if we get a QueryRejected instead of Accepted.
> >    (arbitrary and bad for performance)
> > 4. reintroduce ref deletion.  when we receive a QueryRejected, delete
> >    the ref with probability 1 - 1/(no. of refs to that node).
> >    that choice of probability ensures we never remove the last ref
> >    to that node.
> >
> > I am favoring #4.  I think we should use it for timeouts as well as
> > rejected requests.
> >
#4 looks good to me. 
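
A minimal sketch of the rule as I read it (names made up, not actual
node code):

    import java.util.Random;

    // Hypothetical: drop a ref on QueryRejected (or timeout) with
    // probability 1 - 1/n, where n is how many refs we hold for the
    // rejecting node.  With n == 1 the delete probability is 0, so
    // the last ref always survives.
    public class RefDeletion {
        private static final Random rand = new Random();

        // Returns true if the ref should be deleted.
        public static boolean shouldDelete(int refsToNode) {
            if (refsToNode <= 1)
                return false;                     // never drop the last ref
            double p = 1.0 - 1.0 / refsToNode;    // delete probability
            return rand.nextDouble() < p;
        }
    }

Applying the same rule to timeouts seems right: the more refs a node
has, the faster it sheds weight when it misbehaves, but it can never be
routed out of existence entirely.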

--gj

-- 
Freesite
(0.4) freenet:SSK@npfV5XQijFkF6sXZvuO0o~kG4wEPAgM/homepage//

_______________________________________________
Devl mailing list
Devl at freenetproject.org
http://lists.freenetproject.org/mailman/listinfo/devl
