What on earth are you talking about re deadlocks? What happens, how long
does it take, and CAN YOU PLEASE GET A STACK DUMP (kill -QUIT `cat
freenet.pid`, capturing stderr/stdout)? Or is it a JVM-level problem
(i.e. it doesn't produce a stack dump when it has deadlocked, but does
at other times)? In which case, have you tried a different JVM?
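For reference, SIGQUIT only asks the JVM to print a thread dump to its stdout/stderr; the same information can also be pulled from inside the process on JVMs that support Thread.getAllStackTraces() (added in 1.5, so not the 1.4.1 JVM in the trace below). A minimal sketch, nothing Freenet-specific:

```java
import java.util.Map;

public class StackDumper {
    // Print a stack trace for every live thread -- roughly the
    // information the JVM emits when it receives SIGQUIT
    // (kill -QUIT <pid>).
    public static void dumpAllStacks() {
        Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> e : stacks.entrySet()) {
            Thread t = e.getKey();
            System.err.println("\"" + t.getName() + "\" state=" + t.getState());
            for (StackTraceElement frame : e.getValue()) {
                System.err.println("\tat " + frame);
            }
        }
    }

    public static void main(String[] args) {
        dumpAllStacks();
    }
}
```

Wiring this to a watchdog thread would let a node log its own stacks when the external kill -QUIT route produces nothing.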
On Thu, Oct 09, 2003 at 10:05:09AM -0700, Mike Stump wrote:
> The good news, the numbers for transmitting are back, thanks. FYI,
> there are 28 -2s in the connections table out of 138 claimed open
> connections.
>
> Here is the backtrace from inside gdb of the deadlocked thread that is
> hanging (by spinning) my freenet node:
>
> (gdb) bt
> #0  __pthread_unlock (lock=0x808739c) at spinlock.c:85
> #1  0x4002221f in __pthread_mutex_unlock (mutex=0x808738c) at mutex.c:126
> #2  0x403407ef in ObjectMonitor::exit ()
>    from /freenet/j2re1.4.1_04/lib/i386/client/libjvm.so
> #3  0x403717da in ObjectSynchronizer::slow_exit ()
>    from /freenet/j2re1.4.1_04/lib/i386/client/libjvm.so
> #4  0x40234ca2 in Runtime1::monitorexit ()
>    from /freenet/j2re1.4.1_04/lib/i386/client/libjvm.so
> #5  0x421169a7 in ?? ()
> #6  0x422a4520 in ?? ()
> #7  0x422c8424 in ?? ()
> #8  0x42115480 in ?? ()
> #9  0x420ace04 in ?? ()
> #10 0x420ace04 in ?? ()
> #11 0x420ad140 in ?? ()
> #12 0x422c4445 in ?? ()
> #13 0x420ace04 in ?? ()
> #14 0x420ad140 in ?? ()
> #15 0x42350b14 in ?? ()
> #16 0x42117240 in ?? ()
> #17 0x420ace04 in ?? ()
> #18 0x420ad140 in ?? ()
> #19 0x420aa0f6 in ?? ()
> #20 0x402ac696 in JavaCalls::call_helper ()
>    from /freenet/j2re1.4.1_04/lib/i386/client/libjvm.so
> #21 0x40347455 in os::os_exception_wrapper ()
>    from /freenet/j2re1.4.1_04/lib/i386/client/libjvm.so
> #22 0x402ac774 in JavaCalls::call ()
>    from /freenet/j2re1.4.1_04/lib/i386/client/libjvm.so
> #23 0x402ac234 in JavaCalls::call_virtual ()
>    from /freenet/j2re1.4.1_04/lib/i386/client/libjvm.so
> #24 0x402acc1b in JavaCalls::call_virtual ()
>    from /freenet/j2re1.4.1_04/lib/i386/client/libjvm.so
> #25 0x402f49d0 in thread_entry ()
>    from /freenet/j2re1.4.1_04/lib/i386/client/libjvm.so
> #26 0x4038c867 in JavaThread::thread_main_inner ()
>    from /freenet/j2re1.4.1_04/lib/i386/client/libjvm.so
> #27 0x4038896f in JavaThread::run ()
>    from /freenet/j2re1.4.1_04/lib/i386/client/libjvm.so
> #28 0x40345ef0 in _start ()
>    from /freenet/j2re1.4.1_04/lib/i386/client/libjvm.so
> #29 0x40021567 in pthread_start_thread (arg=0xbdfffe60) at manager.c:213
>
> It was in a loop of being killed with USR2 and then suspending. I
> watched it go through the loop 5 or so times. After attaching, and
> while a QUIT was being sent to try and get a backtrace out of java
> (which never completed), eventually someone killed or shut down the
> main task `cat freenet.pid`, and I looked, and all the other processes
> were gone, except for the one I was debugging.
>
> Just wishing I had full source to everything and were an expert jvm
> debugger, as then I could tell you something interesting... anyway,
> I'll include these details, as they might shed some light for someone
> else, or maybe someone else will tell me what incantation inside gdb I
> could issue to get yet more information...
>
> This is a new deadlock for me, not present in 5028 (or at least more
> rare), but I think I've seen it going back to 6199 at least.

AFAIK nobody else has reported it. What form does it take? Is it just
not responding on the web interface for a short while? Were you heavily
browsing at the time?
Freenet will shut down the web interface for a while if it gets too
many connections (well, technically it limits the number that are
serviced at once and has a limited queue...).

> Anyway, for those keeping tabs, here is how reliable freenet is for
> me:
>
> 4528 Oct  8 23:29 hs_err_pid12035.log
> 4792 Oct  8 22:27 hs_err_pid10467.log
> 4528 Oct  8 21:29 hs_err_pid10066.log
> 4527 Oct  8 18:42 hs_err_pid8356.log
> 4792 Oct  8 18:33 hs_err_pid7644.log
> 4528 Oct  8 18:02 hs_err_pid6994.log
> 4792 Oct  8 15:20 hs_err_pid5181.log
> 4528 Oct  8 13:05 hs_err_pid4286.log
> 4791 Oct  8 12:21 hs_err_pid3870.log
> 4792 Oct  8 12:09 hs_err_pid2803.log
> 4792 Oct  8 11:03 hs_err_pid2093.log
> 4792 Oct  8 10:18 hs_err_pid1510.log
> 4527 Oct  8 09:40 hs_err_pid1092.log
> 4958 Oct  8 09:26 hs_err_pid660.log
> 4787 Oct  8 09:04 hs_err_pid31959.log
> 4958 Oct  8 07:11 hs_err_pid31277.log
> 4528 Oct  8 06:24 hs_err_pid30618.log
> 4641 Oct  7 00:39 hs_err_pid14248.log
> 4762 Oct  6 23:10 hs_err_pid11663.log
> 4763 Oct  6 21:27 hs_err_pid2643.log
> 4762 Oct  6 16:09 hs_err_pid1595.log
> 4763 Oct  6 15:40 hs_err_pid26762.log
> 4791 Oct  6 10:20 hs_err_pid26432.log
> 4793 Oct  6 10:12 hs_err_pid10002.log
> 4792 Oct  6 01:57 hs_err_pid8883.log
> 5749 Oct  6 01:24 hs_err_pid6616.log
> 5889 Oct  6 00:18 hs_err_pid28662.log
> 4762 Oct  5 17:41 hs_err_pid25929.log
> 4762 Oct  5 15:34 hs_err_pid24824.log
> 4786 Oct  5 14:58 hs_err_pid24651.log
> 4763 Oct  5 14:54 hs_err_pid14088.log
> 4762 Oct  5 03:51 hs_err_pid1329.log
> 4761 Oct  5 01:20 hs_err_pid27256.log
> 4762 Oct  5 01:01 hs_err_pid21759.log
> 4528 Oct  4 23:41 hs_err_pid4635.log
> 4529 Oct  4 23:03 hs_err_pid22794.log
> 4528 Oct  4 06:38 hs_err_pid20330.log
> 4792 Oct  4 05:15 hs_err_pid19090.log
> 4640 Oct  4 04:30 hs_err_pid18843.log
> 4763 Oct  4 04:22 hs_err_pid12903.log
> 4787 Oct  4 00:32 hs_err_pid22840.log
> 4793 Oct  3 21:37 hs_err_pid20165.log
> 4763 Oct  3 18:20 hs_err_pid13339.log
> 4763 Oct  3 10:32 hs_err_pid8796.log
> 5616 Oct  3 05:21 hs_err_pid5648.log
> 4786 Oct  2 23:41 hs_err_pid4033.log
> 4953 Sep 26 16:37 hs_err_pid20659.log
> 4801 Sep 26 08:53 hs_err_pid9158.log
> 4801 Sep 26 07:50 hs_err_pid7750.log
> 4801 Sep 26 00:19 hs_err_pid943.log
> 4537 Sep 25 23:24 hs_err_pid31803.log
> 4769 Sep 25 20:08 hs_err_pid27996.log
> 5605 Sep 25 16:17 hs_err_pid24173.log
> 4800 Sep 25 13:38 hs_err_pid17914.log
> 5569 Sep 25 07:35 hs_err_pid12062.log
> 4799 Sep 25 01:37 hs_err_pid10115.log
> 5857 Sep 24 13:55 hs_err_pid3575.log
> 4769 Sep 24 11:25 hs_err_pid1312.log
> 5354 Sep 24 10:06 hs_err_pid555.log
> 4795 Sep 24 09:41 hs_err_pid24786.log
> 4793 Sep 24 01:50 hs_err_pid24172.log
> 4769 Sep 24 01:34 hs_err_pid21815.log
> 4770 Sep 23 23:57 hs_err_pid17753.log
> 4769 Sep 23 19:59 hs_err_pid16187.log
> 4606 Sep 23 18:02 hs_err_pid13074.log
> 4547 Sep 23 15:41 hs_err_pid11273.log
> 4769 Sep 23 14:01 hs_err_pid10034.log
> 4580 Sep 23 07:12 hs_err_pid2732.log
> 4769 Sep 23 07:03 hs_err_pid1192.log
> 4799 Sep 23 05:56 hs_err_pid348.log
> 4649 Sep 23 01:50 hs_err_pid31807.log
> 4796 Sep 22 12:47 hs_err_pid28029.log
> 4796 Sep 22 11:25 hs_err_pid20761.log
> 4772 Sep 22 10:57 hs_err_pid18258.log
> 4796 Sep 22 03:31 hs_err_pid3268.log
> 4537 Sep 22 03:14 hs_err_pid9226.log
> 4796 Sep 22 01:28 hs_err_pid6458.log
> 4537 Sep 22 01:03 hs_err_pid31390.log
> 4771 Sep 21 23:53 hs_err_pid30324.log
> 4796 Sep 21 23:42 hs_err_pid28225.log
> 4772 Sep 21 19:02 hs_err_pid11841.log
> 4801 Sep 21 17:59 hs_err_pid30863.log
> 6671 Sep 21 15:33 hs_err_pid28632.log
> 4771 Sep 21 12:43 hs_err_pid24235.log
> 4650 Sep 21 12:29 hs_err_pid16806.log
> 4757 Sep 21 11:17 hs_err_pid13914.log
> 4795 Sep 21 11:15 hs_err_pid12986.log
> 6044 Sep 21 11:01 hs_err_pid10340.log
> 4772 Sep 21 10:20 hs_err_pid5375.log
> 4795 Sep 20 13:22 hs_err_pid700.log
> 4544 Sep 20 12:33 hs_err_pid20468.log
> 4779 Sep 20 11:27 hs_err_pid13052.log
> 4779 Sep 20 10:34 hs_err_pid6071.log
> 4544 Sep 20 00:52 hs_err_pid25962.log
> 4544 Sep 20 00:31 hs_err_pid17718.log
> 4779 Sep 19 23:31 hs_err_pid14067.log
> 4779 Sep 19 22:16 hs_err_pid8361.log
> 4808 Sep 19 21:26 hs_err_pid27022.log
> 4802 Sep 19 20:14 hs_err_pid24862.log
> 4778 Sep 19 19:44 hs_err_pid21771.log
> 4809 Sep 19 19:27 hs_err_pid1997.log
> 4802 Sep 19 07:30 hs_err_pid23561.log
> 4544 Sep 19 07:17 hs_err_pid18517.log
> 4545 Sep 19 06:07 hs_err_pid10387.log
> 4763 Sep  9 11:33 hs_err_pid13265.log
> 4762 Sep  9 06:33 hs_err_pid9799.log
> 4787 Sep  9 05:29 hs_err_pid8722.log
> 4787 Sep  9 05:10 hs_err_pid6642.log
> 4762 Sep  9 04:39 hs_err_pid3597.log
> 4792 Sep  9 03:44 hs_err_pid32528.log
> 4761 Sep  9 02:46 hs_err_pid32024.log
> 4763 Sep  8 20:21 hs_err_pid16466.log
> 4762 Sep  8 15:57 hs_err_pid12746.log
> 4762 Sep  8 13:39 hs_err_pid10803.log
> 4763 Sep  8 12:17 hs_err_pid8275.log
> 4762 Sep  8 03:42 hs_err_pid912.log
> 4787 Sep  8 02:27 hs_err_pid30491.log
> 4762 Sep  8 01:16 hs_err_pid27764.log
> 4762 Sep  7 23:14 hs_err_pid25283.log
> 4762 Sep  7 21:37 hs_err_pid21474.log
> 4787 Sep  7 19:47 hs_err_pid18134.log
> 4762 Sep  7 18:46 hs_err_pid15528.log
> 4602 Sep  7 17:47 hs_err_pid15377.log
> 5561 Sep  7 17:45 hs_err_pid12760.log
> 4792 Sep  7 15:36 hs_err_pid11776.log
> 4762 Sep  7 14:58 hs_err_pid10592.log
> 4787 Sep  7 14:06 hs_err_pid7204.log
> 4677 Sep  7 00:21 hs_err_pid15110.log
> 4528 Sep  6 23:04 hs_err_pid13011.log
> 4762 Sep  6 21:56 hs_err_pid11158.log
> 4762 Sep  6 18:15 hs_err_pid4603.log
> 5882 Sep  6 17:28 hs_err_pid29273.log
> 4762 Sep  6 12:32 hs_err_pid13808.log
> 4762 Sep  6 10:55 hs_err_pid6164.log
> 5954 Sep  6 08:22 hs_err_pid2418.log
> 4762 Sep  6 06:34 hs_err_pid32618.log
> 4637 Sep  6 06:00 hs_err_pid26723.log
> 4792 Sep  6 03:00 hs_err_pid22660.log
> 4762 Sep  6 01:32 hs_err_pid21498.log
> 5470 Sep  5 22:13 hs_err_pid16017.log
> 4787 Sep  5 21:26 hs_err_pid9388.log
> 5692 Sep  5 20:07 hs_err_pid8327.log
> 4792 Sep  5 19:49 hs_err_pid1680.log
> 4792 Sep  5 17:45 hs_err_pid714.log
> 4762 Sep  5 17:14 hs_err_pid32018.log
> 4762 Sep  5 16:34 hs_err_pid30291.log
> 4762 Sep  5 15:30 hs_err_pid28465.log
> 4761 Sep  5 14:07 hs_err_pid28252.log
> 4762 Sep  5 14:00 hs_err_pid25131.log
> 4792 Sep  5 11:22 hs_err_pid21919.log
> 4762 Sep  5 09:49 hs_err_pid19042.log
> 4762 Sep  5 07:44 hs_err_pid18395.log
> 4763 Sep  5 07:13 hs_err_pid14489.log
> 4637 Sep  5 03:18 hs_err_pid11780.log
> 4763 Sep  4 23:56 hs_err_pid4187.log
> 5883 Sep  4 16:21 hs_err_pid296.log
> 4787 Sep  4 05:20 hs_err_pid19744.log
> 6382 Aug 30 10:34 hs_err_pid32241.log
> 4763 Aug 30 10:01 hs_err_pid24146.log
> 4791 Aug 27 22:19 hs_err_pid9957.log
> 4787 Aug 27 22:08 hs_err_pid7407.log
> 4787 Aug 27 21:25 hs_err_pid4796.log
> 4792 Aug 27 14:00 hs_err_pid23650.log
> 4763 Aug 27 13:31 hs_err_pid20597.log
> 4793 Aug 26 21:30 hs_err_pid8217.log
> 4793 Aug 26 16:09 hs_err_pid28691.log
> 4763 Aug 26 09:27 hs_err_pid15365.log
> 4763 Aug 26 01:34 hs_err_pid12346.log
> 4762 Aug 25 22:10 hs_err_pid10665.log
> 4762 Aug 25 20:31 hs_err_pid8784.log
> 4762 Aug 25 17:47 hs_err_pid7983.log
> 4788 Aug 25 17:09 hs_err_pid4371.log
> 4763 Aug 25 12:53 hs_err_pid299.log
> 4787 Aug 24 22:15 hs_err_pid4018.log
> 4763 Aug 24 20:47 hs_err_pid30265.log
> 4780 Aug 24 09:27 hs_err_pid21258.log
> 4594 Aug 24 01:52 hs_err_pid18846.log
> 5665 Aug 24 00:18 hs_err_pid14549.log
> 5665 Aug 12 10:38 hs_err_pid3337.log
> 4791 Aug 12 07:44 hs_err_pid26335.log
> 4792 Aug 12 05:02 hs_err_pid18441.log
> 5699 Aug  9 01:08 hs_err_pid13643.log
> 4787 Aug  8 17:31 hs_err_pid11952.log
> 4787 Aug  8 17:04 hs_err_pid10290.log
> 4787 Aug  8 16:41 hs_err_pid7185.log
> 4787 Aug  8 15:21 hs_err_pid783.log
> 4787 Aug  8 10:57 hs_err_pid21975.log
> 4787 Aug  8 08:18 hs_err_pid18476.log
> 4540 Aug  8 07:00 hs_err_pid15211.log
> 5692 Aug  8 06:20 hs_err_pid12398.log
> 4842 Aug  8 04:31 hs_err_pid5758.log
> 5887 Aug  7 08:34 hs_err_pid20422.log
> 4792 Aug  5 10:39 hs_err_pid26685.log
> 4792 Aug  5 05:40 hs_err_pid21880.log
> 4785 Aug  5 01:01 hs_err_pid21760.log
> 4791 Aug  4 02:38 hs_err_pid14058.log
> 4786 Aug  4 01:12 hs_err_pid13334.log
> 5882 Aug  4 00:59 hs_err_pid3641.log
> 4595 Aug  3 19:11 hs_err_pid19001.log
> 4790 Aug  3 15:04 hs_err_pid18125.log
> 4528 Aug  3 14:49 hs_err_pid16394.log
> 4528 Aug  3 13:23 hs_err_pid10103.log
> 4637 Aug  3 09:23 hs_err_pid16938.log
> 4594 Aug  3 01:13 hs_err_pid15770.log
> 4787 Aug  3 00:30 hs_err_pid13129.log
> 4792 Aug  2 19:00 hs_err_pid303.log
> 5293 Aug  2 10:59 hs_err_pid22450.log

I don't get it.

> gaps are the really bad errors where freenet just hangs, or takes out
> my cable modem (don't like the new cable modems from comcast) and I
> get around to kicking things, or I just have to kill it and leave it
> dead, because it is behaving so badly. :-( MTBF was around 1 hour
> with 6223, but, at least it died hard so my restart script brought
> things back up. It remains to be seen if 6226 will fault hard before
> deadlocking so that I can even get an MTBF. Maybe I'll add deadlock
> checking to my restart script. :-( I've only seen 1 hard fault in
> 6266 so far (and the one deadlock), oops, make that two, just saw the
> second one while composing this message.

I don't know what you mean by 'died hard'. My node is running after at
least 5 hours 40 minutes and apparently responsive.

> Hum, could we cache tokens in the datastore to help bring up
> connections faster and so that they are cheaper in terms of CPU?

No. There is a protocol worked out for (MUCH) cheaper connection
restarts, but it has not been implemented yet, and freenet needs
debugging more than major new features at the moment (although I may do
multiplexing before 0.6).

> Think of them as per-session restart keys that change on each
> negotiated restart of a session, so that the next time we start up a
> session to the node, we first try claiming that we are the same node
> as was here last time. We could have an upper limit of, say, 5000 such
> things to keep long-term store from growing without bound.
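For what it's worth, the bounded "remember up to 5000 restart tokens" idea maps naturally onto an LRU map. A hypothetical sketch (class and method names invented here, not Freenet API) using java.util.LinkedHashMap's eviction hook:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-peer restart-token cache: keeps at most MAX_TOKENS
// entries, evicting the least recently used one, so the long-term
// store cannot grow without bound. Not actual Freenet code.
public class RestartTokenCache {
    private static final int MAX_TOKENS = 5000;

    private final Map<String, byte[]> tokens =
        new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > MAX_TOKENS;  // drop LRU entry past the cap
            }
        };

    // Remember the token negotiated during the last session with this peer.
    public synchronized void put(String peerId, byte[] token) {
        tokens.put(peerId, token);
    }

    // Token to offer when reconnecting; null means a full handshake is needed.
    public synchronized byte[] get(String peerId) {
        return tokens.get(peerId);
    }

    public synchronized int size() {
        return tokens.size();
    }
}
```

The access-order constructor flag means a token refreshed on each reconnect stays warm, while peers never heard from again age out first.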
> More late-breaking news:
>
> Oct 9, 2003 3:10:01 PM (freenet.support.BlockingQueue, write interface thread,
> NORMAL): Waited more than 200ms to dequeue, 7 in queue, 2427 millis since enqueued
> last item, 245 maximum waits so far - could indicate serious JVM bug.

Probably just a scheduling artifact. We don't really know what causes
these.

> and more normal-type errors from the log:
>
> Oct 9, 2003 3:13:02 PM (freenet.node.states.request.DataPending, QThread-63, ERROR):
> Got Accepted before got SendFinished on freenet.node.states.request.DataPending:
> key=27f9549b9930c1f1460f654b34731837d89000f3120302, hopsToLive=13,
> id=b8fdbc740aa1ec58, [EMAIL PROTECTED]
> (27f9549b9930c1f1460f654b34731837d89000f3120302,request),[EMAIL
> PROTECTED]@1065712382653, routedTime=1065712375069, replyTime=-1

Some bug - we are missing notifications. Known, reproduced locally,
will debug.

> Oct 9, 2003 3:13:31 PM (freenet.node.states.request.DataPending, QThread-65, ERROR):
> Got Accepted before got SendFinished on freenet.node.states.request.DataPending:
> key=a11b3f9a53d8cd242a997d9f4e8fea22d92e4dcb140302, hopsToLive=11,
> id=8b30122be764caf4, [EMAIL PROTECTED]
> (a11b3f9a53d8cd242a997d9f4e8fea22d92e4dcb140302,request),[EMAIL
> PROTECTED]@1065712411354, routedTime=1065712403582, replyTime=-1
> Oct 9, 2003 3:14:00 PM (freenet.node.states.request.DataPending, QThread-46, ERROR):
> Got Accepted before got SendFinished on freenet.node.states.request.DataPending:
> key=2f926ec73229fcf69ac39d89f3d1c814dda59403120302, hopsToLive=6,
> id=9d0e94545cceb816, [EMAIL PROTECTED]
> (2f926ec73229fcf69ac39d89f3d1c814dda59403120302,request),[EMAIL
> PROTECTED]@1065712440731, routedTime=1065712435748, replyTime=-1
>
> Another screwy idea: could we cache all query answers (other than
> RNF/DNF/QR) in the data store and return these without doing quite as
> much work (queries to other nodes), maybe timing them out after 1-2
> months.
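The "waited more than 200ms to dequeue" warning above can be reproduced with a simple wrapper that times each dequeue. A rough sketch of the idea only (not the actual freenet.support.BlockingQueue source; the 200ms threshold is taken from the log message):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a queue that warns when a consumer had to wait an
// unusually long time to dequeue even though more items are pending --
// the symptom the NORMAL-level log message above reports.
public class TimedQueue<T> {
    private static final long WARN_MILLIS = 200;

    private final BlockingQueue<T> queue = new LinkedBlockingQueue<T>();
    private long slowWaits = 0;  // times the threshold was exceeded

    public void enqueue(T item) throws InterruptedException {
        queue.put(item);
    }

    public T dequeue() throws InterruptedException {
        long start = System.currentTimeMillis();
        T item = queue.take();
        long waited = System.currentTimeMillis() - start;
        if (waited > WARN_MILLIS && !queue.isEmpty()) {
            slowWaits++;
            System.err.println("Waited " + waited + "ms to dequeue, "
                + queue.size() + " still in queue, " + slowWaits
                + " slow waits so far - could indicate a scheduling problem.");
        }
        return item;
    }

    public long slowWaits() { return slowWaits; }
}
```

A long wait with the queue non-empty points at the scheduler or the JVM rather than at a genuinely empty queue, which is why the original message hedges with "could indicate serious JVM bug".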
> The idea is that it takes (Highest global time estimate 289412ms) 4.8
> minutes to find data now, and if someone asks the same question in 1
> minute, we spend another 4.8 minutes even though I have

Highest global time estimate only includes successful queries, and
includes the time to transfer the entire file.

> 61 gigs free in my data store that could be used to remember this
> stuff. This would lower the load my node places on other nodes and
> the load they place on me, and speed the answers I give them. If you
> want better quality answers, I could contact the referenced node and
> query it for the data if the cached answer is more than, say, 3 days
> old. Then, if they say they still have it, presto, update the
> timestamp on my cached answer, and give the answer (in the usual way,
> maybe resetting the data source) to the requester.
>
> I could even preload answers to queries that haven't been issued yet
> by remembering all keys I give to other nodes. When I am overloaded
> (or just want to confuse who had what data when), and someone asks me
> for a key I have, I could answer the query with the cached answer.
> Notice, I can then have dozens of answers for any individual key. The
> more popular a key (the more I give it out), the more offload capacity
> for the key I have. The need to query-reject because I'm overloaded
> should ease, and if the protocol supports it, I could even give back
> 2-4 answers for the query to improve their odds of finding the data
> quickly.
>
> In fact, inserts could be a sort of inverse query where you say "I
> have (or can get) key X", and you hand that out to all your friends,
> and they to theirs, until the TTL expires and presto, many nodes can
> find it, all with a small datagram, instead of having to move the full
> weight of the data around. It will move when queried anyway. It
> should be as anonymous as a normal insert, which does the same exact
> thing, except that it moves the data.
> We could even do one of the two methods with some type of
> probability, the idea being that one can insert, say, 40 deep, with a
> per-hop probability of a real insert of 1/20, and wind up with three
> nodes that actually have the data in the end: the originator, and two
> others. Or maybe we insert 100 deep and have 6 copies.

See above. And read the paper explaining the protocol.

> And the last idea for the day: could we have two nodes on connection
> open select one side as closer to the root, and form a spanning tree
> across the whole network, in which once an hour we tell the root what
> the max radius to any of our leaves is, the sum of bytes in and out
> for us and people downstream, the sum of the count of keys in the
> cache for us and people downstream, the sum of the number of nodes
> downstream (+1 for us), and other such statistically vague
> information; and once an hour the root tells all the people downstream
> who the current root is, and all the global stats. By once an hour, I
> don't mean at x:00, I mean at 60+rand(1)*5 seconds from the last time
> we did it; this prevents thumping the network in unison. I think it
> would be good to have some sense of the size of the network.

Umm, why? And how?

> The definition of closer to the root can be: node with the longest
> continuous uptime; in event of tie (unlikely), < node id (DSA).

-- 
Matthew J Toseland - [EMAIL PROTECTED]
Freenet Project Official Codemonkey - http://freenetproject.org/
ICTHUS - Nothing is impossible. Our Boss says so.
_______________________________________________
Devl mailing list
[EMAIL PROTECTED]
http://dodo.freenetproject.org/cgi-bin/mailman/listinfo/devl