I eyeballed the stack traces and did a mini-audit of the LinkedBlockingDeque code, looking for possible bugs, and came up empty. Maybe it's a deep bug in HotSpot?
Ariel, it would be good if you could get a reproducible test case soonish, while someone on the planet has the motivation and familiarity to fix it. In another month I may disavow all knowledge of j.u.c.*Blocking*

Martin

On Wed, Jul 8, 2009 at 15:57, Ariel Weisberg <ar...@weisberg.ws> wrote:

> Hi,
>
> > The poll()ing thread is blocked waiting for the internal lock, but there's
> > no indication of any thread owning that lock. You're using an OpenJDK 6
> > build ... can you try JDK7 ?
>
> I got a chance to do that today. I downloaded JDK 7 from
> http://www.java.net/download/jdk7/binaries/jdk-7-ea-bin-b63-linux-x64-02_jul_2009.bin
> and was able to reproduce the problem. I have attached the stack trace
> from running the 1.7 version. It is the same situation as before except
> there are 9 execution sites running on each host. There are no threads
> that are missing or that have been restarted. Foo Network thread
> (selector thread) and Network Thread - 0 are waiting on
> 0x00002aaab43d3b28. I also ran with JDK 7 and 6 and LinkedBlockingQueue
> and was not able to recreate the problem using that structure.
>
> > I don't recall anything similar to this, but I don't know what version that
> > OpenJDK6 build relates to.
>
> The cluster is running on CentOS 5.3.
> >[aweisb...@3f ~]$ rpm -qi java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5
> >Name        : java-1.6.0-openjdk           Relocations: (not relocatable)
> >Version     : 1.6.0.0                      Vendor: CentOS
> >Release     : 0.30.b09.el5                 Build Date: Tue 07 Apr 2009 07:24:52 PM EDT
> >Install Date: Thu 11 Jun 2009 03:27:46 PM EDT   Build Host: builder10.centos.org
> >Group       : Development/Languages        Source RPM: java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5.src.rpm
> >Size        : 76336266                     License: GPLv2 with exceptions
> >Signature   : DSA/SHA1, Wed 08 Apr 2009 07:55:13 AM EDT, Key ID a8a447dce8562897
> >URL         : http://icedtea.classpath.org/
> >Summary     : OpenJDK Runtime Environment
> >Description :
> >The OpenJDK runtime environment.
>
> > Make sure you haven't missed any exceptions occurring in other threads.
>
> There are no threads missing in the application (terminated threads are
> not replaced) and there is a try catch pair (prints error and rethrows)
> around the run loop of each thread. It is possible that an exception may
> have been swallowed up somewhere.
>
> >A small reproducible test case from you would be useful.
>
> I am working on that. I wrote a test case that mimics the application's
> use of the LBD, but I have not succeeded in reproducing the problem in
> the test case. The app has a single thread (network selector) that polls
> the LBD and several threads (ExecutionSites, and network threads that
> return results from remote ExecutionSites) that offer results into the
> queue. About 120k items will go into/out of the deque each second. In
> the actual app the problem is reproducible but inconsistent. If I run on
> my dual core laptop I can't reproduce it, and it is less likely to occur
> with a small cluster, but with 6 nodes (~560k transactions/sec) the
> problem will usually appear. Sometimes the cluster will run for several
> minutes without issue and other times it will deadlock immediately.
>
> Thanks,
>
> Ariel
>
> On Wed, 08 Jul 2009 05:14 +1000, "Martin Buchholz"
> <marti...@google.com> wrote:
> >[+core-libs-dev]
> >
> >Doug Lea and I are (slowly) working on a new version of LinkedBlockingDeque.
> >I was not aware of a deadlock but can vaguely imagine how it might happen.
> >A small reproducible test case from you would be useful.
> >
> >Unfinished work in progress can be found here:
> >http://cr.openjdk.java.net/~martin/webrevs/openjdk7/BlockingQueue/
> >
> >Martin
>
> On Wed, 08 Jul 2009 05:14 +1000, "David Holmes"
> <davidchol...@aapt.net.au> wrote:
> >
> > Ariel,
> >
> > The poll()ing thread is blocked waiting for the internal lock, but there's
> > no indication of any thread owning that lock. You're using an OpenJDK 6
> > build ... can you try JDK7 ?
> >
> > I don't recall anything similar to this, but I don't know what version that
> > OpenJDK6 build relates to.
> >
> > Make sure you haven't missed any exceptions occurring in other threads.
> >
> > David Holmes
> >
> > > -----Original Message-----
> > > From: concurrency-interest-boun...@cs.oswego.edu
> > > [mailto:concurrency-interest-boun...@cs.oswego.edu]on Behalf Of Ariel
> > > Weisberg
> > > Sent: Wednesday, 8 July 2009 8:31 AM
> > > To: concurrency-inter...@cs.oswego.edu
> > > Subject: [concurrency-interest] LinkedBlockingDeque deadlock?
> > >
> > >
> > > Hi all,
> > >
> > > I did a search on LinkedBlockingDeque and didn't find anything similar
> > > to what I am seeing. Attached is the stack trace from an application
> > > that is deadlocked with three threads waiting for 0x00002aaab3e91080
> > > (threads "ExecutionSite: 26", "ExecutionSite:27", and "Network
> > > Selector"). The execution sites are attempting to offer results to the
> > > deque and the network thread is trying to poll for them using the
> > > non-blocking version of poll. I am seeing the network thread never
> > > return from poll (straight poll()). Do my eyes deceive me?
> > >
> > > Thanks,
> > >
> > > Ariel Weisberg
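
For anyone who wants to try for a standalone reproduction, the usage pattern described in the thread (one thread draining the deque with the non-blocking poll(), several threads offering into it) might be sketched roughly as below. This is not Ariel's test case, which was not posted. The class name LbdStressSketch, the run() helper, the thread counts, and the 60-second watchdog are all assumptions made for illustration, and the sketch assumes Java 8 or later for the lambdas. Per the thread, a small test of this kind did not reproduce the hang on a dual-core machine, so a clean run proves little.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stress sketch: one poll()ing consumer, several offer()ing producers.
final class LbdStressSketch {

    /** Runs the pattern once and returns how many items the poller drained. */
    static long run(int producers, int itemsPerProducer) throws InterruptedException {
        final LinkedBlockingDeque<Integer> deque = new LinkedBlockingDeque<>();
        final long expected = (long) producers * itemsPerProducer;
        final AtomicLong consumed = new AtomicLong();
        final CountDownLatch done = new CountDownLatch(1);

        // Single consumer, standing in for the network selector thread.
        // It only ever calls the non-blocking poll().
        Thread poller = new Thread(() -> {
            while (consumed.get() < expected) {
                if (deque.poll() != null) {
                    consumed.incrementAndGet();
                } else {
                    Thread.yield(); // deque was empty; let producers run
                }
            }
            done.countDown();
        }, "poller");
        poller.setDaemon(true);
        poller.start();

        // Producers, standing in for the ExecutionSite and network threads.
        for (int p = 0; p < producers; p++) {
            Thread t = new Thread(() -> {
                for (int i = 0; i < itemsPerProducer; i++) {
                    deque.offer(i); // unbounded deque, so offer() always succeeds
                }
            }, "producer-" + p);
            t.setDaemon(true);
            t.start();
        }

        // Watchdog: if the poller stops making progress, report instead of hanging.
        if (!done.await(60, TimeUnit.SECONDS)) {
            System.err.println("Possible hang: drained " + consumed.get()
                    + " of " + expected + "; take a thread dump now");
        }
        return consumed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("drained " + run(4, 250_000));
    }
}
```

All threads are daemons so that a genuine hang does not keep the JVM alive after the watchdog fires. Raising the producer count and running on a machine with many cores is probably closer to the conditions under which the problem was reported.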