What is happening is that LogicalConnection.close() is called, which attempts to recycle the physical connection by calling ClientPooledConnection.recycleConnection(). At the same time, ClientPooledConnection.close() is called, which attempts to call LogicalConnection.nullPhysicalConnection(). The first thread holds the lock on LogicalConnection and needs the lock on ClientPooledConnection, while the second thread holds the lock on ClientPooledConnection and needs the lock on LogicalConnection, so a deadlock occurs.
That definitely seems like a bad design; your description is quite clear and makes the problem really stand out. Thanks for all the hard work on this problem!

I believe that the surest means to avoid such problems is to establish and keep to a single well-defined order of synchronization. My intuition is that the proper order should be logical connection first, physical connection second; do you have a sense for whether there are any other places where we try to move in the other direction?

It is often useful to catalog the current synchronization behaviors; sometimes I simply insert some debugging code into the Derby libraries to capture those behaviors and run the code to observe what is occurring. However we do it, having a nice table of code paths where we lock these objects, and the order in which we lock them, would help us determine what order is followed by the current code in most cases.

Once we are clear on what the correct order of synchronization should be, there are essentially two techniques for repairing code which is violating the ordering (i.e., code which tries to lock the primary object while holding the lock on the secondary object):

1) Change some method higher in the call stack so that it locks the primary object first, before calling this problematic method on the secondary object.
2) Change the problematic method itself, so it doesn't try to lock the primary object.

So, e.g., if we determine that the problem is where ClientPooledConnection.close() calls LogicalConnection.nullPhysicalConnection(), then we could either:

1) Change the caller of ClientPooledConnection.close() to first lock the LogicalConnection before calling close(), or
2) Change ClientPooledConnection.close() so that it doesn't call nullPhysicalConnection().

Can you provide the complete stack traces of the problematic deadlock when it occurs?

thanks,

bryan
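
P.S. To make the first technique concrete, here is a rough sketch, not the actual Derby code. SketchLogical, SketchPooled, LockOrderSketch, closePooled, runOnce and the orderViolations counter are all hypothetical names invented for illustration; only the method names close(), recycleConnection() and nullPhysicalConnection() mirror the ones discussed above. It assumes the order "logical first, physical second", and it uses Thread.holdsLock() as one possible form of the debugging code mentioned above, counting callers that reach the pooled connection's close() without already holding the logical lock.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for LogicalConnection (the primary lock in this sketch).
class SketchLogical {
    private SketchPooled physical;

    SketchLogical(SketchPooled physical) { this.physical = physical; }

    // Locks logical (this) first, then physical: the assumed order.
    synchronized void close() {
        if (physical != null) {
            physical.recycleConnection();
        }
    }

    synchronized void nullPhysicalConnection() {
        physical = null;
    }
}

// Hypothetical stand-in for ClientPooledConnection (the secondary lock).
class SketchPooled {
    // Debugging aid: counts calls to close() made without the primary lock.
    static final AtomicInteger orderViolations = new AtomicInteger();
    private SketchLogical logical;

    synchronized void setLogical(SketchLogical logical) { this.logical = logical; }

    synchronized void recycleConnection() {
        // A real implementation would reset connection state for reuse here.
    }

    synchronized void close() {
        if (logical != null) {
            if (!Thread.holdsLock(logical)) {
                orderViolations.incrementAndGet();
            }
            // Re-entrant when the caller already holds the logical lock.
            logical.nullPhysicalConnection();
            logical = null;
        }
    }
}

class LockOrderSketch {
    // Technique 1: the caller locks the primary object before calling close().
    static void closePooled(SketchLogical logical, SketchPooled pooled) {
        synchronized (logical) {
            pooled.close();
        }
    }

    // Runs both close paths concurrently; true if both threads finish in time.
    static boolean runOnce() {
        SketchPooled pooled = new SketchPooled();
        SketchLogical logical = new SketchLogical(pooled);
        pooled.setLogical(logical);
        Thread a = new Thread(logical::close);
        Thread b = new Thread(() -> closePooled(logical, pooled));
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();
        try {
            a.join(5000);
            b.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !a.isAlive() && !b.isAlive();
    }
}
```

Both threads in this sketch take the logical lock before the pooled lock, so neither can end up holding one while waiting for the other. A counter like orderViolations is also a cheap way to start building the table of code paths and lock orders: any path that increments it is moving in the other direction.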