> On Aug. 7, 2012, 4:51 p.m., Alan Conway wrote: > > Where is "the assignment is done outside of the lock" in the old code? The > > assignment in notifyConnection seems to have the same locking in both old > > and new code. > > Kenneth Giusti wrote: > You're correct, the call to link::established() still happens outside the > lock (it has to, to avoid lock inversion). > > What the patch does is prevent simultainous calls into > LinkRegistry::notifyConnection (from parallel threads) from being able to > mistakenly select the _same_ pending Link. It does this by holding all > "pending" Links in a container, and removing them - while locked - as they > are selected. Notice the call to pendingLinks.erase() while the lock is held. > > This prevents two threads from selecting the same pending Link. > > Alan Conway wrote: > So it is possible for multiple connections to match the same link? In > which case, how do you know you've selected the correct one? > > Kenneth Giusti wrote: > Yes - it is possible to have multiple links running between the same two > brokers. You can create such links via management (QMF) by assigning unique > names (since they will have the same remote address). > > The link really doesn't care which connection it gets as long as the > connection is to the right destination. So if two links to the same > destination request connections at the same moment, it really doesn't matter > which link gets assigned which connection. > > As an aside: a -much- better solution would be to have the Link that > requested the connection get notified of the connection directly, rather > having the notification go through the LinkRegistry, which then has to find a > Link that is pending a connection to the remote. The LinkRegistry really > shouldn't be in the middle of the connection assignment. I looked at doing > that awhile back, but the implementation started getting complex....
OK, gotcha. Agreed with your aside. Ship it:) - Alan ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/6396/#review9965 ----------------------------------------------------------- On Aug. 6, 2012, 5:51 p.m., Kenneth Giusti wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/6396/ > ----------------------------------------------------------- > > (Updated Aug. 6, 2012, 5:51 p.m.) > > > Review request for qpid, Alan Conway, Gordon Sim, and Ted Ross. > > > Description > ------- > > Occasionally, the cluster tests will fail in the > test_federation_multilink_failover test with the following error: > > cluster_tests.ShortTests.test_federation_multilink_failover > ................................................................................................................... > fail > Error during test: Traceback (most recent call last): > File > "/home/kgiusti/Desktop/work/qpid/trunk/build/qpid/src/tests/python/commands/qpid-python-test", > line 340, in run > phase() > File > "/home/kgiusti/Desktop/work/qpid/trunk/qpid/cpp/src/tests/cluster_tests.py", > line 992, in test_federation_multilink_failover > assert self._verify_federation(src_cluster[1], "FedX/two", > dst_cluster[1], "destQ2") > AssertionError > > The problem is due to a race condition in the LinkRegistry code. When a new > connection event occurs for a federation Link, the LinkRegistry attempts to > find a Link instance that is attempting to connect to the remote in order to > assign the connection. The problem is due to the fact that the search for > the target link is done under a lock, but the assignment is done outside of > the lock (to prevent lock inversion). > > The proposed fix has LinkRegistry hold all disconnected Links in a separate > container, and perform the search of that container (and the removal on > match) while holding a lock. > > > This addresses bug QPID-4193. > https://issues.apache.org/jira/browse/QPID-4193 > > > Diffs > ----- > > /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.h 1368984 > /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.cpp 1368984 > /trunk/qpid/cpp/src/qpid/cluster/Connection.cpp 1368984 > /trunk/qpid/cpp/src/qpid/cluster/UpdateClient.cpp 1368984 > > Diff: https://reviews.apache.org/r/6396/diff/ > > > Testing > ------- > > Federation and cluster unit tests. > Ran test_federation_multilink_failover repeatedly with no crash. > > > Thanks, > > Kenneth Giusti > >
