> On Aug. 7, 2012, 4:51 p.m., Alan Conway wrote:
> > Where is "the assignment is done outside of the lock" in the old code? The 
> > assignment in notifyConnection seems to have the same locking in both old 
> > and new code.

You're correct, the call to link::established() still happens outside the lock 
(it has to, to avoid lock inversion).

What the patch does is prevent simultainous calls into 
LinkRegistry::notifyConnection (from parallel threads) from being able to 
mistakenly select the _same_ pending Link.  It does this by holding all 
"pending" Links in a container, and removing them - while locked - as they are 
selected.  Notice the call to pendingLinks.erase() while the lock is held.

This prevents two threads from selecting the same pending Link.


- Kenneth


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6396/#review9965
-----------------------------------------------------------


On Aug. 6, 2012, 5:51 p.m., Kenneth Giusti wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6396/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2012, 5:51 p.m.)
> 
> 
> Review request for qpid, Alan Conway, Gordon Sim, and Ted Ross.
> 
> 
> Description
> -------
> 
> Occasionally, the cluster tests will fail in the 
> test_federation_multilink_failover test with the following error:
> 
> cluster_tests.ShortTests.test_federation_multilink_failover 
> ...................................................................................................................
>  fail
> Error during test:  Traceback (most recent call last):
>     File 
> "/home/kgiusti/Desktop/work/qpid/trunk/build/qpid/src/tests/python/commands/qpid-python-test",
>  line 340, in run
>       phase()
>     File 
> "/home/kgiusti/Desktop/work/qpid/trunk/qpid/cpp/src/tests/cluster_tests.py", 
> line 992, in test_federation_multilink_failover
>       assert self._verify_federation(src_cluster[1], "FedX/two", 
> dst_cluster[1], "destQ2")
>   AssertionError
> 
> The problem is due to a race condition in the LinkRegistry code.  When a new 
> connection event occurs for a federation Link, the LinkRegistry attempts to 
> find a Link instance that is attempting to connect to the remote in order to 
> assign the connection.  The problem is due to the fact that the search for 
> the target link is done under a lock, but the assignment is done outside of 
> the lock (to prevent lock inversion).
> 
> The proposed fix has LinkRegistry hold all disconnected Links in a separate 
> container, and perform the search of that container (and the removal on 
> match) while holding a lock.
> 
> 
> This addresses bug QPID-4193.
>     https://issues.apache.org/jira/browse/QPID-4193
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.h 1368984 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/Connection.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/UpdateClient.cpp 1368984 
> 
> Diff: https://reviews.apache.org/r/6396/diff/
> 
> 
> Testing
> -------
> 
> Federation and cluster unit tests.
> Ran test_federation_multilink_failover repeatedly with no crash.
> 
> 
> Thanks,
> 
> Kenneth Giusti
> 
>

Reply via email to