----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/50142/#review142571 -----------------------------------------------------------
Ship it! Ship It! - Gordon Sim On July 18, 2016, 4:04 p.m., Alan Conway wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/50142/ > ----------------------------------------------------------- > > (Updated July 18, 2016, 4:04 p.m.) > > > Review request for qpid, Gordon Sim and Ted Ross. > > > Repository: qpid-cpp > > > Description > ------- > > Fixed bug in backup recovery consistent with stack traces from a customer and > similar traces reproduced as explained in > https://bugzilla.redhat.com/show_bug.cgi?id=1333767#c19. > > The bug was in BrokerReplicator: an interrupted failover left the > UpdateTracker > in place with a queue list. The next failover created a new UpdateTracker and > deleted the old one, which triggered deletion of queues. This caused > re-entrant > deletion of bridges while Link::ioThreadProcessing is iterating over the > bridge > list, which could cause the iteration to go past the end of the list and > access > invalid memory. > > The exact circumstances are hard to reproduce, it requires multiple steps: > > 1. Backup receives queue creation events and calls to Link::add() with new > bridges, but is disconnected from primary before the > Link::ioThreadProcessing() > event. This puts the new bridges on the Link::created list *before* the > BrokerReplicator bridge. > > 2. Backup re-connects and creates an UpdateTracker with a list of all queues. > Before the update is complete, the backup is disconnected leaving the > UpdateTracker in place. > > 3. Backup re-connects again and deletes the old UpdateTracker, which deletes > all > the queues and their bridges in the Link::created vector. This means the > iterator > in Link::ioThreadProcessing now points after the end of the vector. > > Step 1 is needed because normally the BrokerReplicator is the first bridge in > the list, and deleting queues after it leaves the iterator inside bounds. > There > needs to be at least one deleted queue before the BrokerReplicator in the > list. > > There are a number of reasons a failover could be interrupted: > > - two primaries are killed in quick succession. > - primary disconnects an expected "ready" backup at the ha-backup-timeout. > - a backup misses a heartbeat during failover. > > > Diffs > ----- > > src/qpid/ha/BrokerReplicator.cpp d664f13893a075dae459fbc8d50bbb44751832ba > > Diff: https://reviews.apache.org/r/50142/diff/ > > > Testing > ------- > > Passes qpid ctest suite. No auto regression test, not able to reliabily > reproduce. Long run stress tests have not shown the bug since the patch. > > > Thanks, > > Alan Conway > >
