-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50142/
-----------------------------------------------------------
Review request for qpid, Gordon Sim and Ted Ross.
Repository: qpid-cpp
Description
-------
Fixed bug in backup recovery consistent with stack traces from a customer and
similar traces reproduced as explained in
https://bugzilla.redhat.com/show_bug.cgi?id=1333767#c19.
The bug was in BrokerReplicator: an interrupted failover left the UpdateTracker
in place with a queue list. The next failover created a new UpdateTracker and
deleted the old one, which triggered deletion of queues. This caused re-entrant
deletion of bridges while Link::ioThreadProcessing is iterating over the bridge
list, which could cause the iteration to go past the end of the list and access
invalid memory.
The exact circumstances are hard to reproduce, it requires multiple steps:
1. Backup receives queue creation events and calls to Link::add() with new
bridges, but is disconnected from primary before the Link::ioThreadProcessing()
event. This puts the new bridges on the Link::created list *before* the
BrokerReplicator bridge.
2. Backup re-connects and creates an UpdateTracker with a list of all queues.
Before the update is complete, the backup is disconnected leaving the
UpdateTracker in place.
3. Backup re-connects again and deletes the old UpdateTracker, which deletes all
the queues and their bridges in the Link::created vector. This means the
iterator
in Link::ioThreadProcessing now points after the end of the vector.
Step 1 is needed because normally the BrokerReplicator is the first bridge in
the list, and deleting queues after it leaves the iterator inside bounds. There
needs to be at least one deleted queue before the BrokerReplicator in the list.
There are a number of reasons a failover could be interrupted:
- two primaries are killed in quick succession.
- primary disconnects an expected "ready" backup at the ha-backup-timeout.
- a backup misses a heartbeat during failover.
Diffs
-----
src/qpid/ha/BrokerReplicator.cpp d664f13893a075dae459fbc8d50bbb44751832ba
Diff: https://reviews.apache.org/r/50142/diff/
Testing
-------
Passes qpid ctest suite. No auto regression test, not able to reliabily
reproduce. Long run stress tests have not shown the bug since the patch.
Thanks,
Alan Conway