Repository: kudu
Updated Branches:
  refs/heads/master 034b2af86 -> f0d52ab64


consensus_peers-test: fix flake in TSAN builds

TSAN builds were significantly flaky with the following error:

==8581==ERROR: AddressSanitizer: heap-use-after-free on address 0x615000003fe8 at pc 0x7fde311d3229 bp 0x7fde16b56a10 sp 0x7fde16b56a08
READ of size 8 at 0x615000003fe8 thread T12 (test-raft-pool )
    #0 0x7fde311d3228 in std::_Hashtable<std::string, std::pair<std::string const, kudu::consensus::PeerMessageQueue::TrackedPeer*>, std::allocator<std::pair<std::string const, kudu::consensus::PeerMessageQueue::TrackedPeer*> >, std::__detail::_Se
    #1 0x7fde311d4c68 in std::_Hashtable<std::string, std::pair<std::string const, kudu::consensus::PeerMessageQueue::TrackedPeer*>, std::allocator<std::pair<std::string const, kudu::consensus::PeerMessageQueue::TrackedPeer*> >, std::__detail::_Se
    #2 0x7fde311c45bc in std::unordered_map<std::string, kudu::consensus::PeerMessageQueue::TrackedPeer*, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, kudu::consensus::PeerMessageQueue::TrackedPee
    #3 0x7fde311b55e4 in kudu::consensus::PeerMessageQueue::UntrackPeer(std::string const&) /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/consensus_queue.cc:241:23
    #4 0x7fde311a580c in kudu::consensus::Peer::Close() /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/consensus_peers.cc:446:11
    #5 0x7fde311a599a in kudu::consensus::Peer::~Peer() /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/consensus_peers.cc:450:3
    #6 0x7fde311af50d in std::_Sp_counted_ptr<kudu::consensus::Peer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() /opt/rh/devtoolset-3/root/usr/lib/gcc/x86_64-redhat-linux/4.9.2/../../../../include/c++/4.9.2/bits/shared_ptr_base.h:373:9
    #7 0x560f75 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /opt/rh/devtoolset-3/root/usr/lib/gcc/x86_64-redhat-linux/4.9.2/../../../../include/c++/4.9.2/bits/shared_ptr_base.h:149:6
    #9 0x569231 in boost::function0<void>::~function0() /data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/installed/uninstrumented/include/boost/function/function_template.hpp:763:34
    #10 0x568777 in kudu::consensus::TestPeerProxy::Respond(kudu::consensus::TestPeerProxy::Method) /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/consensus-test-util.h:157:3
    #11 0x56f740 in kudu::consensus::DelayablePeerProxy<kudu::consensus::NoOpTestPeerProxy>::RespondUnlessDelayed(kudu::consensus::TestPeerProxy::Method) /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/consensus-test-util.h:1
    #17 0x7fde2689fb3e in kudu::Thread::SuperviseThread(void*) /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/util/thread.cc:624:3

0x615000003fe8 is located 232 bytes inside of 488-byte region [0x615000003f00,0x6150000040e8)
freed by thread T0 here:
    #0 0x551210 in operator delete(void*) /data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/src/llvm-4.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cc:126
    #1 0x55c522 in kudu::consensus::ConsensusPeersTest::~ConsensusPeersTest() /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/consensus_peers-test.cc:74:7
    #2 0x55ad89 in kudu::consensus::ConsensusPeersTest_TestRemotePeer_Test::~ConsensusPeersTest_TestRemotePeer_Test() /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/consensus_peers-test.cc:174:1

previously allocated by thread T0 here:
    #0 0x5504d0 in operator new(unsigned long) /data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/src/llvm-4.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cc:82
    #1 0x55b3e8 in kudu::consensus::ConsensusPeersTest::SetUp() /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/consensus_peers-test.cc:100:26

Specifically, the test destructor was destroying the ConsensusQueue before
letting tasks drain from the "raft_pool_". One of those tasks might still have
been associated with one of the peers owned by the queue.

This just adds the appropriate raft_pool_->Wait() call in TearDown() to let the
tasks drain before destruction.
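
For illustration only, here is a minimal, self-contained C++ sketch of this
failure mode and of why draining first fixes it. Pool and Queue are hypothetical
stand-ins, not Kudu's ThreadPool or PeerMessageQueue APIs; the only detail taken
from the actual change is that the pool must be drained (Wait()) before the
queue is destroyed.

// A queued task holds a raw pointer into an object that the owner frees
// before the task runs; draining the pool first removes the race.
#include <chrono>
#include <functional>
#include <memory>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Stand-in for the message queue: owns per-peer state that tasks touch.
struct Queue {
  std::unordered_map<std::string, int> peers_;
  void UntrackPeer(const std::string& uuid) { peers_.erase(uuid); }
};

// Stand-in for raft_pool_: one thread per submitted task, joined by Wait().
struct Pool {
  std::vector<std::thread> workers_;
  void Submit(std::function<void()> task) {
    workers_.emplace_back(std::move(task));
  }
  void Wait() {
    for (auto& t : workers_) t.join();
    workers_.clear();
  }
};

int main() {
  Pool pool;
  {
    auto queue = std::make_unique<Queue>();
    queue->peers_["peer-1"] = 1;
    Queue* raw = queue.get();  // the task keeps a non-owning pointer
    pool.Submit([raw] {
      std::this_thread::sleep_for(std::chrono::milliseconds(50));
      raw->UntrackPeer("peer-1");  // use-after-free if the queue is already gone
    });
    // The fix: drain queued tasks while the queue is still alive. Without
    // this call, the task above races with the destruction of 'queue'.
    pool.Wait();
  }  // queue destroyed here
  return 0;
}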

To test I ran:

./build-support/dist_test.py loop -n 100 build/latest/bin/consensus_peers-test --gtest_filter=\*RemotePeer\* --stress-cpu-threads=8 --gtest_repeat=100

Prior to this fix, 100/100 failed[1]. After the fix, 100/100 succeeded[2].

[1] http://dist-test.cloudera.org/job?job_id=todd.1511292638.95033
[2] http://dist-test.cloudera.org/job?job_id=todd.1511292986.102150

Change-Id: I33f5e2da3d2d6c275ece20265b6a455e2c4b967d
Reviewed-on: http://gerrit.cloudera.org:8080/8625
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/f0d52ab6
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/f0d52ab6
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/f0d52ab6

Branch: refs/heads/master
Commit: f0d52ab64404156bab73eca9abcc0e982ee79eee
Parents: 034b2af
Author: Todd Lipcon <[email protected]>
Authored: Tue Nov 21 11:42:55 2017 -0800
Committer: Mike Percy <[email protected]>
Committed: Tue Nov 21 20:51:16 2017 +0000

----------------------------------------------------------------------
 src/kudu/consensus/consensus_peers-test.cc | 5 +++++
 1 file changed, 5 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/f0d52ab6/src/kudu/consensus/consensus_peers-test.cc
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/consensus_peers-test.cc b/src/kudu/consensus/consensus_peers-test.cc
index e69dc56..7f4eb57 100644
--- a/src/kudu/consensus/consensus_peers-test.cc
+++ b/src/kudu/consensus/consensus_peers-test.cc
@@ -114,6 +114,11 @@ class ConsensusPeersTest : public KuduTest {
   virtual void TearDown() OVERRIDE {
     ASSERT_OK(log_->WaitUntilAllFlushed());
     messenger_->Shutdown();
+    if (raft_pool_) {
+      // Make sure to drain any tasks from the pool we're using for our delayable
+      // proxy before destructing the queue.
+      raft_pool_->Wait();
+    }
   }
 
   DelayablePeerProxy<NoOpTestPeerProxy>* NewRemotePeer(
