Adar Dembo has posted comments on this change.

Change subject: WIP: [catalog manager] fixed deadlock on catalog shutdown
......................................................................


Patch Set 2:

(5 comments)

After thinking about this some more, I think this approach is too brittle. My
concern is that by allowing leader election callbacks to run concurrently with
tablet peer shutdown, any change to how TabletPeer::Shutdown() cleans up state
can lead to invalid memory accesses in the leader election callback.

For example, suppose someone changes TabletPeer::Shutdown() to destroy its
transaction tracker. If we're lucky, we'll see a SIGSEGV in a unit test when
the leader election callback accesses the transaction tracker. If we're not,
we'll get flaky tests with hard-to-debug symptoms.

So, I'm more inclined to keep the shutdown order as-is and explore shutting
down just the consensus state machine early, in order to abort all outstanding
transactions. Alternatively, we could explore moving more object cleanup out
of TabletPeer::Shutdown() and into ~TabletPeer() (right now Shutdown() also
destroys the Consensus and Tablet objects).
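
To make the second alternative concrete, here is a minimal sketch of what I
have in mind. It is not the real TabletPeer: Peer, TxnTracker and OnElection
are hypothetical stand-ins, and the only point is that Shutdown() stops
activity without freeing members, so a late callback holding a reference to
the peer still sees valid memory until the destructor runs.

```cpp
#include <atomic>
#include <memory>
#include <string>

// Hypothetical stand-in for the transaction tracker (not Kudu's class).
struct TxnTracker {
  int pending = 3;
  void AbortAll() { pending = 0; }
};

// Hypothetical stand-in for TabletPeer.
class Peer {
 public:
  Peer() : tracker_(new TxnTracker) {}

  // Shutdown() only stops activity and aborts outstanding work.
  // It deliberately does NOT free tracker_.
  void Shutdown() {
    stopped_.store(true);
    tracker_->AbortAll();
  }

  bool stopped() const { return stopped_.load(); }

  // Safe to call from a late callback: tracker_ lives until ~Peer().
  int PendingTxns() const { return tracker_->pending; }

 private:
  std::atomic<bool> stopped_{false};
  std::unique_ptr<TxnTracker> tracker_;  // released only by the destructor
};

// Hypothetical election callback that may fire after Shutdown().
std::string OnElection(const std::shared_ptr<Peer>& peer) {
  if (peer->stopped()) {
    return "ignored: peer shut down";
  }
  return "pending=" + std::to_string(peer->PendingTxns());
}
```

With this shape, a callback that races with Shutdown() can at worst observe
a stopped peer. It cannot touch freed state as long as it holds a reference.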

http://gerrit.cloudera.org:8080/#/c/6134/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 722:   ConsensusStatePB cstate = consensus->ConsensusState(CONSENSUS_CONFIG_COMMITTED);
Need to null check here.
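
Roughly the kind of guard I mean, as a self-contained sketch. The types
below are simplified, hypothetical stand-ins (the real ConsensusState()
takes a config type argument and the real Status class is richer), and
GetCommittedState is a name I made up for illustration.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins for the real types.
struct ConsensusStatePB {
  std::string leader_uuid;
};

struct Consensus {
  ConsensusStatePB ConsensusState() const { return {"master-1"}; }
};

struct Status {
  bool ok;
  std::string msg;
  static Status OK() { return {true, ""}; }
  static Status IllegalState(std::string m) { return {false, std::move(m)}; }
};

// Returns an error instead of dereferencing a null consensus pointer,
// which is assumed to be possible once the peer has been shut down.
Status GetCommittedState(const std::shared_ptr<Consensus>& consensus,
                         ConsensusStatePB* out) {
  if (!consensus) {
    return Status::IllegalState("consensus is not available");
  }
  *out = consensus->ConsensusState();
  return Status::OK();
}
```

Whether an error or a silent no-op is the right response to a null pointer
depends on the caller, so treat the IllegalState return as one option.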


Line 996:   return sys_catalog_->tablet_peer_->shared_consensus()->role();
Do we need to null check here? When is Role() invoked?


Line 1027:   // Shut down the underlying storage for tables and tablets. This aborts
There should be a larger explanation here justifying why we're allowing the 
table visitors to run rampant on potentially half-uninitialized memory.


Line 3350:       // Nevertheless, returning OK here.
How will CheckIfFatalLeaderError() work on this function if we return OK?


Line 3981:   ConsensusStatePB cstate = consensus->ConsensusState(CONSENSUS_CONFIG_COMMITTED);
Do we need to null check here?


-- 
To view, visit http://gerrit.cloudera.org:8080/6134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I10ad66fe33d4696adf2a02a09e2790afa8869583
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
