David Ribeiro Alves has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet ......................................................................
Patch Set 8: (5 comments) http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_peers.cc File src/kudu/consensus/consensus_peers.cc: PS8, Line 345: controller_.status().ok() && : (!tc_response_.has_error() || : tc_response_.error().code() == : TabletServerErrorPB::TabletServerErrorPB::ALREADY_INPROGRESS) so here we were not notifying the queue if everything went well? was this a bug or did we have other means to notify the queue (like with heartbeats) also could pull this to another method to make it more tracktable? http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: PS8, Line 395: VLOG(3) << LogPrefixUnlocked() << "Peer " << uuid << " needs tablet copy" << THROTTLE_MSG; how about we increase the number of seconds to something like 10? http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_queue.h File src/kudu/consensus/consensus_queue.h: PS8, Line 228: // Update the last successful communication timestamp for the given peer : // to the current time. This should be called when a non-network related : // error is received from the peer, indicating that it is alive, even if it : // may not be fully up and running or able to accept updates. update docs http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS8, Line 385: shared_ptr<TabletCopyRunnable> runnable(new TabletCopyRunnable(this, req, cb)); : Status s = tablet_copy_pool_->Submit(runnable); : if (PREDICT_TRUE(s.ok())) { : return; : } wanna add some docs regarding what you mentioned in person? i.e that we don't need check if the tablet is already in transition because that is already handled somewhere? PS8, Line 405: if (transition) { : cb(Status::IllegalState( : strings::Substitute("State transition of tablet $0 already in progress: $1", : tablet_id, *transition)), : TabletServerErrorPB::ALREADY_INPROGRESS); : return; : } add docs stating that we do this to call the callback outside the lock? -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Hao Hao <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
