Dan Burkert has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet ......................................................................
Patch Set 8: (5 comments) http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_peers.cc File src/kudu/consensus/consensus_peers.cc: PS8, Line 345: controller_.status().ok() && : (!tc_response_.has_error() || : tc_response_.error().code() == : TabletServerErrorPB::TabletServerErrorPB::ALREADY_INPROGRESS) > so here we were not notifying the queue if everything went well? was this a Yah, we weren't notifying the queue on success previously. I think this tended to not matter because we retry every 1.5 seconds, and the next retry should get ALREADY_INPROGRESS, at which point the notification would happen. I'll pull this into a named local, hopefully that will clarify. I don't think it's a good idea to have it's own method, because this is the only place where tc_response_ should be touched. We'd basically have to document the method as 'only call from ProcessTabletCopyResponse'. http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: PS8, Line 395: VLOG(3) << LogPrefixUnlocked() << "Peer " << uuid << " needs tablet copy" << THROTTLE_MSG; > how about we increase the number of seconds to something like 10? I demoted this log because we're now logging the result of every begin tablet copy request, so these aren't useful. That change was made in tablet_peers.cc. http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_queue.h File src/kudu/consensus/consensus_queue.h: PS8, Line 228: // Update the last successful communication timestamp for the given peer : // to the current time. This should be called when a non-network related : // error is received from the peer, indicating that it is alive, even if it : // may not be fully up and running or able to accept updates. > update docs Done http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS8, Line 385: shared_ptr<TabletCopyRunnable> runnable(new TabletCopyRunnable(this, req, cb)); : Status s = tablet_copy_pool_->Submit(runnable); : if (PREDICT_TRUE(s.ok())) { : return; : } > wanna add some docs regarding what you mentioned in person? i.e that we don Done PS8, Line 405: if (transition) { : cb(Status::IllegalState( : strings::Substitute("State transition of tablet $0 already in progress: $1", : tablet_id, *transition)), : TabletServerErrorPB::ALREADY_INPROGRESS); : return; : } > add docs stating that we do this to call the callback outside the lock? To the header? The lock is an internal implementation detail of TSTabletManager. -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Hao Hao <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
