Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
......................................................................


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS8, Line 345: controller_.status().ok() &&
             :       (!tc_response_.has_error() ||
             :        tc_response_.error().code() ==
             :         
TabletServerErrorPB::TabletServerErrorPB::ALREADY_INPROGRESS)
> so here we were not notifying the queue if everything went well? was this a
Yah, we weren't notifying the queue on success previously.  I think this tended 
to not matter because we retry every 1.5 seconds, and the next retry should get 
ALREADY_INPROGRESS, at which point the notification would happen.

I'll pull this into a named local, hopefully that will clarify.  I don't think 
it's a good idea to have it's own method, because this is the only place where 
tc_response_ should be touched.  We'd basically have to document the method as 
'only call from ProcessTabletCopyResponse'.


http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS8, Line 395: VLOG(3) << LogPrefixUnlocked() << "Peer " << uuid << " needs 
tablet copy" << THROTTLE_MSG;
> how about we increase the number of seconds to something like 10?
I demoted this log because we're now logging the result of every begin tablet 
copy request, so these aren't useful.  That change was made in tablet_peers.cc.


http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS8, Line 228: // Update the last successful communication timestamp for the 
given peer
             :   // to the current time. This should be called when a 
non-network related
             :   // error is received from the peer, indicating that it is 
alive, even if it
             :   // may not be fully up and running or able to accept updates.
> update docs
Done


http://gerrit.cloudera.org:8080/#/c/6925/8/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS8, Line 385: shared_ptr<TabletCopyRunnable> runnable(new 
TabletCopyRunnable(this, req, cb));
             :   Status s = tablet_copy_pool_->Submit(runnable);
             :   if (PREDICT_TRUE(s.ok())) {
             :     return;
             :   }
> wanna add some docs regarding what you mentioned in person? i.e that we don
Done


PS8, Line 405:   if (transition) {
             :     cb(Status::IllegalState(
             :           strings::Substitute("State transition of tablet $0 
already in progress: $1",
             :                               tablet_id, *transition)),
             :           TabletServerErrorPB::ALREADY_INPROGRESS);
             :     return;
             :   }
> add docs stating that we do this to call the callback outside the lock?
To the header?  The lock is an internal implementation detail of 
TSTabletManager.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Hao Hao <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to