Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................

Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6925/2/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS2, Line 395: // The thread pool is at capacity. Check if the tablet is already in
             : // transition (i.e. being copied).
             : boost::optional<string> transition;
             : {
             :   std::lock_guard<rw_spinlock> lock(lock_);
             :   auto* t = FindOrNull(transition_in_progress_, tablet_id);
             :   if (t) {
             :     transition = *t;
             :   }
             : }
> +1

The 'happy path' in this case is that the thread pool is not oversubscribed. In that case the tablet copy immediately gets a thread, and as part of initializing, it already checks that there isn't a copy in progress. So, if we put the check up front, it would actually happen twice for the fast path.


PS2, Line 406: cb(Status::IllegalState(
             :     strings::Substitute("State transition of tablet $0 already in progress: $1",
             :                         tablet_id, *transition)),
             :    TabletServerErrorPB::ALREADY_INPROGRESS);
> It should get logged by the leader making the remote call.

Yes, I've beefed up the logging of these errors on the leader. Going to do another cluster test to make sure it's not overwhelming.


--
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert
Gerrit-Reviewer: Adar Dembo
Gerrit-Reviewer: Dan Burkert
Gerrit-Reviewer: David Ribeiro Alves
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy
Gerrit-Reviewer: Todd Lipcon
Gerrit-HasComments: Yes
