This is an automated email from the ASF dual-hosted git repository.

adar pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit 28c706722891d20aada5d8bee4cfafe456c89561
Author: Will Berkeley <[email protected]>
AuthorDate: Fri Mar 15 14:38:38 2019 -0700

    KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

    The initialization of the master works as follows:

    1. Register RPC services.
    2. Init catalog manager asynchronously.

    As a result, if a master in a multimaster cluster with a healthy leader
    starts, there is a brief period of time when a call to UpdateConsensus
    from the leader master will hit a CatalogManager and SysTable that are
    not initialized. The initializing master will respond TABLET_NOT_FOUND
    to the leader, which will cause the leader master to initiate the tablet
    copy process. This is a dead end because masters don't support tablet
    copy. Things are stuck until there is a leadership change or the
    "orphaned" master is restarted again.

    Tablets on tablet servers are not vulnerable to this because their
    startup order is

    1. Init the ts tablet manager synchronously.
    2. Register RPC services.

    So it is not possible for an UpdateConsensus call to query a ts tablet
    manager that hasn't loaded all of the initial tablets.

    The fix is pretty simple: recognize and return the StatusUnavailable
    returned by the tablet lookup for the master tablet, instead of
    TABLET_NOT_FOUND. This will cause the leader master to retry until the
    initializing master has finished initializing.

    This was the cause of flakiness in KUDU-2734. Without the fix, about 8%
    of runs fail on TSAN with 8 stress threads. With the fix, about 0.3% do
    (and in 2000 runs with 6 failures I verified that none of the 6 were due
    to this issue).

    Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
    Reviewed-on: http://gerrit.cloudera.org:8080/12770
    Tested-by: Kudu Jenkins
    Reviewed-by: Adar Dembo <[email protected]>
    Reviewed-by: Grant Henke <[email protected]>
    Reviewed-by: Alexey Serbin <[email protected]>
---
 src/kudu/tserver/tablet_service.cc | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/kudu/tserver/tablet_service.cc b/src/kudu/tserver/tablet_service.cc
index 5d89ea1..010b35b 100644
--- a/src/kudu/tserver/tablet_service.cc
+++ b/src/kudu/tserver/tablet_service.cc
@@ -230,8 +230,15 @@ bool LookupTabletReplicaOrRespond(TabletReplicaLookupIf* tablet_manager,
                                   scoped_refptr<TabletReplica>* replica) {
   Status s = tablet_manager->GetTabletReplica(tablet_id, replica);
   if (PREDICT_FALSE(!s.ok())) {
-    SetupErrorAndRespond(resp->mutable_error(), s,
-                         TabletServerErrorPB::TABLET_NOT_FOUND, context);
+    if (s.IsServiceUnavailable()) {
+      // If the tablet manager isn't initialized, the remote should check again
+      // soon.
+      SetupErrorAndRespond(resp->mutable_error(), s,
+                           TabletServerErrorPB::UNKNOWN_ERROR, context);
+    } else {
+      SetupErrorAndRespond(resp->mutable_error(), s,
+                           TabletServerErrorPB::TABLET_NOT_FOUND, context);
+    }
     return false;
   }
   return true;
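
For readers who want the behavior change in isolation, the following is a
minimal standalone sketch of the error mapping described in the commit
message. It is not Kudu code: LookupStatus, ErrorCode, LeaderAction,
MapLookupFailure and ReactTo are hypothetical names invented for
illustration, and the leader-side reaction is an assumption that models
only what the commit message states (TABLET_NOT_FOUND leads to a tablet
copy attempt, any other error leads to a retry).

```cpp
// Hypothetical stand-ins for the lookup result, the error code sent back to
// the leader, and the leader's reaction. None of these are Kudu types.
enum class LookupStatus { kOk, kNotFound, kServiceUnavailable };
enum class ErrorCode { kNone, kUnknownError, kTabletNotFound };
enum class LeaderAction { kProceed, kRetryLater, kStartTabletCopy };

// Mirrors the branch added in the diff: a manager that is not yet
// initialized is reported as a generic error, not as a missing tablet.
constexpr ErrorCode MapLookupFailure(LookupStatus s) {
  return s == LookupStatus::kOk                   ? ErrorCode::kNone
         : s == LookupStatus::kServiceUnavailable ? ErrorCode::kUnknownError
                                                  : ErrorCode::kTabletNotFound;
}

// Assumed leader-side reaction, simplified from the commit message: only
// TABLET_NOT_FOUND triggers a tablet copy; other errors are retried.
constexpr LeaderAction ReactTo(ErrorCode code) {
  return code == ErrorCode::kNone             ? LeaderAction::kProceed
         : code == ErrorCode::kTabletNotFound ? LeaderAction::kStartTabletCopy
                                              : LeaderAction::kRetryLater;
}

// With the mapping above, an initializing follower causes a retry instead
// of the dead-end tablet copy.
static_assert(ReactTo(MapLookupFailure(LookupStatus::kServiceUnavailable)) ==
                  LeaderAction::kRetryLater,
              "an uninitialized follower should be retried");
```

Before the change, the model would have mapped every failure to
kTabletNotFound, so the initializing master would have driven the leader
into kStartTabletCopy.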
