[
https://issues.apache.org/jira/browse/KUDU-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997650#comment-16997650
]
ASF subversion and git services commented on KUDU-2992:
-------------------------------------------------------
Commit 23e07e3a93ad42c1e44cd077acf0ef041b78258a in kudu's branch
refs/heads/master from zhangyifan27
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=23e07e3 ]
KUDU-2992: Avoid sending duplicated requests in catalog_manager
When the master sent a large number of RPC requests to a tserver
and the tserver didn't respond immediately, the catalog manager
could continue to send duplicate requests in ProcessTabletReport(),
which caused RPC timeouts and high load on both masters and tservers.
This patch adds a tablet_id index to pending_tasks, so the catalog
manager does not send a duplicate request of the same alter type to
a tablet when handling a tablet report from a tablet server. If the
existing RPC task fails, the catalog manager creates it again the
next time the tablet server heartbeats.
This patch also fixes HandleResponse() in AsyncDeleteReplica to avoid
unnecessary retries.
Change-Id: If090723821bd78e14a3c54a35cb5e471320002e9
Reviewed-on: http://gerrit.cloudera.org:8080/14849
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <[email protected]>
> Limit concurrent alter request of a table
> -----------------------------------------
>
> Key: KUDU-2992
> URL: https://issues.apache.org/jira/browse/KUDU-2992
> Project: Kudu
> Issue Type: Improvement
> Components: master
> Affects Versions: 1.11.0
> Reporter: Yingchun Lai
> Assignee: YifanZhang
> Priority: Major
> Fix For: 1.12.0
>
>
> One of our production clusters had an incident some days ago. A user
> has a table whose partition schema looks like:
> {code:java}
> HASH (uuid) PARTITIONS 80, RANGE (date_hour) (
> PARTITION 2019102900 <= VALUES < 2019102901,
> PARTITION 2019102901 <= VALUES < 2019102902,
> PARTITION 2019102902 <= VALUES < 2019102903,
> PARTITION 2019102903 <= VALUES < 2019102904,
> PARTITION 2019102904 <= VALUES < 2019102905,
> ...)
> {code}
> He tried to remove many outdated partitions at once via SparkSQL, but it
> returned a timeout error at first; he then tried again and again, and
> SparkSQL failed each time. The cluster became unstable, with memory usage
> and CPU load increasing.
>
> I found many log like:
> {code:java}
> W1030 17:29:53.382287 7588 rpcz_store.cc:259] Trace:
> 1030 17:26:19.714799 (+ 0us) service_pool.cc:162] Inserting onto call queue
> 1030 17:26:19.714808 (+ 9us) service_pool.cc:221] Handling call
> 1030 17:29:53.382204 (+213667396us) ts_tablet_manager.cc:874] Deleting tablet c52c5f43f7884d08b07fd0005e878fed
> 1030 17:29:53.382205 (+ 1us) ts_tablet_manager.cc:794] Acquired tablet manager lock
> 1030 17:29:53.382208 (+ 3us) inbound_call.cc:162] Queueing success response
> Metrics: {"tablet-delete.queue_time_us":213667360}
> W1030 17:29:53.382300 7586 rpcz_store.cc:253] Call kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 (request call id 1820316) took 213667 ms (3.56 min). Client timeout 29999 ms (30 s)
> W1030 17:29:53.382292 10623 rpcz_store.cc:253] Call kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 (request call id 1820315) took 213667 ms (3.56 min). Client timeout 29999 ms (30 s)
> W1030 17:29:53.382297 10622 rpcz_store.cc:259] Trace:
> 1030 17:26:19.714825 (+ 0us) service_pool.cc:162] Inserting onto call queue
> 1030 17:26:19.714833 (+ 8us) service_pool.cc:221] Handling call
> 1030 17:29:53.382239 (+213667406us) ts_tablet_manager.cc:874] Deleting tablet 479f8c592f16408c830637a0129359e1
> 1030 17:29:53.382241 (+ 2us) ts_tablet_manager.cc:794] Acquired tablet manager lock
> 1030 17:29:53.382244 (+ 3us) inbound_call.cc:162] Queueing success response
> Metrics: {"tablet-delete.queue_time_us":213667378}
> ...{code}
> That means acquiring the tablet manager lock took a very long time, right?
> {code:java}
> Status TSTabletManager::BeginReplicaStateTransition(
>     const string& tablet_id,
>     const string& reason,
>     scoped_refptr<TabletReplica>* replica,
>     scoped_refptr<TransitionInProgressDeleter>* deleter,
>     TabletServerErrorPB::Code* error_code) {
>   // Acquire the lock in exclusive mode as we'll add an entry to the
>   // transition_in_progress_ map.
>   std::lock_guard<RWMutex> lock(lock_);
>   TRACE("Acquired tablet manager lock");
>   RETURN_NOT_OK(CheckRunningUnlocked(error_code));
>   ...
> }{code}
> But I think the root cause is that the Kudu master sent too many duplicate
> 'alter table/delete tablet' requests to the tserver. I found more info in
> the master's log:
> {code:java}
> $ grep "Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d" kudu_master.zjy-hadoop-prc-ct01.bj.work.log.INFO.20191030-204137.62788 | egrep "attempt = 1\)"
> I1030 20:41:42.207222 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 43 ms (attempt = 1)
> I1030 20:41:42.207556 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 40 ms (attempt = 1)
> I1030 20:41:42.260052 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 31 ms (attempt = 1)
> I1030 20:41:42.278609 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 19 ms (attempt = 1)
> I1030 20:41:42.312175 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 48 ms (attempt = 1)
> I1030 20:41:42.318933 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 62 ms (attempt = 1)
> I1030 20:41:42.340060 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 30 ms (attempt = 1)
> ...{code}
> That means the master received too many duplicate 'delete tablet' requests
> from the client, and then dispatched these requests to tservers.
> I think we should limit the concurrent alter requests of a table: when an
> alter request is ongoing and hasn't finished, subsequent requests should be
> rejected. Or we should limit the size of TableInfo::pending_tasks_.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)