[
https://issues.apache.org/jira/browse/KUDU-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
YifanZhang reassigned KUDU-2992:
--------------------------------
Assignee: YifanZhang
> Limit concurrent alter request of a table
> -----------------------------------------
>
> Key: KUDU-2992
> URL: https://issues.apache.org/jira/browse/KUDU-2992
> Project: Kudu
> Issue Type: Improvement
> Components: master
> Reporter: Yingchun Lai
> Assignee: YifanZhang
> Priority: Major
>
> One of our production clusters ran into an incident a few days ago. One of our
> users has a table whose partition schema looks like:
> {code:java}
> HASH (uuid) PARTITIONS 80, RANGE (date_hour) (
> PARTITION 2019102900 <= VALUES < 2019102901,
> PARTITION 2019102901 <= VALUES < 2019102902,
> PARTITION 2019102902 <= VALUES < 2019102903,
> PARTITION 2019102903 <= VALUES < 2019102904,
> PARTITION 2019102904 <= VALUES < 2019102905,
> ...)
> {code}
> He tried to remove many outdated partitions at once through SparkSQL, but it
> returned a timeout error at first. He retried again and again, and SparkSQL
> kept failing. The cluster then became unstable, with memory usage and CPU load
> increasing.
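> For reference, that SparkSQL statement boils down to a single ALTER TABLE that
> drops many range partitions at once. A rough equivalent through the Kudu C++
> client is sketched below; the master address, table name, INT32 type of
> date_hour, the dropped hour range, and the 30-second timeout are illustrative
> assumptions, not values taken from the real cluster:
> {code:java}
> // Illustrative sketch of dropping a day's worth of hourly range partitions
> // in one alter request. All literals here (master address, table name,
> // column type, bounds, timeout) are assumptions for the example.
> #include <cstdlib>
> #include <iostream>
> #include <memory>
> 
> #include <kudu/client/client.h>
> #include <kudu/common/partial_row.h>
> #include <kudu/util/monotime.h>
> 
> using kudu::client::KuduClient;
> using kudu::client::KuduClientBuilder;
> using kudu::client::KuduTable;
> using kudu::client::KuduTableAlterer;
> 
> static void CheckOk(const kudu::Status& s) {
>   if (!s.ok()) {
>     std::cerr << s.ToString() << std::endl;
>     std::exit(1);
>   }
> }
> 
> int main() {
>   kudu::client::sp::shared_ptr<KuduClient> client;
>   CheckOk(KuduClientBuilder()
>               .add_master_server_addr("master-1:7051")
>               .Build(&client));
> 
>   kudu::client::sp::shared_ptr<KuduTable> table;
>   CheckOk(client->OpenTable("metrics_by_hour", &table));
> 
>   // One alter request that drops 24 hourly ranges; with HASH PARTITIONS 80
>   // each dropped range fans out to 80 tablets the master must delete.
>   std::unique_ptr<KuduTableAlterer> alterer(
>       client->NewTableAlterer(table->name()));
>   for (int hour = 2019102900; hour <= 2019102923; ++hour) {
>     kudu::KuduPartialRow* lower = table->schema().NewRow();
>     kudu::KuduPartialRow* upper = table->schema().NewRow();
>     CheckOk(lower->SetInt32("date_hour", hour));
>     CheckOk(upper->SetInt32("date_hour", hour + 1));
>     alterer->DropRangePartition(lower, upper);  // alterer owns the rows now
>   }
>   alterer->timeout(kudu::MonoDelta::FromSeconds(30));
> 
>   // If this call times out and the statement is simply rerun, the same set
>   // of drops (and the resulting DeleteTablet work) is submitted again.
>   CheckOk(alterer->Alter());
>   return 0;
> }
> {code}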
>
> I found many log entries like this in the tserver logs:
> {code:java}
> W1030 17:29:53.382287 7588 rpcz_store.cc:259] Trace:
> 1030 17:26:19.714799 (+ 0us) service_pool.cc:162] Inserting onto call queue
> 1030 17:26:19.714808 (+ 9us) service_pool.cc:221] Handling call
> 1030 17:29:53.382204 (+213667396us) ts_tablet_manager.cc:874] Deleting tablet c52c5f43f7884d08b07fd0005e878fed
> 1030 17:29:53.382205 (+ 1us) ts_tablet_manager.cc:794] Acquired tablet manager lock
> 1030 17:29:53.382208 (+ 3us) inbound_call.cc:162] Queueing success response
> Metrics: {"tablet-delete.queue_time_us":213667360}
> W1030 17:29:53.382300 7586 rpcz_store.cc:253] Call kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 (request call id 1820316) took 213667 ms (3.56 min). Client timeout 29999 ms (30 s)
> W1030 17:29:53.382292 10623 rpcz_store.cc:253] Call kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 (request call id 1820315) took 213667 ms (3.56 min). Client timeout 29999 ms (30 s)
> W1030 17:29:53.382297 10622 rpcz_store.cc:259] Trace:
> 1030 17:26:19.714825 (+ 0us) service_pool.cc:162] Inserting onto call queue
> 1030 17:26:19.714833 (+ 8us) service_pool.cc:221] Handling call
> 1030 17:29:53.382239 (+213667406us) ts_tablet_manager.cc:874] Deleting tablet 479f8c592f16408c830637a0129359e1
> 1030 17:29:53.382241 (+ 2us) ts_tablet_manager.cc:794] Acquired tablet manager lock
> 1030 17:29:53.382244 (+ 3us) inbound_call.cc:162] Queueing success response
> Metrics: {"tablet-delete.queue_time_us":213667378}
> ...{code}
> That means acquiring the tablet manager lock is where the time went, right?
> {code:java}
> Status TSTabletManager::BeginReplicaStateTransition(
>     const string& tablet_id,
>     const string& reason,
>     scoped_refptr<TabletReplica>* replica,
>     scoped_refptr<TransitionInProgressDeleter>* deleter,
>     TabletServerErrorPB::Code* error_code) {
>   // Acquire the lock in exclusive mode as we'll add a entry to the
>   // transition_in_progress_ map.
>   std::lock_guard<RWMutex> lock(lock_);
>   TRACE("Acquired tablet manager lock");
>   RETURN_NOT_OK(CheckRunningUnlocked(error_code));
>   ...
> }{code}
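> For a rough sense of scale (the numbers here are illustrative, not measured):
> with 80 hash buckets per range, dropping 24 hourly ranges in one alter means
> roughly 80 * 24 = 1920 tablets to delete, and with 3 replicas that is several
> thousand DeleteTablet RPCs fanned out to the tservers. If each deletion takes
> even ~100 ms on a tserver and they are processed with limited concurrency, the
> requests at the back of the queue wait for minutes, which is consistent with
> the ~213 s (3.56 min) queue times above and far beyond the 30 s client
> timeout, so the caller keeps retrying and re-submitting the same work.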
> But I think the root cause is that the Kudu master sends too many duplicate
> 'alter table/delete tablet' requests to the tservers. I found more info in the
> master's log:
> {code:java}
> $ grep "Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d" kudu_master.zjy-hadoop-prc-ct01.bj.work.log.INFO.20191030-204137.62788 | egrep "attempt = 1\)"
> I1030 20:41:42.207222 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 43 ms (attempt = 1)
> I1030 20:41:42.207556 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 40 ms (attempt = 1)
> I1030 20:41:42.260052 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 31 ms (attempt = 1)
> I1030 20:41:42.278609 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 19 ms (attempt = 1)
> I1030 20:41:42.312175 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 48 ms (attempt = 1)
> I1030 20:41:42.318933 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 62 ms (attempt = 1)
> I1030 20:41:42.340060 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 30 ms (attempt = 1)
> ...{code}
> That means the master received too many duplicate 'delete tablet' requests
> from the client, and then dispatched these requests to the tservers.
> I think we should limit the concurrent alter requests on a table: while an
> alter request is still in progress, subsequent requests should be rejected.
> Alternatively, we could limit the size of TableInfo::pending_tasks_. A minimal
> sketch of this idea follows.
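> The sketch below is self-contained and only illustrates the idea; it is not
> the real CatalogManager code. The class, method, and limit names are invented
> for illustration, and an actual fix would hook into TableInfo's existing task
> accounting instead:
> {code:java}
> // Illustrative sketch only: a per-table guard that rejects new alter
> // requests while earlier ones still have outstanding tablet-level tasks.
> #include <mutex>
> #include <string>
> #include <unordered_map>
> 
> class PendingAlterLimiter {
>  public:
>   explicit PendingAlterLimiter(int max_pending_tasks_per_table)
>       : max_pending_(max_pending_tasks_per_table) {}
> 
>   // Called when the master is about to accept an alter request that would
>   // create 'num_new_tasks' tablet-level tasks (e.g. DeleteTablet RPCs).
>   // Returns false if the table already has outstanding tasks and accepting
>   // this request would exceed the cap; the request should then be rejected
>   // (e.g. with ServiceUnavailable) instead of queued as another duplicate.
>   // A cap of 0 means "only one alter in flight per table at a time".
>   bool TryStartAlter(const std::string& table_id, int num_new_tasks) {
>     std::lock_guard<std::mutex> l(lock_);
>     int& pending = pending_tasks_[table_id];
>     if (pending > 0 && pending + num_new_tasks > max_pending_) {
>       return false;
>     }
>     pending += num_new_tasks;
>     return true;
>   }
> 
>   // Called whenever a tablet-level task completes, successfully or not.
>   void TaskFinished(const std::string& table_id) {
>     std::lock_guard<std::mutex> l(lock_);
>     auto it = pending_tasks_.find(table_id);
>     if (it != pending_tasks_.end() && --it->second <= 0) {
>       pending_tasks_.erase(it);
>     }
>   }
> 
>  private:
>   const int max_pending_;
>   std::mutex lock_;
>   std::unordered_map<std::string, int> pending_tasks_;  // table id -> count
> };
> {code}
> With a guard like this, rerunning the same 'drop partitions' statement while
> the first one is still being applied would fail fast with an error the client
> can surface, instead of piling thousands of duplicate DeleteTablet tasks into
> TableInfo::pending_tasks_.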
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)