[ 
https://issues.apache.org/jira/browse/KUDU-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110649#comment-17110649
 ] 

Alexey Serbin commented on KUDU-2453:
-------------------------------------

There is a reproduction scenario for the issue described in this JIRA: 
https://gerrit.cloudera.org/#/c/15912/

> kudu should stop creating tablet infinitely
> -------------------------------------------
>
>                 Key: KUDU-2453
>                 URL: https://issues.apache.org/jira/browse/KUDU-2453
>             Project: Kudu
>          Issue Type: Bug
>          Components: master, tserver
>    Affects Versions: 1.4.0, 1.7.2
>            Reporter: LiFu He
>            Priority: Major
>
> I have met this problem again on 2018/10/26. And now the kudu version is 
> 1.7.2.
> -----------------------------------------------------
> We modified the flag 'max_create_tablets_per_ts' (2000) of master.conf, and 
> there are some load on the kudu cluster. Then someone else created a big 
> table which had tens of thousands of tablets from impala-shell (that was a 
> mistake). 
> {code:java}
> CREATE TABLE XXX(
> ...
>    PRIMARY KEY (...)
> )
> PARTITION BY HASH (...) PARTITIONS 100,
> RANGE (...)
> (
>   PARTITION "2018-10-24" <= VALUES < "2018-10-24\000",
>   PARTITION "2018-10-25" <= VALUES < "2018-10-25\000",
>   ...
>   PARTITION "2018-12-07" <= VALUES < "2018-12-07\000"
> )
> STORED AS KUDU
> TBLPROPERTIES ('kudu.master_addresses'= '...');
> {code}
> Here are the logs after creating table (only pick one tablet as example):
> {code:java}
> ------------------------------------------------------Kudu-master log
> ======e884bda6bbd3482f94c07ca0f34f99a4======
> W1024 11:40:51.914397 180146 catalog_manager.cc:2664] TS 
> 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): Create Tablet RPC 
> failed for tablet e884bda6bbd3482f94c07ca0f34f99a4: Remote error: Service 
> unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService 
> from 10.120.219.118:50247 dropped due to backpressure. The service queue is 
> full; it has 512 items.
> I1024 11:40:51.914412 180146 catalog_manager.cc:2700] Scheduling retry of 
> CreateTablet RPC for tablet e884bda6bbd3482f94c07ca0f34f99a4 on TS 
> 39f15fcf42ef45bba0c95a3223dc25ee with a delay of 42 ms (attempt = 1)
> ...
> ======Be replaced by 0b144c00f35d48cca4d4981698faef72======
> W1024 11:41:22.114512 180202 catalog_manager.cc:3949] T 
> 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Tablet 
> e884bda6bbd3482f94c07ca0f34f99a4 (table quasi_realtime_user_feature 
> [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed 
> timeout. Replacing with a new tablet 0b144c00f35d48cca4d4981698faef72
> ...
> I1024 11:41:22.391916 180202 catalog_manager.cc:3806] T 
> 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Sending 
> DeleteTablet for 3 replicas of tablet e884bda6bbd3482f94c07ca0f34f99a4
> ...
> I1024 11:41:22.391927 180202 catalog_manager.cc:2922] Sending 
> DeleteTablet(TABLET_DATA_DELETED) for tablet e884bda6bbd3482f94c07ca0f34f99a4 
> on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by 
> 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST)
> ...
> W1024 11:41:22.428129 180146 catalog_manager.cc:2892] TS 
> 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for 
> tablet e884bda6bbd3482f94c07ca0f34f99a4 with error code TABLET_NOT_RUNNING: 
> Already present: State transition of tablet e884bda6bbd3482f94c07ca0f34f99a4 
> already in progress: creating tablet
> ...
> I1024 11:41:22.428143 180146 catalog_manager.cc:2700] Scheduling retry of 
> e884bda6bbd3482f94c07ca0f34f99a4 Delete Tablet RPC for 
> TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 35 ms (attempt = 1)
> ...
> W1024 11:41:22.683702 180145 catalog_manager.cc:2664] TS 
> b251540e606b4863bb576091ff961892 (kudu1.lt.163.org:7050): Create Tablet RPC 
> failed for tablet 0b144c00f35d48cca4d4981698faef72: Remote error: Service 
> unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService 
> from 10.120.219.118:59735 dropped due to backpressure. The service queue is 
> full; it has 512 items.
> I1024 11:41:22.683717 180145 catalog_manager.cc:2700] Scheduling retry of 
> CreateTablet RPC for tablet 0b144c00f35d48cca4d4981698faef72 on TS 
> b251540e606b4863bb576091ff961892 with a delay of 46 ms (attempt = 1)
> ...
> ======Be replaced by c0e0acc448fc42fc9e48f5025b112a75======
> W1024 11:41:52.775420 180202 catalog_manager.cc:3949] T 
> 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Tablet 
> 0b144c00f35d48cca4d4981698faef72 (table quasi_realtime_user_feature 
> [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed 
> timeout. Replacing with a new tablet c0e0acc448fc42fc9e48f5025b112a75
> ...
> ------------------------------------------------------Kudu-tserver log
> I1024 11:40:52.014571 137358 tablet_service.cc:758] Processing CreateTablet 
> for tablet e884bda6bbd3482f94c07ca0f34f99a4 
> (table=quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]), 
> partition=HASH (user_id) PARTITION 29, RANGE (dt) PARTITION "2018-11-10" <= 
> VALUES < "2018-11-10\000"
> ...
> I1024 11:40:52.017539 137358 ts_tablet_manager.cc:1080] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: 
> Registered tablet (data state: TABLET_DATA_READY)
> ...
> I1024 11:41:22.392292 137355 tablet_service.cc:799] Processing DeleteTablet 
> for tablet e884bda6bbd3482f94c07ca0f34f99a4 with delete_type 
> TABLET_DATA_DELETED (Replaced by 0b144c00f35d48cca4d4981698faef72 at 
> 2018-10-24 11:41:22 CST) from {username='kudu'} at 10.120.219.118:50247
> ...
> I1024 12:03:31.079942 126376 ts_tablet_manager.cc:938] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: 
> Bootstrapping tablet
> I1024 12:03:31.079965 126376 tablet_bootstrap.cc:436] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: 
> Bootstrap starting.
> ...
> I1024 12:03:31.080237 126376 tablet_bootstrap.cc:581] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: No 
> blocks or log segments found. Creating new log.
> ...
> I1024 12:03:31.080834 126376 tablet_bootstrap.cc:436] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: No 
> bootstrap required, opened a new log
> I1024 12:03:31.080870 126376 ts_tablet_manager.cc:955] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Time 
> spent bootstrapping tablet: real 0.001s user 0.004s sys 0.000s
> I1024 12:03:31.080987 137368 tablet_service.cc:799] Processing DeleteTablet 
> for tablet 0ba3a2963c524d859a85ad5fbce5bf96 with delete_type 
> TABLET_DATA_DELETED (Replaced by 539e0ab40ec1442a94b330c18521d27f at 
> 2018-10-24 11:42:23 CST) from {username='kudu'} at 10.120.219.118:50247
> I1024 12:03:31.080996 126376 raft_consensus.cc:304] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 
> FOLLOWER]: Replica starting. Triggering 0 pending transactions. Active 
> config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: 
> "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: 
> "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: 
> "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: 
> "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: 
> "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: 
> "kudu5.lt.163.org" port: 7050 } }
> I1024 12:03:31.081007 126376 raft_consensus.cc:330] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 
> FOLLOWER]: Consensus starting up: Expiring failure detector timer to make a 
> prompt election more likely
> I1024 12:03:31.081014 126376 raft_consensus.cc:605] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 
> FOLLOWER]: Becoming Follower/Learner. State: Replica: 
> 39f15fcf42ef45bba0c95a3223dc25ee, State: Initialized, Role: FOLLOWER
> I1024 12:03:31.081053 126376 consensus_queue.cc:226] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee 
> [NON_LEADER]: Queue going to NON_LEADER mode. State: All replicated index: 0, 
> Majority replicated index: 0, Committed index: 0, Last appended: 0.0, Last 
> appended by leader: 0, Current term: 0, Majority size: -1, State: 0, Mode: 
> NON_LEADER, active raft config: opid_index: -1 OBSOLETE_local: false peers { 
> permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER 
> last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { 
> permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER 
> last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { 
> permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER 
> last_known_addr { host: "kudu5.lt.163.org" port: 7050 } }
> ...
> I1024 12:03:35.315024 161474 raft_consensus.cc:436] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 
> FOLLOWER]: Starting pre-election (no leader contacted us within the election 
> timeout)
> I1024 12:03:35.315050 161474 raft_consensus.cc:2700] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Snoozing 
> failure detection for 2.642s (starting election)
> I1024 12:03:35.315111 161474 raft_consensus.cc:458] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 
> FOLLOWER]: Starting pre-election with config: opid_index: -1 OBSOLETE_local: 
> false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: 
> VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { 
> permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER 
> last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { 
> permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER 
> last_known_addr { host: "kudu5.lt.163.org" port: 7050 } }
> I1024 12:03:35.315141 137342 tablet_service.cc:799] Processing DeleteTablet 
> for tablet c3f47046905e4c0f9ffe21a9e7a5bc8b with delete_type 
> TABLET_DATA_DELETED (Replaced by 28f11e17ecdc45d2a4d72c74afc2a488 at 
> 2018-10-24 11:45:27 CST) from {username='kudu'} at 10.120.219.118:50247
> I1024 12:03:35.315763 161474 leader_election.cc:231] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee 
> [CANDIDATE]: Term 1 pre-election: Requesting pre-vote from peer 
> cd75ddaabaa8487b9fe1feab9d2cba83
> I1024 12:03:35.315790 161474 leader_election.cc:231] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee 
> [CANDIDATE]: Term 1 pre-election: Requesting pre-vote from peer 
> 061221dd1c3e40a3a3338afc74bb66f5
> W1024 12:03:35.315987 137266 leader_election.cc:293] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee 
> [CANDIDATE]: Term 1 pre-election: Tablet error from VoteRequest() call to 
> peer cd75ddaabaa8487b9fe1feab9d2cba83: Illegal state: must be running to vote 
> when last-logged opid is not known
> W1024 12:03:35.316056 137267 leader_election.cc:293] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee 
> [CANDIDATE]: Term 1 pre-election: Tablet error from VoteRequest() call to 
> peer 061221dd1c3e40a3a3338afc74bb66f5: Illegal state: must be running to vote 
> when last-logged opid is not known
> I1024 12:03:35.316074 137267 leader_election.cc:258] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee 
> [CANDIDATE]: Term 1 pre-election: Election decided. Result: candidate lost.
> I1024 12:03:35.316102 161474 raft_consensus.cc:2700] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Snoozing 
> failure detection for 4.310s (election complete)
> I1024 12:03:35.316115 161474 raft_consensus.cc:2455] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 
> FOLLOWER]: Leader pre-election lost for term 1. Reason: could not achieve 
> majority
> ...
> I1024 12:04:01.303092 126376 ts_tablet_manager.cc:983] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Time 
> spent starting tablet: real 30.222s user 0.000s sys 0.000s
> W1024 12:04:01.303108 126376 ts_tablet_manager.cc:1006] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Tablet 
> startup took 30222ms
> ...
> W1024 12:04:01.303225 126376 ts_tablet_manager.cc:1008] T 
> e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Trace:
> 1024 11:40:51.956652 (+ 0us) service_pool.cc:163] Inserting onto call queue
> 1024 11:40:52.014539 (+ 57887us) service_pool.cc:222] Handling call
> 1024 11:40:52.014585 (+ 46us) ts_tablet_manager.cc:357] Acquired tablet 
> manager lock
> 1024 11:40:52.014588 (+ 3us) ts_tablet_manager.cc:370] Creating new 
> metadata...
> 1024 11:40:52.016532 (+ 1944us) tablet_metadata.cc:556] Metadata flushed
> 1024 11:40:52.017445 (+ 913us) tablet_replica.cc:143] Creating consensus 
> instance
> 1024 11:40:52.017601 (+ 156us) inbound_call.cc:157] Queueing success response
> 1024 12:03:31.079950 (+1359062349us) ts_tablet_manager.cc:939] Bootstrapping 
> tablet
> 1024 12:03:31.080879 (+ 929us) ts_tablet_manager.cc:984] Starting tablet 
> replica
> 1024 12:03:31.080893 (+ 14us) tablet_replica.cc:195] Starting instrumentation
> 1024 12:03:31.080900 (+ 7us) tablet_replica.cc:207] Starting consensus
> Metrics: 
> {"fdatasync":1,"fdatasync_us":238,"mutex_wait_us":30221902,"raft.queue_time_us":6,"raft.run_cpu_time_us":4,"raft.run_wall_time_us":5,"tablet-open.queue_time_us":1359062354}
> {code}
> {color:#ff0000}Tablet startup took 30+ seconds.{color}
> It was a long time for him to wait, so he did "ctrl+c". But we found that the 
> tablets in 'INITIALIZED' status was growing rapidly, -half- an hour later it 
> was 350,000:(
> {color:#ff0000}100(hash) * 45(range) * 3(RF) * (60(minute) * 60(second) / 
> 30(repeat/second)) / 5(tservers) = 324000 (tablets/tserver).{color}
> {color:#333333}{color:#ff0000}(created tablets: {color}{color:#ff0000}60m * 
> 60s / 30+s * 12(threads) = 1440 (tablets per hour)){color}{color}
> We deleted this table by kudu client tool, and found that the number of 
> 'INITIALIZED' tablets was going down slowly. By simple estimating it will 
> take 10+ days to be back to normal.  But luckily, the application system are 
> not affected.
> {code:java}
> ------------------------------------------------------Kudu-master log with 
> another tablet uuid
> I1031 16:21:21.644222 180146 catalog_manager.cc:2922] Sending 
> DeleteTablet(TABLET_DATA_DELETED) for tablet d1fd56be8eef44e782d509a0eeae9c15 
> on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by 
> ff4fd0a538944d69b8a6beea81e5bb01 at 2018-10-24 12:39:17 CST)
> W1031 16:21:21.644421 180146 catalog_manager.cc:2892] TS 
> 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for 
> tablet d1fd56be8eef44e782d509a0eeae9c15 with error code TABLET_NOT_RUNNING: 
> Already present: State transition of tablet d1fd56be8eef44e782d509a0eeae9c15 
> already in progress: creating tablet
> I1031 16:21:21.644436 180146 catalog_manager.cc:2700] Scheduling retry of 
> d1fd56be8eef44e782d509a0eeae9c15 Delete Tablet RPC for 
> TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 553 ms (attempt = 6)
> ------------------------------------------------------Kudu-tserver log
> I1031 16:21:22.197888 137341 tablet_service.cc:799] Processing DeleteTablet 
> for tablet d1fd56be8eef44e782d509a0eeae9c15 with delete_type 
> TABLET_DATA_DELETED (Replaced by ff4fd0a538944d69b8a6beea81e5bb01 at 
> 2018-10-24 12:39:17 CST) from {username='kudu'} at 10.120.219.118:50247
> I1031 16:21:22.230309 137131 maintenance_manager.cc:492] P 
> 39f15fcf42ef45bba0c95a3223dc25ee: 
> FlushDeltaMemStoresOp(70499bc0f9ac4d8196ae5a0be6ef0b8b) complete. Timing: 
> real 0.416s    user 0.404s     sys 0.008s Metrics: 
> {"fdatasync":3,"fdatasync_us":2583,"lbm_write_time_us":29,"lbm_writes_lt_1ms":4}
> I1031 16:21:22.321700 137341 tablet_service.cc:799] Processing DeleteTablet 
> for tablet 74a30181dea9400a9bcfaeb56f83f379 with delete_type 
> TABLET_DATA_DELETED (Replaced by 31e350fddea443048946f5a20d3171bd at 
> 2018-10-31 16:21:13 CST) from {username='kudu'} at 10.120.219.118:50247
> I1031 16:21:22.350440 137341 tablet_service.cc:799] Processing DeleteTablet 
> for tablet 7c864af01309432c9a2a4d1c88bbe52b with delete_type 
> TABLET_DATA_DELETED (Replaced by ec4b733818d940e0af32c51bda3c7^C
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to