[
https://issues.apache.org/jira/browse/KUDU-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272245#comment-16272245
]
ZhangZhen commented on KUDU-2206:
---------------------------------
To summarize this issue:
I have a table with about 30K DRSs (DiskRowSets) in one tablet, which causes
MaintenanceManager::FindBestOp to take a long time (40s) on the compaction
policy calculation. During that time the maintenance manager holds a lock that
the CreateTable RPC also needs, so the CreateTable RPC times out.
Todd made an improvement that short-circuits the compaction policy calculation
as soon as we know the compaction won't be worthwhile. It works for my case
because none of the DRSs of my table overlap. Thanks, [~tlipcon]! The review is
at https://gerrit.cloudera.org/#/c/8444/
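
The short-circuit idea can be sketched roughly as follows. This is only a
minimal Python illustration under stated assumptions, not the code from the
Gerrit review: rowsets are modeled as hypothetical (min_key, max_key) pairs
with inclusive bounds, and the names max_overlap_depth and
should_run_full_policy are invented for this example. It assumes that when no
two rowsets overlap, compaction cannot be worthwhile, so the expensive policy
calculation can be skipped after a cheap O(n log n) sweep.

```python
from typing import List, Tuple

# Hypothetical model of a DRS: (min_key, max_key), both bounds inclusive.
RowSet = Tuple[str, str]


def max_overlap_depth(rowsets: List[RowSet]) -> int:
    """Return the largest number of rowsets whose key ranges share a key."""
    events = []
    for lo, hi in rowsets:
        # 0 = range opens, 1 = range closes. At equal keys, opens sort first,
        # so ranges that merely touch (inclusive bounds) count as overlapping.
        events.append((lo, 0))
        events.append((hi, 1))
    events.sort()

    depth = 0
    best = 0
    for _, kind in events:
        if kind == 0:
            depth += 1
            best = max(best, depth)
        else:
            depth -= 1
    return best


def should_run_full_policy(rowsets: List[RowSet]) -> bool:
    """Cheap pre-check: only run the expensive calculation if rowsets overlap.

    Assumption for this sketch: with an overlap depth of at most 1,
    compaction is not worthwhile and the full calculation can be skipped.
    """
    return max_overlap_depth(rowsets) > 1
```

With 30,000 disjoint ranges, as in the case described here,
should_run_full_policy returns False, so the long calculation (and the time
spent holding the lock) would be avoided in this simplified model.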
> Create table timeout due to too many DRS in one tablet cause lock contention
> ----------------------------------------------------------------------------
>
> Key: KUDU-2206
> URL: https://issues.apache.org/jira/browse/KUDU-2206
> Project: Kudu
> Issue Type: Bug
> Affects Versions: 1.3.0
> Reporter: ZhangZhen
> Attachments: kudu_master.log, pstack.zip, trace_tserver07_trace.json,
> tserver07.flags, tserver_01_0f53a0d3.log, tserver_07_23f962e4a1.log,
> tsever_02_0a8bbcbb.log
>
>
> We encountered an RPC timeout exception when using Spark SQL, which uses the
> Java Kudu client internally, to create a table on a Kudu cluster. The cluster
> has 10 tservers and 1 master on 10 machines; the target table has 10 range
> partitions and 5 hash partitions.
> From the web UI, I found it took about 3 minutes before all the tablets
> elected a leader, and I can see many delete tablet records in the UI, like:
> Delete Tablet Running 2.13 min 719f0f496bc34a469e4069b2861b4be8 Delete
> Tablet RPC for TS=044f1da9a27c46acb82b1386f829f4dc
> I also found many retry records in the tserver logs, like:
> W1031 23:04:40.088256 5816 consensus_peers.cc:357] T
> fcde65c4e4cf4df29b9ef9884ce292b2 P 0f53a0d3ef7e44ebb0365c800752d5bd -> Peer
> 23f962e4a1744381ad5fa0d2d8b10241 (c3-kudu-tst-st07.bj:18700): Couldn't send
> request to peer 23f962e4a1744381ad5fa0d2d8b10241 for tablet
> fcde65c4e4cf4df29b9ef9884ce292b2. Error code: TABLET_NOT_RUNNING (12).
> Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next
> heartbeat period. Already tried 94 times.
> The attachments contain the master and tserver logs, starting from when the
> master received the create table request.
> The Kudu version is 1.3.0; the nearest commit is
> 00813f96b9cb0c9ec57a17e5c85242f7679db0e0
> The exception that the client received looks like:
> Error: org.apache.kudu.client.NonRecoverableException: RPC can not complete
> before timeout: KuduRpc(method=IsCreateTableDone, tablet=null, attempt=25,
> DeadlineTracker(timeout=30000, elapsed=28499), Traces: [0ms] sending RPC to
> server , [0ms] received from server response OK, [20ms] sending RPC to
> server , [20ms] received from server response OK, [40ms] sending RPC to
> server , [40ms] received from server response OK, [59ms] sending RPC to
> server , [60ms] received from server response OK, [80ms] sending RPC to
> server , [80ms] received from server response OK, [100ms] sending RPC to
> server , [100ms] received from server response OK, [140ms] sending RPC to
> server , [141ms] received from server response OK, [200ms] sending RPC to
> server , [200ms] received from server response OK, [319ms] sending RPC to
> server , [320ms] received from server response OK, [780ms] sending RPC to
> server , [780ms] received from server response OK, [2740ms] sending RPC to
> server , [2741ms] received from server response OK, [3580ms] sending RPC to
> server , [3580ms] received from server response OK, [4840ms] sending RPC to
> server , [4840ms] received from server response OK, [7080ms] sending RPC to
> server , [7081ms] received from server response OK, [8320ms] sending RPC to
> server , [8321ms] received from server response OK, [11620ms] sending RPC to
> server , [11621ms] received from server response OK, [13540ms] sending RPC
> to server , [13540ms] received from server response OK, [16819ms] sending
> RPC to server , [16820ms] received from server response OK, [19020ms]
> sending RPC to server , [19020ms] received from server response OK,
> [21340ms] sending RPC to server , [21341ms] received from server response
> OK, [24660ms] sending RPC to server , [24661ms] received from server
> response OK, [26800ms] sending RPC to server , [26800ms] received from server
> response OK, [27660ms] sending RPC to server , [27660ms] received from
> server response OK, [28480ms] sending RPC to server , [28481ms] received
> from server