[ https://issues.apache.org/jira/browse/KUDU-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ZhangZhen updated KUDU-2206: ---------------------------- Attachment: trace_tserver07_trace.json tserver07.flags Just let the trace "record" runs for 40 minutes and get the trace json file, also attach the flags of tserver07, hope it helps. > Kudu client create table timeout > -------------------------------- > > Key: KUDU-2206 > URL: https://issues.apache.org/jira/browse/KUDU-2206 > Project: Kudu > Issue Type: Bug > Affects Versions: 1.3.0 > Reporter: ZhangZhen > Priority: Major > Attachments: kudu_master.log, pstack.zip, trace_tserver07_trace.json, > tserver07.flags, tserver_01_0f53a0d3.log, tserver_07_23f962e4a1.log, > tsever_02_0a8bbcbb.log > > > We encountered rpc timeout exception when we use sparksql, which use Java > kudu client innerly, to create table on kudu cluster. The cluster has 10 > tserver and 1 master on 10 machines, the target table has 10 range partitions > and 5 hash partitions. > From the web UI, I found it spent about 3 minutes before all the tablets vote > a leader, and I can see a lot delete tablet records in the UI like: > Delete Tablet Running 2.13 min 719f0f496bc34a469e4069b2861b4be8 Delete > Tablet RPC for TS=044f1da9a27c46acb82b1386f829f4dc > Also I find many retry records in tserver logs, like: > W1031 23:04:40.088256 5816 consensus_peers.cc:357] T > fcde65c4e4cf4df29b9ef9884ce292b2 P 0f53a0d3ef7e44ebb0365c800752d5bd -> Peer > 23f962e4a1744381ad5fa0d2d8b10241 (c3-kudu-tst-st07.bj:18700): Couldn't send > request to peer 23f962e4a1744381ad5fa0d2d8b10241 for tablet > fcde65c4e4cf4df29b9ef9884ce292b2. Error code: TABLET_NOT_RUNNING (12). > Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next > heartbeat period. Already tried 94 times. > You can find the logs of master and tserver since master receive the create > table request in the attachment. > The kudu version is 1.3.0, the nearest commit is > 00813f96b9cb0c9ec57a17e5c85242f7679db0e0 > The exception that client received is like: > Error: org.apache.kudu.client.NonRecoverableException: RPC can not complete > before timeout: KuduRpc(method=IsCreateTableDone, tablet=null, attempt=25, > DeadlineTracker(timeout=30000, elapsed=28499), Traces: [0ms] sending RPC to > server , [0ms] received from server response OK, [20ms] sending RPC to > server , [20ms] received from server response OK, [40ms] sending RPC to > server , [40ms] received from server response OK, [59ms] sending RPC to > server , [60ms] received from server response OK, [80ms] sending RPC to > server , [80ms] received from server response OK, [100ms] sending RPC to > server , [100ms] received from server response OK, [140ms] sending RPC to > server , [141ms] received from server response OK, [200ms] sending RPC to > server , [200ms] received from server response OK, [319ms] sending RPC to > server , [320ms] received from server response OK, [780ms] sending RPC to > server , [780ms] received from server response OK, [2740ms] sending RPC to > server , [2741ms] received from server response OK, [3580ms] sending RPC to > server , [3580ms] received from server response OK, [4840ms] sending RPC to > server , [4840ms] received from server response OK, [7080ms] sending RPC to > server , [7081ms] received from server response OK, [8320ms] sending RPC to > server , [8321ms] received from server response OK, [11620ms] sending RPC to > server , [11621ms] received from server response OK, [13540ms] sending RPC > to server , [13540ms] received from server response OK, [16819ms] sending > RPC to server , [16820ms] received from server response OK, [19020ms] > sending RPC to server , [19020ms] received from server response OK, > [21340ms] sending RPC to server , [21341ms] received from server response > OK, [24660ms] sending RPC to server , [24661ms] received from server > response OK, [26800ms] sending RPC to server , [26800ms] received from server > response OK, [27660ms] sending RPC to server , [27660ms] received from > server response OK, [28480ms] sending RPC to server , [28481ms] received > from server -- This message was sent by Atlassian JIRA (v6.4.14#64029)