[
https://issues.apache.org/jira/browse/KUDU-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780255#comment-16780255
]
Simon Zhang edited comment on KUDU-2702 at 2/28/19 8:33 AM:
------------------------------------------------------------
By the way, the rebalancer can only move data among tablet servers, but the disks
within a single tablet server can't be rebalanced. Is there a way to achieve that?
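For reference, this is roughly how the CLI rebalancer is invoked (the master address is a
placeholder); as far as I can tell it only balances replicas across tablet servers, not
across the data directories of a single server:
{noformat}
kudu cluster rebalance <master-addresses>
{noformat}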
> data lost by using spark kudu client during high load writing
> ---------------------------------------------------------------
>
> Key: KUDU-2702
> URL: https://issues.apache.org/jira/browse/KUDU-2702
> Project: Kudu
> Issue Type: Bug
> Components: client, consensus, spark
> Affects Versions: 1.8.0
> Reporter: Simon Zhang
> Priority: Major
> Attachments: 屏幕快照 2019-02-20 下午4.57.08.png
>
>
> 1. For our business, we need to write every collected signal into Kudu, so the load
> is very high. We decided to use Spark Streaming with the Kudu client to fulfill this
> task. The code looks like below:
> {code:java}
> val kuduContext = new KuduContext("KuduMaster", trackRdd.sparkContext)
> ........
> kuduContext.upsertRows(trackRdd, saveTable){code}
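> Below is a minimal, self-contained sketch of that write path, only to make the report
> easier to follow; the master address, the socket source, the one-column schema, and the
> toTrackDF helper are placeholders and not part of the actual job:
> {code:scala}
> import org.apache.kudu.spark.kudu.KuduContext
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.{DataFrame, SparkSession}
> import org.apache.spark.streaming.{Seconds, StreamingContext}
>
> object TrackUpsertSketch {
>   // Hypothetical helper: map raw lines to a DataFrame matching the Kudu table schema.
>   def toTrackDF(spark: SparkSession, rdd: RDD[String]): DataFrame = {
>     import spark.implicits._
>     rdd.toDF("signal")
>   }
>
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().appName("track-upsert").getOrCreate()
>     val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
>     val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)
>
>     // Placeholder input stream; the real job reads the track signals from elsewhere.
>     val trackStream = ssc.socketTextStream("localhost", 9999)
>
>     trackStream.foreachRDD { rdd =>
>       // upsertRows expects a DataFrame, so each micro-batch is converted first.
>       kuduContext.upsertRows(toTrackDF(spark, rdd), "impala::flyway.track_2018")
>     }
>
>     ssc.start()
>     ssc.awaitTermination()
>   }
> }
> {code}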
> 2. Check the Spark log:
> {code:java}
> 2019-01-30 16:09:31 WARN TaskSetManager:66 - Lost task 0.0 in stage 38855.0
> (TID 25499, 192.168.33.158, executor 2): java.lang.RuntimeException: failed
> to write 1000 rows from DataFrame to Kudu; sample errors: Timed out: can not
> complete before timeout: Batch{operations=58,
> tablet="41f47fabf6964719befd06ad01bc133b" [0x000000088000016804FE4
> 800, 0x0000000880000168A4A36BFF), ignoreAllDuplicateRows=false,
> rpc=KuduRpc(method=Write, tablet=41f47fabf6964719befd06ad01bc133b,
> attempt=42, DeadlineTracker(timeout=3000
> 0, elapsed=29675), Traces: [0ms] querying master,
> [0ms] Sub rpc: GetTableLocations sending RPC to server
> master-192.168.33.152:7051,
> [3ms] Sub rpc: GetTableLocations received from server
> master-192.168.33.152:7051 response OK,
> [3ms] sending RPC to server a33504d2e2fc4447aa054f2589b9f9ae,
> [6ms] delaying RPC due to Illegal state: Replica
> a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role:
> FOLLOWER. Consensus state: current_term: 1639 committed_config { opid_index:
> 135 OBSOLETE_local: false peers { permanent_uuid:
> "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host:
> "cm07" port: 7050 } } peers { permanent_uuid:
> "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host:
> "cm04" port: 7050 } } peers { permanent_uuid:
> "cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host:
> "cm02" port: 7050 } } } (error 0),
> [6ms] received from server a33504d2e2fc4447aa054f2589b9f9ae response Illegal
> state: Replica a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config.
> Role: FOLLOWER. Consensus state: current_term: 1639 committed_config {
> opid_index: 135 OBSOLETE_local: false peers { permanent_uuid:
> "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host:
> "cm07" port: 7050 } } peers { permanent_uuid:
> "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host:
> "cm04" port: 7050 } } peers { permanent_uuid:
> "cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host:
> "cm02" port: 7050 } } } (error0),
> .....................
> [793ms] querying master,
> [793ms] Sub rpc: GetTableLocations sending RPC to server
> master-192.168.33.152:7051,
> [795ms] Sub rpc: GetTableLocations received from server
> master-192.168.33.152:7051 response OK,
> [796ms] sending RPC to server a33504d2e2fc4447aa054f2589b9f9ae,
> [798ms] delaying RPC due to Illegal state: Replica
> a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role:
> FOLLOWER. Consensus state: current_term: 1639 committed_config { opid_index:
> 135 OBSOLETE_local: false peers { permanent_uuid:
> "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host:
> "cm07" port: 7050 } } peers { permanent_uuid:
> "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host:
> "cm04" port: 7050 } } peers { permanent_uuid:
> "cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host:
> "cm02" port: 7050 } } } (error 0),
> [799ms] received from server a33504d2e2fc4447aa054f2589b9f9ae response
> Illegal state: Replica a33504d2e2fc4447aa054f2589b9f9ae is not leader of this
> config. Role: FOLLOWER. Consensus state: current_term: 1639 committed_config
> { opid_index: 135 OBSOLETE_local: false peers { permanent_uuid:
> "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host:
> "cm07" port: 7050 } } peers { permanent_uuid:
> "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host:
> "cm04" port: 7050 } } peers { permanent_uuid:
> "cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host:
> "cm02" port: 7050 } } } (error 0),
> [3552ms] querying master,
> [3552ms] Sub rpc: GetTableLocations sending RPC to server
> master-192.168.33.152:7051,
> [3553ms] Sub rpc: GetTableLocations received from server
> master-192.168.33.152:7051 response OK,
> [3553ms] sending RPC to server a33504d2e2fc4447aa054f2589b9f9ae,
> [3556ms] delaying RPC due to Illegal state: Replica
> a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role:
> FOLLOWER. Consensus state: current_term: 1639 committed_config { opid_index:
> 135 OBSOLETE_local: false peers { permanent_uuid:
> "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host:
> "cm07" port: 7050 } } peers { permanent_uuid:
> "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host:
> "cm04" port: 7050 } } p{code}
> We got the same issue as
> [KUDU-2329|https://jira.apache.org/jira/browse/KUDU-2329].
> 3. Then we used kudu cluster ksck to check the status and found some tablets unavailable:
> {noformat}
> Tablet bb7aff8f0d79458ebd263b57e7ed2848 of table 'impala::flyway.track_2018'
> is under-replicated: 1 replica(s) not RUNNING
> a9efdf1d4c5d4bfd933876c2c9681e83 (cm01:7050): RUNNING
> afba5bc65a93472683cb613a7c693b0f (cm03:7050): TS unavailable [LEADER]
> 4a00d2312d5042eeb41a1da0cc264213 (cm02:7050): RUNNING
> All reported replicas are:
> A = a9efdf1d4c5d4bfd933876c2c9681e83
> B = afba5bc65a93472683cb613a7c693b0f
> C = 4a00d2312d5042eeb41a1da0cc264213
> The consensus matrix is:
> Config source | Replicas | Current term | Config index | Committed?
> ---------------+------------------------+--------------+--------------+------------
> master | A B* C | | | Yes
> A | A B* C | 25 | -1 | Yes
> B | [config not available] | | |
> C | A B* C | 25 | -1 | Yes
> Tablet 82e89518366840aaa3f8bd426818e001 of table 'impala::flyway.track_2017'
> is under-replicated: 1 replica(s) not RUNNING
> afba5bc65a93472683cb613a7c693b0f (cm03:7050): TS unavailable [LEADER]
> 4a00d2312d5042eeb41a1da0cc264213 (cm02:7050): RUNNING
> a9efdf1d4c5d4bfd933876c2c9681e83 (cm01:7050): RUNNING
> All reported replicas are:
> A = afba5bc65a93472683cb613a7c693b0f
> B = 4a00d2312d5042eeb41a1da0cc264213
> C = a9efdf1d4c5d4bfd933876c2c9681e83
> The consensus matrix is:
> Config source | Replicas | Current term | Config index | Committed?
> ---------------+------------------------+--------------+--------------+------------
> master | A* B C | | | Yes
> A | [config not available] | | |
> B | A* B C | 29 | -1 | Yes
> C | A B C | 28 | -1 | Yes
> ..........................................
> The related tables show CONSENSUS_MISMATCH, like:
> Name                      | RF | Status             | Total Tablets | Healthy | Recovering | Under-replicated | Unavailable
> --------------------------+----+--------------------+---------------+---------+------------+------------------+------------
> impala::flyway.track_2017 | 3  | CONSENSUS_MISMATCH | 120           | 60      | 0          | 56               | 4
> impala::flyway.track_2018 | 3  | CONSENSUS_MISMATCH | 120           | 60      | 0          | 56               | 4
> {noformat}
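> For reference, the health report above is produced by the ksck tool; the invocation looks
> roughly like this (the master addresses are placeholders):
> {noformat}
> kudu cluster ksck master-1:7051,master-2:7051,master-3:7051
> {noformat}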
> Because no leader is available, the Spark job is stuck for a long time and then data is lost.
> We find that the status of some tables randomly becomes *CONSENSUS_MISMATCH* and then
> recovers to *HEALTHY* after a while, and the leaders of some tablets are unavailable.
> All operations are within a LAN, and the network and machines work fine; each tablet
> server owns 1500 tablets, while the recommended value is 1000 tablets.
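> As a rough back-of-envelope on the consensus traffic that replica count implies (this
> assumes Kudu's default 500 ms Raft heartbeat interval and the RF=3 shown above; the
> leader fraction is only an estimate):
> {noformat}
> 1500 replicas per server, ~1/3 of them leaders  ->  ~500 leader replicas
> 500 leaders x 2 followers                       =   1000 heartbeat RPCs per interval
> 1000 RPCs / 0.5 s                               ~   2000 outgoing heartbeats per second
> {noformat}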
> By the way, I have a question about the voting process: each tablet seems to have its own
> election process, so are thousands of individual elections needed when a server hosts
> thousands of tablets? If so, is there anything that can be optimized?
> Thanks.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)