[ https://issues.apache.org/jira/browse/KUDU-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Zhang updated KUDU-2702:
------------------------------
    Description: 
1. For our business we need to write every received signal into Kudu, so the 
write load is very high. We decided to use Spark Streaming with the Kudu client 
for this task; the code looks like: 
{code:java}
val kuduContext = new KuduContext("KuduMaster", trackRdd.sparkContext)
........
kuduContext.upsertRows(trackRdd, saveTable){code}
2. Checking the Spark log shows: 
{code:java}
2019-01-30 16:09:31 WARN TaskSetManager:66 - Lost task 0.0 in stage 38855.0 
(TID 25499, 192.168.33.158, executor 2): java.lang.RuntimeException: failed to 
write 1000 rows from DataFrame to Kudu; sample errors: Timed out: can not 
complete before timeout: Batch{operations=58, 
tablet="41f47fabf6964719befd06ad01bc133b" [0x000000088000016804FE4
800, 0x0000000880000168A4A36BFF), ignoreAllDuplicateRows=false, 
rpc=KuduRpc(method=Write, tablet=41f47fabf6964719befd06ad01bc133b, attempt=42, 
DeadlineTracker(timeout=3000
0, elapsed=29675), Traces: [0ms] querying master,
[0ms] Sub rpc: GetTableLocations sending RPC to server 
master-192.168.33.152:7051,
[3ms] Sub rpc: GetTableLocations received from server 
master-192.168.33.152:7051 response OK,
[3ms] sending RPC to server a33504d2e2fc4447aa054f2589b9f9ae,
[6ms] delaying RPC due to Illegal state: Replica 
a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role: FOLLOWER. 
Consensus state: current_term: 1639 committed_config { opid_index: 135 
OBSOLETE_local: false peers { permanent_uuid: 
"a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: 
"cm07" port: 7050 } } peers { permanent_uuid: 
"083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: 
"cm04" port: 7050 } } peers { permanent_uuid: 
"cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host: 
"cm02" port: 7050 } } } (error 0),
[6ms] received from server a33504d2e2fc4447aa054f2589b9f9ae response Illegal 
state: Replica a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. 
Role: FOLLOWER. Consensus state: current_term: 1639 committed_config { 
opid_index: 135 OBSOLETE_local: false peers { permanent_uuid: 
"a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: 
"cm07" port: 7050 } } peers { permanent_uuid: 
"083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: 
"cm04" port: 7050 } } peers { permanent_uuid: 
"cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host: 
"cm02" port: 7050 } } } (error0),

.....................

[793ms] querying master,
[793ms] Sub rpc: GetTableLocations sending RPC to server 
master-192.168.33.152:7051,
[795ms] Sub rpc: GetTableLocations received from server 
master-192.168.33.152:7051 response OK,
[796ms] sending RPC to server a33504d2e2fc4447aa054f2589b9f9ae,
[798ms] delaying RPC due to Illegal state: Replica 
a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role: FOLLOWER. 
Consensus state: current_term: 1639 committed_config { opid_index: 135 
OBSOLETE_local: false peers { permanent_uuid: 
"a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: 
"cm07" port: 7050 } } peers { permanent_uuid: 
"083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: 
"cm04" port: 7050 } } peers { permanent_uuid: 
"cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host: 
"cm02" port: 7050 } } } (error 0),
[799ms] received from server a33504d2e2fc4447aa054f2589b9f9ae response Illegal 
state: Replica a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. 
Role: FOLLOWER. Consensus state: current_term: 1639 committed_config { 
opid_index: 135 OBSOLETE_local: false peers { permanent_uuid: 
"a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: 
"cm07" port: 7050 } } peers { permanent_uuid: 
"083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: 
"cm04" port: 7050 } } peers { permanent_uuid: 
"cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host: 
"cm02" port: 7050 } } } (error 0),

[3552ms] querying master,
[3552ms] Sub rpc: GetTableLocations sending RPC to server 
master-192.168.33.152:7051,
[3553ms] Sub rpc: GetTableLocations received from server 
master-192.168.33.152:7051 response OK,
[3553ms] sending RPC to server a33504d2e2fc4447aa054f2589b9f9ae,
[3556ms] delaying RPC due to Illegal state: Replica 
a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role: FOLLOWER. 
Consensus state: current_term: 1639 committed_config { opid_index: 135 
OBSOLETE_local: false peers { permanent_uuid: 
"a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: 
"cm07" port: 7050 } } peers { permanent_uuid: 
"083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: 
"cm04" port: 7050 } } p{code}
 We get the same issue as [KUDU-2329|https://jira.apache.org/jira/browse/KUDU-2329].
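The timeout in the trace above can be read as a retry loop racing a fixed operation deadline: every Write RPC is rejected with "not leader of this config", the client re-resolves locations, backs off, and retries until the 30s deadline (DeadlineTracker timeout=30000) expires. A minimal sketch of that behavior, with illustrative constants and names (this is a simplified model, not the actual Kudu client code):

```java
import java.util.concurrent.ThreadLocalRandom;

// Simplified model of the client-side retry loop visible in the trace:
// each attempt fails because the tablet has no reachable leader, the client
// backs off with jitter and retries until the operation deadline expires.
public class RetryUntilDeadline {
    static final long TIMEOUT_MS = 30_000;  // DeadlineTracker(timeout=30000)

    // Exponential backoff with jitter (illustrative schedule, not the
    // client's actual one).
    static long backoffMs(int attempt) {
        long base = Math.min(1L << Math.min(attempt, 12), 4096);
        return base + ThreadLocalRandom.current().nextLong(base + 1);
    }

    /** Returns true if the write succeeds before the deadline. */
    static boolean write(boolean leaderAvailable) {
        long elapsed = 0;   // simulated elapsed time, no real sleeping
        int attempt = 0;
        while (elapsed < TIMEOUT_MS) {
            attempt++;
            // In the real client this is the Write RPC; with no leader it
            // always fails with "not leader of this config".
            if (leaderAvailable) return true;
            elapsed += backoffMs(attempt);
        }
        System.out.println("Timed out: can not complete before timeout (attempt=" + attempt + ")");
        return false;
    }

    public static void main(String[] args) {
        write(false);  // no reachable leader, as in the trace above
    }
}
```

The point is that the client never "gives up early": as long as no replica claims leadership, it keeps paying the backoff until the full 30s budget is spent, which is why every batch surfaces as a 30s timeout rather than a fast failure.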

3. Then we used {{kudu cluster ksck}} to check the status and found some tablets unavailable: 
{noformat}
Tablet bb7aff8f0d79458ebd263b57e7ed2848 of table 'impala::flyway.track_2018' is 
under-replicated: 1 replica(s) not RUNNING
a9efdf1d4c5d4bfd933876c2c9681e83 (cm01:7050): RUNNING
afba5bc65a93472683cb613a7c693b0f (cm03:7050): TS unavailable [LEADER]
4a00d2312d5042eeb41a1da0cc264213 (cm02:7050): RUNNING
All reported replicas are:
A = a9efdf1d4c5d4bfd933876c2c9681e83
B = afba5bc65a93472683cb613a7c693b0f
C = 4a00d2312d5042eeb41a1da0cc264213
The consensus matrix is:
Config source | Replicas | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
master | A B* C | | | Yes
A | A B* C | 25 | -1 | Yes
B | [config not available] | | | 
C | A B* C | 25 | -1 | Yes


Tablet 82e89518366840aaa3f8bd426818e001 of table 'impala::flyway.track_2017' is 
under-replicated: 1 replica(s) not RUNNING
afba5bc65a93472683cb613a7c693b0f (cm03:7050): TS unavailable [LEADER]
4a00d2312d5042eeb41a1da0cc264213 (cm02:7050): RUNNING
a9efdf1d4c5d4bfd933876c2c9681e83 (cm01:7050): RUNNING
All reported replicas are:
A = afba5bc65a93472683cb613a7c693b0f
B = 4a00d2312d5042eeb41a1da0cc264213
C = a9efdf1d4c5d4bfd933876c2c9681e83
The consensus matrix is:
Config source | Replicas | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
master | A* B C | | | Yes
A | [config not available] | | | 
B | A* B C | 29 | -1 | Yes
C | A B C | 28 | -1 | Yes

..........................................

The affected tables are in CONSENSUS_MISMATCH state:
Name                      | RF | Status             | Total Tablets | Healthy | Recovering | Under-replicated | Unavailable
--------------------------+----+--------------------+---------------+---------+------------+------------------+------------
impala::flyway.track_2017 | 3  | CONSENSUS_MISMATCH | 120           | 60      | 0          | 56               | 4
impala::flyway.track_2018 | 3  | CONSENSUS_MISMATCH | 120           | 60      | 0          | 56               | 4
{noformat}
 Because no leader is available, the Spark job gets stuck for a long time and 
data is lost. We find that the status of some tables randomly becomes 
*CONSENSUS_MISMATCH* and then recovers to *HEALTHY* after a while, during which 
the leaders of some tablets are unavailable. All operations happen within a 
LAN; the network and machines work fine. Each tablet server owns about 1500 
tablets, while the recommended value is 1000 tablets per server.

 By the way, I have a question about the voting process: each tablet seems to 
run its own election, so does a cluster with thousands of tablets need 
thousands of individual elections? If so, is there something that can be 
optimized?
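To make the question concrete, here is a toy model of per-tablet Raft elections (an illustration of the concern, not Kudu's actual implementation; all constants are made up):

```java
import java.util.concurrent.ThreadLocalRandom;

// Toy model: every tablet replica keeps its own randomized election timeout,
// so when a server dies, each tablet whose leader lived there times out and
// campaigns independently of all the others.
public class PerTabletElections {
    static final int HEARTBEAT_MS = 500;  // illustrative, not Kudu's default

    /** Randomized election timeout for one tablet replica (classic Raft rule:
     *  pick uniformly in a multiple of the heartbeat interval so replicas
     *  rarely time out at the same instant). */
    static int electionTimeoutMs() {
        return HEARTBEAT_MS * 3 + ThreadLocalRandom.current().nextInt(HEARTBEAT_MS * 3);
    }

    /** Vote-request RPCs generated if every one of `tablets` tablets whose
     *  leader was on the dead server starts its own election: each candidate
     *  asks every other voter in its Raft config for a vote. */
    static int voteRequests(int tablets, int replicationFactor) {
        return tablets * (replicationFactor - 1);
    }

    public static void main(String[] args) {
        // With ~1500 tablets per server and RF=3, one dead server can mean on
        // the order of 1500 elections and 3000 vote-request RPCs at once.
        System.out.println(voteRequests(1500, 3));
    }
}
```

Under this model the consensus traffic after a server failure scales linearly with the tablet count, which is one plausible reason an over-loaded server (1500 tablets vs. the recommended 1000) sees prolonged leaderless periods.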

Thanks.

> data lost by using spark kudu client during high load writing  
> ---------------------------------------------------------------
>
>                 Key: KUDU-2702
>                 URL: https://issues.apache.org/jira/browse/KUDU-2702
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, consensus, spark
>    Affects Versions: 1.8.0
>            Reporter: Simon Zhang
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
