[ 
https://issues.apache.org/jira/browse/KUDU-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393751#comment-16393751
 ] 

Alexey Serbin commented on KUDU-2152:
-------------------------------------

It may be useful to know that the stuck-replica problem is reproducible with 
{{raft_consensus_stress-itest}} in 3-2-3 mode (I was running a release build 
using dist-test).  I attached the full log for reference.

{noformat}
Tablet 6152840eeaca4e6c9df2ed2628fe6343 of table 'RemoveReplaceInCycle' is 
under-replicated: 1 replica(s) not RUNNING
  32b66acea4bc41f4ae61b0c7c301a877 (127.0.79.4:53796): bad state                
    State:       STOPPED                                                        
    Data state:  TABLET_DATA_TOMBSTONED                                         
    Last status: Deleted tablet blocks from disk                                
  4303cec039ae4a3b81ad46bebdefe20d (127.0.79.1:57883): RUNNING [LEADER]         
  f1c579c60213477087630fa4883170dc (127.0.79.6:40847): RUNNING                  
                                                                                
1 replicas' active configs differ from the master's.                            
  All the peers reported by the master and tablet servers are:                  
  A = 32b66acea4bc41f4ae61b0c7c301a877                                          
  B = 4303cec039ae4a3b81ad46bebdefe20d                                          
  C = f1c579c60213477087630fa4883170dc                                          
                                                                                
The consensus matrix is:                                                        
 Config source |    Voters    | Current term | Config index | Committed?        
---------------+--------------+--------------+--------------+------------       
 master        | A   B*  C    |              |              | Yes               
 A             | A   B   C    | 4            | 103          | Yes               
 B             | A   B*  C    | 4            | 110          | Yes               
 C             | A   B*  C    | 4            | 110          | Yes    
{noformat}
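
As a rough illustration of how to read the matrix above, here is a minimal Python sketch that parses the rows and reports which replicas hold an older config than their peers. The helper names ({{parse_matrix}}, {{lagging_replicas}}) are hypothetical and not part of ksck or any Kudu tool, and the column layout is assumed to match the output pasted in this comment.

```python
# Hypothetical helper for reading a ksck-style consensus matrix.
# It is not part of ksck or any Kudu tool.

MATRIX = """\
 Config source |    Voters    | Current term | Config index | Committed?
---------------+--------------+--------------+--------------+------------
 master        | A   B*  C    |              |              | Yes
 A             | A   B   C    | 4            | 103          | Yes
 B             | A   B*  C    | 4            | 110          | Yes
 C             | A   B*  C    | 4            | 110          | Yes
"""


def parse_matrix(text):
    """Return {config source: {"voters", "term", "index"}} for each data row."""
    rows = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Skip the header, the separator line, and anything malformed.
        if len(cells) != 5 or cells[0] in ("", "Config source"):
            continue
        source, voters, term, index, _committed = cells
        rows[source] = {
            # Drop the "*" leader marker so voter lists compare cleanly.
            "voters": [v.rstrip("*") for v in voters.split()],
            "term": int(term) if term else None,
            "index": int(index) if index else None,
        }
    return rows


def lagging_replicas(rows):
    """Replicas whose config index is below the newest one any replica reports."""
    indexes = {
        source: row["index"]
        for source, row in rows.items()
        if source != "master" and row["index"] is not None
    }
    if not indexes:
        return []
    newest = max(indexes.values())
    return sorted(source for source, index in indexes.items() if index < newest)


print(lagging_replicas(parse_matrix(MATRIX)))
```

For the matrix in this comment the sketch flags A, whose config index is 103 while B and C are at 110. A is also the replica reported as STOPPED and tombstoned.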

> Tablet stuck under-replicated after some kind of tablet copy issue
> ------------------------------------------------------------------
>
>                 Key: KUDU-2152
>                 URL: https://issues.apache.org/jira/browse/KUDU-2152
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.5.0
>            Reporter: Todd Lipcon
>            Assignee: Andrew Wong
>            Priority: Critical
>
> I was stress testing with the following setup:
> - 8 servers (n1-standard-4 GCE boxes)
> - created a bunch of 100-tablet tables using loadgen until I had ~2500 
> replicas on each server
> - mounted another server using sshfs and put cmeta on that mount point (to 
> slow down cmeta writes)
> - stress -c4 on all machines
> - shut down a server and wait for re-replication (green ksck), restart the 
> server, rinse and repeat
> Eventually I got a stuck tablet. ksck reports:
> {code}
> Tablet 271df8901d98442cb478593babd8a609 of table 
> 'loadgen_auto_8e32cb07eb83458da4ec4d228bcb0f5a' is under-replicated: 1 
> replica(s) not RUNNING
>   20d4d86f182043398594b67492d13fdc (kudu513-8.gce.cloudera.com:7050): RUNNING 
> [LEADER]
>   c2ea8f22f4034bcc97e26c9236811960 (kudu513-1.gce.cloudera.com:7050): bad 
> state
>     State:       STOPPED
>     Data state:  TABLET_DATA_COPYING
>     Last status: Deleted tablet blocks from disk
>   cd0997b908ad41839f56a1b61210f2d4 (kudu513-3.gce.cloudera.com:7050): RUNNING
> 1 replicas' active configs differ from the master's.
>   All the peers reported by the master and tablet servers are:
>   A = 20d4d86f182043398594b67492d13fdc
>   D = 471027436ee8405ab7cdf8d22407696b
>   B = c2ea8f22f4034bcc97e26c9236811960
>   C = cd0997b908ad41839f56a1b61210f2d4
> The consensus matrix is:
>  Config source |      Voters      | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A*      B   C    |              |              | Yes
>  A             | A*      B   C    | 11           | 29           | Yes
>  B             |     D   B   C    | 9            | 23           | Yes
>  C             | A*      B   C    | 11           | 29           | Yes
> {code}
> The leader ("A" above) just keeps reporting that it's failing to send 
> requests to "B" because it's getting TABLET_NOT_RUNNING. So it never evicts 
> it (the leader treats TABLET_NOT_RUNNING as a temporary condition assuming 
> that it actually means BOOTSTRAPPING).
> The last entries in "B"'s log were:
> {code}
> I0920 16:41:48.556422  3808 tablet_copy_client.cc:209] T 
> 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: tablet 
> copy: Beginning tablet copy session from remote peer at address 
> kudu513-8.gce.cloudera.com:7050
> I0920 16:41:48.562335  3808 ts_tablet_manager.cc:1118] T 
> 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: Deleting 
> tablet data with delete state TABLET_DATA_COPYING
> W0920 16:41:48.578610  3808 env_util.cc:277] Failed to determine if path is a 
> directory: 
> /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: 
> Not found: 
> /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: 
> No such file or directory (error 2)
> {code}
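
The description above says the leader never evicts "B" because it treats TABLET_NOT_RUNNING as a temporary condition, on the assumption that it really means BOOTSTRAPPING. The Python sketch below only illustrates that distinction. It is not Kudu's code or the fix for this issue. The {{ReplicaState}} enum, the {{should_evict}} function, and the {{timeout_s}} parameter are all hypothetical, and the sketch assumes the leader could learn the follower's actual state.

```python
from enum import Enum


class ReplicaState(Enum):
    # Only the states named in this issue; a real server has more.
    BOOTSTRAPPING = "BOOTSTRAPPING"
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"


def should_evict(reported_state, seconds_failing, timeout_s=300.0):
    """Hypothetical leader-side policy for a follower that keeps failing.

    timeout_s is an illustrative grace period, not a real Kudu setting.
    """
    if reported_state is ReplicaState.RUNNING:
        return False
    if reported_state is ReplicaState.BOOTSTRAPPING:
        # Transient: the replica is expected to start serving soon.
        # Per the description, every TABLET_NOT_RUNNING is treated this way.
        return False
    # STOPPED: the replica will not recover on its own, so evict it once
    # the grace period has elapsed.
    return seconds_failing >= timeout_s


print(should_evict(ReplicaState.STOPPED, 600.0))
print(should_evict(ReplicaState.BOOTSTRAPPING, 600.0))
```

Under these assumptions a STOPPED replica is evicted after the grace period, while a BOOTSTRAPPING one is retried indefinitely, which matches what the leader is described as doing for every TABLET_NOT_RUNNING response.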



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
