[jira] [Created] (HBASE-29118) HBase RS_LOG_REPLAY Failed: Failed to recover lease(Cannot recoverLease)

SunQiang (Jira) Sun, 09 Feb 2025 18:15:44 -0800

SunQiang created HBASE-29118:
--------------------------------

             Summary: HBase RS_LOG_REPLAY Failed: Failed to recover 
lease(Cannot recoverLease)
                 Key: HBASE-29118
                 URL: https://issues.apache.org/jira/browse/HBASE-29118
             Project: HBase
          Issue Type: Bug
          Components: wal
    Affects Versions: 2.1.10
            Reporter: SunQiang



Background

Two RegionServers crashed (each RegionServer is deployed on the same node as 
a DataNode), and the Master started recovering their WAL files. However, one 
of the WALs got stuck during recovery, which blocked the recovery process and 
left some regions unavailable for reads and writes.

 

*1. The Master log shows that recovery of one WAL has been attempted 26 times*

2024-12-28 14:22:38,942 INFO 
[master/pro-alihbaseprod-hmaster-al01-054179:16000.splitLogManager..Chore.1] 
master.SplitLogManager: total=1, unassigned=0, 
tasks={/hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379=last_update
 = 1735366845774 *{color:#FF0000}last_version = 26{color}* cur_worker_name = 
10.xxx.xxx.252,16020,1724858168109 status = in_progress incarnation = 0 
resubmits = 0 batch = installed = 101 done = 100 error = 0}
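
For readers matching this task against the RegionServer logs, the following is a minimal sketch that decodes the task name and the epoch-millisecond fields from the log line above. The task name appears to be the WAL path relative to /hbase with one level of URL encoding applied, so a single decode yields the path used in the WALSplitter messages. The UTC+8 offset and the helper names are assumptions made for illustration; they are not stated in the report.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import unquote

# Task name copied from the Master log (the part after /hbase/splitWAL/).
TASK = ("WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F"
        "10.xxx.xxx.7%252C16020%252C1724855974451."
        "10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379")

# Assumption: the servers log in UTC+8; adjust if the cluster uses another zone.
LOG_TZ = timezone(timedelta(hours=8))


def task_to_wal_path(task: str) -> str:
    """Undo one level of URL encoding to get the path relative to /hbase."""
    return unquote(task)


def millis_to_local(ms: int) -> str:
    """Render an epoch-millisecond value in the assumed log time zone."""
    return datetime.fromtimestamp(ms / 1000, tz=LOG_TZ).strftime("%Y-%m-%d %H:%M:%S")


wal_path = task_to_wal_path(TASK)
print(wal_path)
# last_update field of the task
print(millis_to_local(1735366845774))               # 2024-12-28 14:20:45
# numeric suffix of the WAL file name
print(millis_to_local(int(wal_path.rsplit(".", 1)[1])))  # 2024-12-28 14:01:46
```

Under the UTC+8 assumption, the file-name suffix places this WAL at 14:01:46, about three minutes before the split task was acquired, and last_update falls at 14:20:45.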

 

{color:#FF0000}*2. Related RegionServer logs*{color}

2024-12-28 14:04:44,210 INFO [SplitLogWorker-10.xxx.xxx.252:16020] 
coordination.ZkSplitLogWorkerCoordination: worker 
10.xxx.xxx.252,16020,1724858168109 {color:#FF0000}acquired task{color} 
/hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379

--The worker takes ownership of the WAL

2024-12-28 14:04:44,211 INFO [main-EventThread] 
coordination.ZkSplitLogWorkerCoordination: task 
/hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379
 {color:#FF0000}preempted from{color} 
pro-alihbaseprod-al01-243252,16020,1724858168109, current task state and 
owner=OWNED 10.xxx.xxx.252,16020,1724858168109

--But here the WAL size is reported as 0

2024-12-28 14:04:44,235 INFO 
[RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1] 
wal.WALSplitter: Splitting 
WAL=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379,{color:#FF0000}
 *size=0 (0 bytes)*{color}

*--This indicates that the file is still open*

2024-12-28 14:04:44,236 WARN 
[RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1] 
wal.WALSplitter: File 
hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379
 {color:#FF0000}might be still open, length is 0{color}

*--Finally, these messages show that lease recovery failed*

2024-12-28 14:04:44,239 INFO 
[RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1] 
util.FSHDFSUtils: {color:#FF0000}Failed to recover lease{color}, attempt=0 on 
file=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379
 after 3ms

2024-12-28 14:20:49,781 WARN 
[RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1] 
util.FSHDFSUtils: {color:#FF0000}Cannot recoverLease after trying for 900000ms 
(hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!{color}; 
attempt=6 on 
file=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379
 after 965545ms
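
As a cross-check on the two FSHDFSUtils messages above, here is a small sketch that computes the time between them and compares it with the 900000 ms timeout quoted in the log. The helper name is hypothetical; the timestamps and figures are copied from the log lines.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S,%f"  # timestamp layout used in the log lines above


def elapsed_ms(start: str, end: str) -> int:
    """Whole milliseconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta // timedelta(milliseconds=1)


first_attempt = "2024-12-28 14:04:44,239"   # "Failed to recover lease, attempt=0 ... after 3ms"
gave_up = "2024-12-28 14:20:49,781"         # "Cannot recoverLease after trying for 900000ms"
timeout_ms = 900000                         # hbase.lease.recovery.timeout, as printed in the log

between = elapsed_ms(first_attempt, gave_up)
print(between)               # 965542
print(between + 3)           # 965545, matching "after 965545ms" in the log
print(between - timeout_ms)  # 65542 ms past the configured timeout
```

The figures agree with the log: the worker spent the whole timeout window, plus roughly 65 seconds, on this one file without the lease being recovered.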

 

 

Why did the service resume only after we switched the Master node? How can 
this be resolved quickly?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
