SunQiang created HBASE-29118:
--------------------------------
Summary: HBase RS_LOG_REPLAY Failed: Failed to recover
lease(Cannot recoverLease)
Key: HBASE-29118
URL: https://issues.apache.org/jira/browse/HBASE-29118
Project: HBase
Issue Type: Bug
Components: wal
Affects Versions: 2.1.10
Reporter: SunQiang
Background
Two RegionServer nodes crashed (each RegionServer is deployed on the same node
as a DataNode), and the Master started recovering their WALs. However, one WAL
got stuck during recovery, which blocked the recovery process and left some
regions unable to serve reads or writes.
*1. The Master log shows that one WAL file has been recovered 26 times*
2024-12-28 14:22:38,942 INFO
[master/pro-alihbaseprod-hmaster-al01-054179:16000.splitLogManager..Chore.1]
master.SplitLogManager: total=1, unassigned=0,
tasks={/hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379=last_update
= 1735366845774 *{color:#FF0000}last_version = 26{color}* cur_worker_name =
10.xxx.xxx.252,16020,1724858168109 status = in_progress incarnation = 0
resubmits = 0 batch = installed = 101 done = 100 error = 0}
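For reference, a minimal Python sketch (not part of HBase) that pulls the fields out of the task status text above and converts last_update to a timestamp. The status string is copied from the log with line wraps and Jira color markup removed; the helper name parse_task_status is made up for illustration, and treating last_update as epoch milliseconds is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone

# Status text copied from the SplitLogManager line above (task path omitted).
STATUS = (
    "last_update = 1735366845774 last_version = 26 "
    "cur_worker_name = 10.xxx.xxx.252,16020,1724858168109 "
    "status = in_progress incarnation = 0 resubmits = 0 batch = "
    "installed = 101 done = 100 error = 0"
)


def parse_task_status(text):
    """Return the `key = value` pairs of a task status string as a dict.

    The lookahead skips `batch = `, whose "value" is itself a nested
    group of `key = value` pairs (installed/done/error).
    """
    pairs = re.findall(r"(\w+) = ([^\s=]+)(?=\s+\w+ = |\s*$)", text)
    return dict(pairs)


fields = parse_task_status(STATUS)

# Tasks of the batch that are installed but not yet done.
pending = int(fields["installed"]) - int(fields["done"])

# Assumption: last_update is milliseconds since the Unix epoch.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
last_update_utc = epoch + timedelta(milliseconds=int(fields["last_update"]))

print(fields["last_version"], fields["status"], pending)
print(last_update_utc.isoformat())
```

With these inputs, pending is 1, which agrees with total=1 in the same log line, and last_update comes out as 2024-12-28 06:20:45.774 UTC. If the cluster logs in UTC+8 (an assumption), that is 14:20:45 local time.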
{color:#FF0000}*2. Related RegionServer logs*{color}
2024-12-28 14:04:44,210 INFO [SplitLogWorker-10.xxx.xxx.252:16020]
coordination.ZkSplitLogWorkerCoordination: worker
10.xxx.xxx.252,16020,1724858168109 {color:#FF0000}acquired task{color}
/hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379
--The worker acquires ownership of the WAL
2024-12-28 14:04:44,211 INFO [main-EventThread]
coordination.ZkSplitLogWorkerCoordination: task
/hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379
{color:#FF0000}preempted from{color}
pro-alihbaseprod-al01-243252,16020,1724858168109, current task state and
owner=OWNED 10.xxx.xxx.252,16020,1724858168109
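To relate the task node under /hbase/splitWAL to the HDFS file named in the WALSplitter messages, the node name can be URL-decoded once. The sketch below is illustrative only: it uses Python's standard urllib.parse.unquote, the helper name task_node_to_wal_path is made up, and the hdfs://aliHBaseProd/hbase/ prefix is taken from the log lines in this report.

```python
from urllib.parse import unquote

# Task node name copied from the log above (text after /hbase/splitWAL/).
TASK_NODE = (
    "WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F"
    "10.xxx.xxx.7%252C16020%252C1724855974451."
    "10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379"
)

# Root directory prefix as it appears in the WALSplitter log lines.
HBASE_ROOT = "hdfs://aliHBaseProd/hbase/"


def task_node_to_wal_path(node_name):
    """Undo one level of URL encoding: %2F -> '/', %2C -> ',', %252C -> %2C."""
    return unquote(node_name)


wal_path = HBASE_ROOT + task_node_to_wal_path(TASK_NODE)
print(wal_path)
```

A single decoding pass is deliberate: the file name itself contains %2C, so it appears double-encoded (%252C) in the task node name and must stay encoded in the resulting path.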
--But here the WAL size is reported as 0
2024-12-28 14:04:44,235 INFO
[RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1]
wal.WALSplitter: Splitting
WAL=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379,{color:#FF0000}
*size=0 (0 bytes)*{color}
*--This indicates that the file is still open*
2024-12-28 14:04:44,236 WARN
[RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1]
wal.WALSplitter: File
hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379
{color:#FF0000}might be still open, length is 0{color}
*--Finally, these messages show that lease recovery failed*
2024-12-28 14:04:44,239 INFO
[RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1]
util.FSHDFSUtils: {color:#FF0000}Failed to recover lease{color}, attempt=0 on
file=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379
after 3ms
2024-12-28 14:20:49,781 WARN
[RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1]
util.FSHDFSUtils: {color:#FF0000}Cannot recoverLease after trying for 900000ms
(hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!{color};
attempt=6 on
file=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379
after 965545ms
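As a sanity check on the two FSHDFSUtils messages above, the elapsed time can be recomputed from the log timestamps. This is only arithmetic on values copied from the logs (a sketch, with variable names chosen for illustration); it does not explain why the lease could not be recovered.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps of the attempt=0 and attempt=6 messages above.
first_attempt = datetime.strptime("2024-12-28 14:04:44,239", FMT)
gave_up = datetime.strptime("2024-12-28 14:20:49,781", FMT)

# attempt=0 was logged 3 ms after lease recovery started.
offset_ms = 3

# hbase.lease.recovery.timeout value reported in the WARN message.
timeout_ms = 900000

between_logs_ms = (gave_up - first_attempt) // timedelta(milliseconds=1)
total_ms = between_logs_ms + offset_ms
overshoot_ms = total_ms - timeout_ms

print(between_logs_ms, total_ms, overshoot_ms)
```

The recomputed total of 965545 ms matches the "after 965545ms" in the WARN message, so this worker spent about 16 minutes on the file, roughly 65 seconds beyond the 900000 ms timeout, before continuing without the lease.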
Why did service resume only after switching over the Master node? How can this
be resolved quickly?