SunQiang created HBASE-29118:
--------------------------------

             Summary: HBase RS_LOG_REPLAY Failed: Failed to recover lease(Cannot recoverLease)
                 Key: HBASE-29118
                 URL: https://issues.apache.org/jira/browse/HBASE-29118
             Project: HBase
          Issue Type: Bug
          Components: wal
    Affects Versions: 2.1.10
            Reporter: SunQiang

Background

RegionServers crashed on 2 nodes (the RegionServer and DataNode are currently deployed on the same node), and the Master started recovering their WALs. However, one WAL got stuck during recovery, which blocked the recovery process and left some regions unavailable for reads and writes.

*1. The Master log shows that one WAL has gone through recovery 26 times*

2024-12-28 14:22:38,942 INFO [master/pro-alihbaseprod-hmaster-al01-054179:16000.splitLogManager..Chore.1] master.SplitLogManager: total=1, unassigned=0, tasks={/hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379=last_update = 1735366845774 *{color:#FF0000}last_version = 26{color}* cur_worker_name = 10.xxx.xxx.252,16020,1724858168109 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 101 done = 100 error = 0}

{color:#FF0000}*2. Related RegionServer logs*{color}

2024-12-28 14:04:44,210 INFO [SplitLogWorker-10.xxx.xxx.252:16020] coordination.ZkSplitLogWorkerCoordination: worker 10.xxx.xxx.252,16020,1724858168109 {color:#FF0000}acquired task{color} /hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379 -- The worker acquires ownership of the WAL

2024-12-28 14:04:44,211 INFO [main-EventThread] coordination.ZkSplitLogWorkerCoordination: task /hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379 {color:#FF0000}preempted from{color} pro-alihbaseprod-al01-243252,16020,1724858168109, current task state and owner=OWNED 10.xxx.xxx.252,16020,1724858168109

-- But here the WAL size is reported as 0

2024-12-28 14:04:44,235 INFO [RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1] wal.WALSplitter: Splitting WAL=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379,{color:#FF0000} *size=0 (0 bytes)*{color}

*-- This indicates that the file is still open*

2024-12-28 14:04:44,236 WARN [RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1] wal.WALSplitter: File hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379 {color:#FF0000}might be still open, length is 0{color}

*-- Finally, these messages indicate that lease recovery failed*

2024-12-28 14:04:44,239 INFO [RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1] util.FSHDFSUtils: {color:#FF0000}Failed to recover lease{color}, attempt=0 on file=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379 after 3ms

2024-12-28 14:20:49,781 WARN [RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1] util.FSHDFSUtils: {color:#FF0000}Cannot recoverLease after trying for 900000ms (hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!{color}; attempt=6 on file=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379 after 965545ms

Why did the service resume only after the Master node was switched? How can this be resolved quickly?
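
For anyone cross-checking the Master and RegionServer entries above, here is a rough Python sketch that maps the URL-encoded splitWAL task name to the WAL path under /hbase and converts the epoch-millisecond values (last_update, the WAL file-name suffix) into log-style timestamps. The function names are hypothetical helpers, not HBase APIs, and the UTC+8 log timezone is an assumption inferred from the timestamps, not something stated in the logs.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import unquote

# Assumption: the cluster writes its logs in UTC+8.
LOG_TZ = timezone(timedelta(hours=8))


def task_to_wal_path(task_znode: str, root: str = "/hbase") -> str:
    """Map a splitWAL task znode (hypothetical helper) to the WAL path under root.

    The last znode component is the URL-encoded relative WAL path. Decoding it
    once turns %2F into '/' and %252C into the literal %2C that is part of the
    WAL file name on HDFS.
    """
    encoded = task_znode.rsplit("/", 1)[-1]
    return f"{root}/{unquote(encoded)}"


def epoch_ms_to_log_time(ms: int) -> str:
    """Render an epoch-millisecond value in the (assumed) log timezone."""
    return datetime.fromtimestamp(ms // 1000, tz=LOG_TZ).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    task = (
        "/hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting"
        "%2F10.xxx.xxx.7%252C16020%252C1724855974451"
        ".10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379"
    )
    print(task_to_wal_path(task))
    print(epoch_ms_to_log_time(1735366845774))  # last_update from the Master log
    print(epoch_ms_to_log_time(1735365706379))  # suffix of the WAL file name
```

Under the UTC+8 assumption, last_update falls at 2024-12-28 14:20:45, a few seconds before the "Cannot recoverLease" warning at 14:20:49, and the WAL file-name suffix falls at 14:01:46, about three minutes before the split task was acquired.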