[
https://issues.apache.org/jira/browse/HBASE-29118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17925466#comment-17925466
]
SunQiang commented on HBASE-29118:
----------------------------------
[~chaijunjie] Hello, could you help take a look at this issue? Is it caused by
the use of MultiWAL?
> HBase RS_LOG_REPLAY Failed: Failed to recover lease(Cannot recoverLease)
> ------------------------------------------------------------------------
>
> Key: HBASE-29118
> URL: https://issues.apache.org/jira/browse/HBASE-29118
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 2.1.10
> Reporter: SunQiang
> Priority: Major
>
> Background
> RegionServers crashed on 2 nodes (RegionServer and DataNode are currently
> deployed on the same nodes), and the Master started recovering their WALs.
> However, one of the WALs got stuck during recovery, which blocked the
> recovery process and left some regions unable to serve reads or writes.
>
> *1. The Master log shows that recovery of one WAL has been attempted 26 times*
> 2024-12-28 14:22:38,942 INFO
> [master/pro-alihbaseprod-hmaster-al01-054179:16000.splitLogManager..Chore.1]
> master.SplitLogManager: total=1, unassigned=0,
> tasks={/hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379=last_update
> = 1735366845774 *{color:#FF0000}last_version = 26{color}* cur_worker_name =
> 10.xxx.xxx.252,16020,1724858168109 status = in_progress incarnation = 0
> resubmits = 0 batch = installed = 101 done = 100 error = 0}
>
> {color:#FF0000}*2. Related RegionServer logs*{color}
> 2024-12-28 14:04:44,210 INFO [SplitLogWorker-10.xxx.xxx.252:16020]
> coordination.ZkSplitLogWorkerCoordination: worker
> 10.xxx.xxx.252,16020,1724858168109 {color:#FF0000}acquired task{color}
> /hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379
> --The worker acquires ownership of the WAL
> 2024-12-28 14:04:44,211 INFO [main-EventThread]
> coordination.ZkSplitLogWorkerCoordination: task
> /hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379
> {color:#FF0000}preempted from{color}
> pro-alihbaseprod-al01-243252,16020,1724858168109, current task state and
> owner=OWNED 10.xxx.xxx.252,16020,1724858168109
> --But here the WAL size is reported as 0
> 2024-12-28 14:04:44,235 INFO
> [RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1]
> wal.WALSplitter: Splitting
> WAL=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379,{color:#FF0000}
> *size=0 (0 bytes)*{color}
> *--This indicates that the file is still open*
> 2024-12-28 14:04:44,236 WARN
> [RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1]
> wal.WALSplitter: File
> hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379
> {color:#FF0000}might be still open, length is 0{color}
> *--Finally, this message indicates that lease recovery has failed*
> 2024-12-28 14:04:44,239 INFO
> [RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1]
> util.FSHDFSUtils: {color:#FF0000}Failed to recover lease{color}, attempt=0 on
> file=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379
> after 3ms
> 2024-12-28 14:20:49,781 WARN
> [RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1]
> util.FSHDFSUtils: {color:#FF0000}Cannot recoverLease after trying for
> 900000ms (hbase.lease.recovery.timeout); continuing, but may be
> DATALOSS!!!{color}; attempt=6 on
> file=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379
> after 965545ms
>
>
> Why did the service resume only after a Master switchover? How can this be
> resolved quickly?
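
When correlating the two logs, it can help to map the split task name from the Master log back to the WAL file named in the RegionServer log. The sketch below assumes the task name is simply the URL-encoded WAL path relative to the HBase root directory, which is what the matching paths in the logs above suggest. The root directory hdfs://aliHBaseProd/hbase is taken from the RegionServer log, and the variable names are illustrative only.

```python
from urllib.parse import unquote

# Task name copied from the Master log (the IP octets were masked by the reporter).
task = (
    "WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F"
    "10.xxx.xxx.7%252C16020%252C1724855974451."
    "10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379"
)

# One round of URL-decoding: %2F becomes "/", %2C becomes ",",
# and the doubly encoded %252C becomes %2C, as it appears in the WAL file name.
wal_relative_path = unquote(task)

# Assumption: the HBase root directory is the one shown in the RegionServer log.
wal_path = "hdfs://aliHBaseProd/hbase/" + wal_relative_path
print(wal_path)
```

The decoded value is the same path that the WALSplitter and FSHDFSUtils messages report, so the task with last_version = 26 and the file whose lease could not be recovered refer to the same WAL.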