[jira] [Commented] (HBASE-29118) HBase RS_LOG_REPLAY Failed: Failed to recover lease(Cannot recoverLease)

SunQiang (Jira) Sun, 09 Feb 2025 19:22:39 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-29118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17925466#comment-17925466
 ]


SunQiang commented on HBASE-29118:
----------------------------------

[~chaijunjie] Hello, could you help take a look at this issue? Could it be 
caused by the use of MultiWAL?

> HBase RS_LOG_REPLAY Failed: Failed to recover lease(Cannot recoverLease)
> ------------------------------------------------------------------------
>
>                 Key: HBASE-29118
>                 URL: https://issues.apache.org/jira/browse/HBASE-29118
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 2.1.10
>            Reporter: SunQiang
>            Priority: Major
>
> Background
> Two RegionServer nodes crashed (each RegionServer is co-located with a 
> DataNode on the same node), and the Master started recovering their WALs. 
> However, one WAL got stuck during recovery, which blocked the recovery 
> process and left some regions unable to serve reads or writes.
>  
> *1. The Master log shows that recovery of one WAL has been attempted 26 times*
> 2024-12-28 14:22:38,942 INFO 
> [master/pro-alihbaseprod-hmaster-al01-054179:16000.splitLogManager..Chore.1] 
> master.SplitLogManager: total=1, unassigned=0, 
> tasks={/hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379=last_update
>  = 1735366845774 *{color:#FF0000}last_version = 26{color}* cur_worker_name = 
> 10.xxx.xxx.252,16020,1724858168109 status = in_progress incarnation = 0 
> resubmits = 0 batch = installed = 101 done = 100 error = 0}
>  
> {color:#FF0000}*2. Related RegionServer logs*{color}
> 2024-12-28 14:04:44,210 INFO [SplitLogWorker-10.xxx.xxx.252:16020] 
> coordination.ZkSplitLogWorkerCoordination: worker 
> 10.xxx.xxx.252,16020,1724858168109 {color:#FF0000}acquired task{color} 
> /hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379
> --The worker acquires ownership of the WAL
> 2024-12-28 14:04:44,211 INFO [main-EventThread] 
> coordination.ZkSplitLogWorkerCoordination: task 
> /hbase/splitWAL/WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F10.xxx.xxx.7%252C16020%252C1724855974451.10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379
>  {color:#FF0000}preempted from{color} 
> pro-alihbaseprod-al01-243252,16020,1724858168109, current task state and 
> owner=OWNED 10.xxx.xxx.252,16020,1724858168109
> --But here the WAL size is reported as 0
> 2024-12-28 14:04:44,235 INFO 
> [RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1] 
> wal.WALSplitter: Splitting 
> WAL=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379,{color:#FF0000}
>  *size=0 (0 bytes)*{color}
> *--This indicates that the file is still open*
> 2024-12-28 14:04:44,236 WARN 
> [RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1] 
> wal.WALSplitter: File 
> hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379
>  {color:#FF0000}might be still open, length is 0{color}
> *--Finally, this message indicates that lease recovery failed*
> 2024-12-28 14:04:44,239 INFO 
> [RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1] 
> util.FSHDFSUtils: {color:#FF0000}Failed to recover lease{color}, attempt=0 on 
> file=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379
>  after 3ms
> 2024-12-28 14:20:49,781 WARN 
> [RS_LOG_REPLAY_OPS-regionserver/pro-alihbaseprod-al01-243252:16020-1] 
> util.FSHDFSUtils: {color:#FF0000}Cannot recoverLease after trying for 
> 900000ms (hbase.lease.recovery.timeout); continuing, but may be 
> DATALOSS!!!{color}; attempt=6 on 
> file=hdfs://aliHBaseProd/hbase/WALs/10.xxx.xxx.7,16020,1724855974451-splitting/10.xxx.xxx.7%2C16020%2C1724855974451.10.xxx.xxx.7%2C16020%2C1724855974451.regiongroup-1.1735365706379
>  after 965545ms
>  
>  
> Why did the service only resume after switching the Master node? How can 
> this be resolved quickly?
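
When reading the logs above, it can help to map the split task name in the Master log to the HDFS path in the RegionServer log, and to turn the epoch-millisecond values into wall-clock times. The following is only a minimal sketch, not part of the original report. It assumes the cluster logs in UTC+8 (the report does not state a time zone) and that the trailing number in the WAL file name is its creation time in milliseconds. The helper names `wal_path` and `ms_to_local` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import unquote

# Task znode name copied from the Master log (host octets are masked in the report).
TASK = (
    "WALs%2F10.xxx.xxx.7%2C16020%2C1724855974451-splitting%2F"
    "10.xxx.xxx.7%252C16020%252C1724855974451."
    "10.xxx.xxx.7%252C16020%252C1724855974451.regiongroup-1.1735365706379"
)

# Assumption: log timestamps are in UTC+8; the report does not say so explicitly.
ASSUMED_TZ = timezone(timedelta(hours=8))


def wal_path(task: str) -> str:
    """Undo one level of URL encoding to get the WAL path relative to /hbase."""
    return unquote(task)


def ms_to_local(epoch_ms: int) -> str:
    """Format an epoch-millisecond value in the assumed log time zone."""
    seconds = epoch_ms // 1000  # integer division avoids float rounding
    return datetime.fromtimestamp(seconds, tz=ASSUMED_TZ).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(wal_path(TASK))
    # last_update from the Master log
    print("last_update:", ms_to_local(1735366845774))
    # Trailing number of the WAL file name (assumed to be its creation time)
    print("WAL created:", ms_to_local(int(TASK.rsplit(".", 1)[1])))
```

Under these assumptions, the decoded name matches the path that WALSplitter reports (still carrying `%2C` in the file name), and the WAL appears to have been created only a few minutes before the split task was acquired.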



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
