[ 
https://issues.apache.org/jira/browse/HBASE-28053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

aplio updated HBASE-28053:
--------------------------
    Description: 
HBase Cluster Issue with Server Crash Procedure After Region Server Goes Down

We are running an HBase cluster on version 2.5.5 (binaries taken from the 
hadoop3-bin link on the 
[HBase download page|https://hbase.apache.org/downloads.html]) together with 
Hadoop 3.3.2. When a region server went down and a ServerCrashProcedure was 
initiated, we encountered an exception that prevented our cluster from 
recovering.

Below is a snippet of the exception:

{code:java}
2023-08-28 21:02:52,163 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter (WALSplitter.java:splitWAL(300)) - Splitting hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056, size=15.7 K (16082bytes)
2023-08-28 21:02:52,163 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils (RecoverLeaseFSUtils.java:recoverDFSFileLease(86)) - Recover lease on dfs file hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056
2023-08-28 21:02:52,164 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils (RecoverLeaseFSUtils.java:recoverLease(175)) - Recovered lease, attempt=0 on file=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056 after 0ms
2023-08-28 21:02:52,167 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter (WALSplitter.java:splitWAL(423)) - Processed 0 edits across 0 Regions in 4 ms; skipped=0; WAL=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056, size=15.7 K, length=16082, corrupted=false, cancelled=false
2023-08-28 21:02:52,167 ERROR [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] handler.RSProcedureHandler (RSProcedureHandler.java:process(53)) - pid=5848252
java.lang.NoSuchMethodError: 'org.apache.hadoop.hdfs.protocol.DatanodeInfo[] org.apache.hadoop.hdfs.protocol.LocatedBlock.getLocations()'
    at org.apache.hadoop.hbase.fs.HFileSystem$ReorderWALBlocks.reorderBlocks(HFileSystem.java:428)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:367)
{code}
Upon investigation, this seems to be a consequence of a change introduced in 
Hadoop 3.3.1 under HDFS-15255: the return type of LocatedBlock.getLocations() 
changed from DatanodeInfo[] to DatanodeInfoWithStorage[]. The JVM resolves a 
method by its name and full descriptor, return type included, so an HBase 2.5.5 
binary whose call site at HFileSystem.java:428 was compiled against the 
DatanodeInfo[] signature cannot link against the newer Hadoop and fails with 
the NoSuchMethodError shown in the log. The relevant HBase code is 
[here|https://github.com/apache/hbase/blob/7ebd4381261fefd78fc2acf258a95184f4147cee/hbase-server/src/main/java/org/apache/hadoop/hbase/fs/HFileSystem.java#L428].
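
To confirm which getLocations() signature a given classpath actually provides, 
the declared return type can be read through reflection. The following is a 
minimal sketch, not part of HBase or Hadoop: the class name 
GetLocationsReturnType and the helper returnTypeOf are our own, and it assumes 
the Hadoop HDFS client jars are on the classpath when main is run.

```java
import java.lang.reflect.Method;

public class GetLocationsReturnType {

    /**
     * Looks up a public no-argument method by name only and returns the
     * name of its declared return type, e.g. "int" or "byte[]".
     */
    public static String returnTypeOf(String className, String methodName)
            throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        Method m = cls.getMethod(methodName);
        return m.getReturnType().getTypeName();
    }

    public static void main(String[] args) {
        try {
            // Prints the array type that LocatedBlock.getLocations() is
            // declared to return in the Hadoop version found on the classpath.
            System.out.println(returnTypeOf(
                "org.apache.hadoop.hdfs.protocol.LocatedBlock", "getLocations"));
        } catch (ReflectiveOperationException e) {
            // Reached when the class or method is not on the classpath.
            System.out.println("LocatedBlock.getLocations() not found: " + e);
        }
    }
}
```

If the printed type is not DatanodeInfo[], a binary compiled against the 
DatanodeInfo[] signature should be expected to hit the NoSuchMethodError 
from the log.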

A potential solution we identified is to rebuild HBase with the patch available 
in the repository below. This appears to rectify the issue (at least for now).
[https://github.com/aplio/hbase/tree/monkeypatch/fix-serverClashProcedure-caused-by-hbase-3-dataNodeInfo-change]
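
For illustration only (this is not the contents of the patch above), one 
general way for a caller to tolerate a changed return type is to invoke the 
getter through reflection, because a reflective lookup matches on name and 
parameter types and ignores the return type. The sketch below uses hypothetical 
stand-in classes (NodeInfo, NodeInfoWithStorage, Block) in place of the Hadoop 
types so that it runs without Hadoop; locationsOf is likewise our own name.

```java
import java.lang.reflect.Method;

public class ReflectiveGetLocations {

    // Hypothetical stand-in for the base datanode type.
    public static class NodeInfo {
        public final String name;
        public NodeInfo(String name) { this.name = name; }
    }

    // Hypothetical stand-in for the subclass returned by the newer getter.
    public static class NodeInfoWithStorage extends NodeInfo {
        public NodeInfoWithStorage(String name) { super(name); }
    }

    // Mimics the newer signature: the getter returns the subclass array type.
    public static class Block {
        public NodeInfoWithStorage[] getLocations() {
            return new NodeInfoWithStorage[] {
                new NodeInfoWithStorage("dn1"),
                new NodeInfoWithStorage("dn2")
            };
        }
    }

    /**
     * Calls getLocations() reflectively. The lookup does not mention the
     * return type, so it succeeds whichever array type the method declares.
     * Arrays are covariant, so a NodeInfoWithStorage[] can be cast to NodeInfo[].
     */
    public static NodeInfo[] locationsOf(Object block)
            throws ReflectiveOperationException {
        Method m = block.getClass().getMethod("getLocations");
        return (NodeInfo[]) m.invoke(block);
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        for (NodeInfo n : locationsOf(new Block())) {
            System.out.println(n.name);
        }
    }
}
```

The reflective call costs more than a direct call, so a real fix would likely 
cache the Method object rather than look it up on every invocation.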

 

The following issue helped us investigate and fix the problem:

https://issues.apache.org/jira/browse/HBASE-26198


> ServerCrashProcedure seems to fail when using Hadoop3.3.1+
> ----------------------------------------------------------
>
>                 Key: HBASE-28053
>                 URL: https://issues.apache.org/jira/browse/HBASE-28053
>             Project: HBase
>          Issue Type: Bug
>          Components: hadoop3, wal
>            Reporter: aplio
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
