aplio created HBASE-28053:
-----------------------------

             Summary: ServerCrashProcedure seems to fail when using Hadoop3.3.1+
                 Key: HBASE-28053
                 URL: https://issues.apache.org/jira/browse/HBASE-28053
             Project: HBase
          Issue Type: Bug
          Components: hadoop3, wal
            Reporter: aplio


HBase Cluster Issue with Server Crash Procedure After Region Server Goes Down

We are running an HBase 2.5.5 cluster (HBase jar taken from the hadoop3-bin 
entry on the [HBase download page|https://hbase.apache.org/downloads.html]) 
on Hadoop 3.3.2. When a region server went down and a ServerCrashProcedure 
was started for it, we hit an exception that prevented the cluster from 
recovering.

Below is a snippet of the exception:

```
2023-08-28 21:02:52,163 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter (WALSplitter.java:splitWAL(300)) - Splitting hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056, size=15.7 K (16082bytes)
2023-08-28 21:02:52,163 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils (RecoverLeaseFSUtils.java:recoverDFSFileLease(86)) - Recover lease on dfs file hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056
2023-08-28 21:02:52,164 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils (RecoverLeaseFSUtils.java:recoverLease(175)) - Recovered lease, attempt=0 on file=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056 after 0ms
2023-08-28 21:02:52,167 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter (WALSplitter.java:splitWAL(423)) - Processed 0 edits across 0 Regions in 4 ms; skipped=0; WAL=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056, size=15.7 K, length=16082, corrupted=false, cancelled=false
2023-08-28 21:02:52,167 ERROR [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] handler.RSProcedureHandler (RSProcedureHandler.java:process(53)) - pid=5848252
java.lang.NoSuchMethodError: 'org.apache.hadoop.hdfs.protocol.DatanodeInfo[] org.apache.hadoop.hdfs.protocol.LocatedBlock.getLocations()'
at org.apache.hadoop.hbase.fs.HFileSystem$ReorderWALBlocks.reorderBlocks(HFileSystem.java:428)
at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:367)
```

Upon investigation, this appears to be a consequence of HDFS-15255, which 
shipped in Hadoop 3.3.1. That change narrowed the return type of 
LocatedBlock.getLocations() from DatanodeInfo[] to DatanodeInfoWithStorage[]. 
The JVM resolves a method by its full descriptor, return type included, so 
the HBase 2.5.5 hadoop3 binary, which still refers to the DatanodeInfo[] 
variant at HFileSystem.java:428, fails with the NoSuchMethodError shown above. 
The relevant HBase code is 
[HFileSystem.java line 428|https://github.com/apache/hbase/blob/7ebd4381261fefd78fc2acf258a95184f4147cee/hbase-server/src/main/java/org/apache/hadoop/hbase/fs/HFileSystem.java#L428].
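
As a rough illustration of why the return-type change matters, and of one 
generic way a caller can stay tolerant of it, here is a minimal sketch. It 
uses hypothetical stand-in classes (NodeInfo, NodeInfoWithStorage, Block) 
rather than the real Hadoop types, and it is not necessarily what the patch 
mentioned in this issue does. A reflective lookup matches on method name and 
parameter types only, so it still finds the getter after its return type has 
been narrowed.

```java
import java.lang.reflect.Method;

public class ReturnTypeDemo {
    // Hypothetical stand-in for the base datanode type (not the Hadoop class).
    public static class NodeInfo {
        public final String host;
        public NodeInfo(String host) { this.host = host; }
    }

    // Hypothetical stand-in for the narrower subtype.
    public static class NodeInfoWithStorage extends NodeInfo {
        public NodeInfoWithStorage(String host) { super(host); }
    }

    // Models the newer API shape: the getter returns the narrower array type.
    public static class Block {
        public NodeInfoWithStorage[] getLocations() {
            return new NodeInfoWithStorage[] {
                new NodeInfoWithStorage("dn1"), new NodeInfoWithStorage("dn2")
            };
        }
    }

    // Looks the getter up by name and (empty) parameter list, so the call does
    // not depend on the return type that was visible at compile time. The
    // result is cast to the base array type, which array covariance permits.
    public static NodeInfo[] locationsOf(Object block)
            throws ReflectiveOperationException {
        Method m = block.getClass().getMethod("getLocations");
        return (NodeInfo[]) m.invoke(block);
    }

    public static void main(String[] args) throws Exception {
        NodeInfo[] locs = locationsOf(new Block());
        System.out.println(locs.length + " locations, first=" + locs[0].host);
    }
}
```

A direct call such as block.getLocations() is compiled to a reference that 
records the return type, which is why a jar built against the older 
signature breaks at link time even though the same source would recompile 
cleanly against the newer one.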

A potential workaround we identified is to rebuild HBase with the patch from 
the branch below. This appears to resolve the issue, at least for now:
https://github.com/aplio/hbase/tree/monkeypatch/fix-serverClashProcedure-caused-by-hbase-3-dataNodeInfo-change



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
