aplio created HBASE-28053:
-----------------------------
Summary: ServerCrashProcedure seems to fail when using Hadoop3.3.1+
Key: HBASE-28053
URL: https://issues.apache.org/jira/browse/HBASE-28053
Project: HBase
Issue Type: Bug
Components: hadoop3, wal
Reporter: aplio
HBase Cluster Issue with Server Crash Procedure After Region Server Goes Down
We are running an HBase cluster with version 2.5.5 (HBase jar sourced from the
[HBase download page|https://hbase.apache.org/downloads.html] under
hadoop3-bin) paired with Hadoop version 3.3.2. When a region server went down
and a ServerCrashProcedure was started for it, WAL splitting failed with an
exception that prevented our cluster from recovering.
Below is a snippet of the exception:
```
2023-08-28 21:02:52,163 INFO
[RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter
(WALSplitter.java:splitWAL(300)) - Splitting
hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056,
size=15.7 K (16082bytes)
2023-08-28 21:02:52,163 INFO
[RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils
(RecoverLeaseFSUtils.java:recoverDFSFileLease(86)) - Recover lease on dfs file
hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056
2023-08-28 21:02:52,164 INFO
[RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils
(RecoverLeaseFSUtils.java:recoverLease(175)) - Recovered lease, attempt=0 on
file=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056
after 0ms
2023-08-28 21:02:52,167 INFO
[RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter
(WALSplitter.java:splitWAL(423)) - Processed 0 edits across 0 Regions in 4 ms;
skipped=0;
WAL=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056,
size=15.7 K, length=16082, corrupted=false, cancelled=false
2023-08-28 21:02:52,167 ERROR
[RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1]
handler.RSProcedureHandler (RSProcedureHandler.java:process(53)) - pid=5848252
java.lang.NoSuchMethodError: 'org.apache.hadoop.hdfs.protocol.DatanodeInfo[]
org.apache.hadoop.hdfs.protocol.LocatedBlock.getLocations()'
at
org.apache.hadoop.hbase.fs.HFileSystem$ReorderWALBlocks.reorderBlocks(HFileSystem.java:428)
at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:367)
```
Upon investigation, this seems to be a consequence of the changes introduced in
Hadoop 3.3.1 under HDFS-15255. The return type of LocatedBlock.getLocations()
was changed from DatanodeInfo[] to DatanodeInfoWithStorage[]. Because the JVM
links a method call by its full descriptor, return type included, the HBase
2.5.5 binary, which still references the DatanodeInfo[] variant at
HFileSystem.java:428, fails with the NoSuchMethodError shown above. The relevant
HBase code is
[HFileSystem.java#L428|https://github.com/apache/hbase/blob/7ebd4381261fefd78fc2acf258a95184f4147cee/hbase-server/src/main/java/org/apache/hadoop/hbase/fs/HFileSystem.java#L428].
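To illustrate why a changed return type alone is enough to break an already
compiled caller, here is a minimal sketch that runs without Hadoop on the
classpath. DatanodeInfo and NarrowerInfo below are hypothetical stand-in classes
defined in the example itself, not the HDFS types; the point is only that the
JVM method descriptor differs once the return type differs, even when the new
type is a subclass of the old one.

```java
import java.lang.invoke.MethodType;

public class ReturnTypeDescriptorDemo {
    // Hypothetical stand-ins (assumptions for illustration, not the real HDFS classes).
    static class DatanodeInfo {}
    static class NarrowerInfo extends DatanodeInfo {}

    // Builds the JVM descriptor of a no-argument method with the given return type.
    static String descriptor(Class<?> returnType) {
        return MethodType.methodType(returnType).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // Descriptor recorded in the caller's bytecode at compile time.
        String compiledAgainst = descriptor(DatanodeInfo[].class);
        // Descriptor actually declared by the newer library at run time.
        String presentAtRuntime = descriptor(NarrowerInfo[].class);

        System.out.println("caller expects : getLocations" + compiledAgainst);
        System.out.println("library offers : getLocations" + presentAtRuntime);
        // The JVM resolves by name plus descriptor, so a mismatch here is what
        // surfaces as NoSuchMethodError when the call site is first linked.
        System.out.println("descriptors match: " + compiledAgainst.equals(presentAtRuntime));
    }
}
```

Recompiling the caller against the newer library makes the compiler record the
new descriptor, which would explain why a rebuild can make the error go away
without any change to the calling source line.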
A workaround we found is to rebuild HBase with the patch in the branch below.
This appears to resolve the issue, at least for now:
https://github.com/aplio/hbase/tree/monkeypatch/fix-serverClashProcedure-caused-by-hbase-3-dataNodeInfo-change
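For anyone trying to confirm whether their own deployment is affected, a small
reflective check like the following may help. It is only a sketch: the class
name GetLocationsReturnTypeCheck and the helper returnTypeOf are made up for
this example, and it assumes the Hadoop HDFS client jar used by the region
server is on the classpath when it is run. Reflection looks a method up by name
and parameter types only, so it finds getLocations() regardless of which return
type the jar declares.

```java
import java.lang.reflect.Method;
import java.util.Optional;

public class GetLocationsReturnTypeCheck {
    // Returns the declared return type of a public no-argument method,
    // or empty if the class or method is not available on the classpath.
    static Optional<String> returnTypeOf(String className, String methodName) {
        try {
            ClassLoader loader = GetLocationsReturnTypeCheck.class.getClassLoader();
            // 'false' avoids running static initializers of the inspected class.
            Class<?> cls = Class.forName(className, false, loader);
            Method m = cls.getMethod(methodName);
            return Optional.of(m.getReturnType().getTypeName());
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String cls = "org.apache.hadoop.hdfs.protocol.LocatedBlock";
        System.out.println(
            returnTypeOf(cls, "getLocations")
                .orElse("LocatedBlock.getLocations() not found on classpath"));
    }
}
```

If the printed type is anything other than
org.apache.hadoop.hdfs.protocol.DatanodeInfo[], an HBase binary whose stack
trace matches the one above would be expected to hit the same error.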
--
This message was sent by Atlassian Jira
(v8.20.10#820010)