[
https://issues.apache.org/jira/browse/HBASE-28053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
aplio updated HBASE-28053:
--------------------------
Description:
HBase Cluster Issue with Server Crash Procedure After Region Server Goes Down
We are running an HBase cluster with version 2.5.5 (HBase jar sourced from the
[HBase download page|https://hbase.apache.org/downloads.html] under
hadoop3-bin) paired with Hadoop version 3.3.2. When a region server went down
and a ServerCrashProcedure was initiated, we encountered an exception that
prevented the cluster from recovering.
Below is a snippet of the exception:
{code:java}
2023-08-28 21:02:52,163 INFO
[RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter
(WALSplitter.java:splitWAL(300)) - Splitting
hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056,
size=15.7 K (16082bytes)
2023-08-28 21:02:52,163 INFO
[RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils
(RecoverLeaseFSUtils.java:recoverDFSFileLease(86)) - Recover lease on dfs file
hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056
2023-08-28 21:02:52,164 INFO
[RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils
(RecoverLeaseFSUtils.java:recoverLease(175)) - Recovered lease, attempt=0 on
file=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056
after 0ms
2023-08-28 21:02:52,167 INFO
[RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter
(WALSplitter.java:splitWAL(423)) - Processed 0 edits across 0 Regions in 4 ms;
skipped=0;
WAL=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056,
size=15.7 K, length=16082, corrupted=false, cancelled=false
2023-08-28 21:02:52,167 ERROR
[RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1]
handler.RSProcedureHandler (RSProcedureHandler.java:process(53)) - pid=5848252
java.lang.NoSuchMethodError: 'org.apache.hadoop.hdfs.protocol.DatanodeInfo[]
org.apache.hadoop.hdfs.protocol.LocatedBlock.getLocations()'
at
org.apache.hadoop.hbase.fs.HFileSystem$ReorderWALBlocks.reorderBlocks(HFileSystem.java:428)
at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:367){code}
Upon investigation, this appears to be a consequence of a change introduced in
Hadoop 3.3.1 by HDFS-15255: the getLocations method of LocatedBlock now returns
a DatanodeInfoWithStorage[] instead of a DatanodeInfo[]. HBase 2.5.5 was
compiled against the old DatanodeInfo[] signature, which it still references at
HFileSystem.java:428, leading to the NoSuchMethodError above. The relevant
HBase code is
[here|https://github.com/apache/hbase/blob/7ebd4381261fefd78fc2acf258a95184f4147cee/hbase-server/src/main/java/org/apache/hadoop/hbase/fs/HFileSystem.java#L428].
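The failure mode is binary (not source) incompatibility: the JVM resolves a
call by its full method descriptor, which includes the return type. A minimal,
self-contained sketch of the pattern (the classes below are stand-ins that
merely mimic the HDFS type names, not the real classes):

```java
import java.lang.reflect.Method;

public class CovariantReturnDemo {
    // Stand-ins for the HDFS types; not the real classes.
    static class DatanodeInfo {}
    static class DatanodeInfoWithStorage extends DatanodeInfo {}

    static class LocatedBlock {
        // Hadoop 3.3.1+ style: return type narrowed to the subclass array.
        DatanodeInfoWithStorage[] getLocations() {
            return new DatanodeInfoWithStorage[2];
        }
    }

    public static void main(String[] args) throws Exception {
        // Source-level callers are unaffected: the subclass array still
        // assigns to a DatanodeInfo[] variable, so recompiling HBase against
        // the new Hadoop resolves the call.
        DatanodeInfo[] locs = new LocatedBlock().getLocations();
        System.out.println(locs.length);

        // But the bytecode descriptor now names the subclass array. A caller
        // compiled against the old "DatanodeInfo[] getLocations()" descriptor
        // finds no matching method at link time, hence NoSuchMethodError.
        Method m = LocatedBlock.class.getDeclaredMethod("getLocations");
        System.out.println(m.getReturnType().getSimpleName());
    }
}
```

This is why the same HBase source builds cleanly against either Hadoop line,
while a binary built against pre-3.3.1 Hadoop breaks at runtime on 3.3.1+.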
A potential solution we identified is to rebuild HBase with the patch available
at the repository below; this appears to rectify the issue, at least for now.
[https://github.com/aplio/hbase/tree/monkeypatch/fix-serverClashProcedure-caused-by-hbase-3-dataNodeInfo-change]
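In general, the rebuild route amounts to compiling HBase against the Hadoop
version actually deployed, so the new getLocations() descriptor is baked into
HBase's bytecode. A sketch of such a build, assuming the hadoop.profile and
hadoop-three.version properties used by the HBase 2.x pom (verify against your
checkout):

```shell
# Build HBase 2.5.x against the deployed Hadoop 3 release (here 3.3.2).
# -Dhadoop.profile=3.0 selects the hadoop3 profile on branch-2;
# -Dhadoop-three.version overrides the Hadoop version compiled against.
mvn clean package assembly:single -DskipTests \
    -Dhadoop.profile=3.0 -Dhadoop-three.version=3.3.2
```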
The following issue helped us investigate and fix the problem:
https://issues.apache.org/jira/browse/HBASE-26198
> ServerCrashProcedure seems to fail when using Hadoop3.3.1+
> ----------------------------------------------------------
>
> Key: HBASE-28053
> URL: https://issues.apache.org/jira/browse/HBASE-28053
> Project: HBase
> Issue Type: Bug
> Components: hadoop3, wal
> Reporter: aplio
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)