[ 
https://issues.apache.org/jira/browse/HBASE-28053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

aplio updated HBASE-28053:
--------------------------
    Description: 
HBase Cluster Issue with Server Crash Procedure After Region Server Goes Down

We are running an HBase 2.5.5 cluster (HBase binaries taken from the 
[HBase download page|https://hbase.apache.org/downloads.html] under 
hadoop3-bin) paired with Hadoop 3.3.2. When a region server went down and a 
ServerCrashProcedure was initiated for it, we encountered an exception. This 
exception prevented our cluster from recovering.

Below is a snippet of the exception:
{code:java}
2023-08-28 21:02:52,163 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter (WALSplitter.java:splitWAL(300)) - Splitting hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056, size=15.7 K (16082bytes)
2023-08-28 21:02:52,163 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils (RecoverLeaseFSUtils.java:recoverDFSFileLease(86)) - Recover lease on dfs file hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056
2023-08-28 21:02:52,164 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils (RecoverLeaseFSUtils.java:recoverLease(175)) - Recovered lease, attempt=0 on file=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056 after 0ms
2023-08-28 21:02:52,167 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter (WALSplitter.java:splitWAL(423)) - Processed 0 edits across 0 Regions in 4 ms; skipped=0; WAL=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056, size=15.7 K, length=16082, corrupted=false, cancelled=false
2023-08-28 21:02:52,167 ERROR [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] handler.RSProcedureHandler (RSProcedureHandler.java:process(53)) - pid=5848252
java.lang.NoSuchMethodError: 'org.apache.hadoop.hdfs.protocol.DatanodeInfo[] org.apache.hadoop.hdfs.protocol.LocatedBlock.getLocations()'
at org.apache.hadoop.hbase.fs.HFileSystem$ReorderWALBlocks.reorderBlocks(HFileSystem.java:428)
at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:367)
{code}
Upon investigation, this seems to be a consequence of the change introduced in 
Hadoop 3.3.1 under HDFS-15255: the return type of LocatedBlock.getLocations() 
was changed from DatanodeInfo[] to DatanodeInfoWithStorage[]. As the error 
message shows, the HBase 2.5.5 binaries still link against a getLocations() 
that returns DatanodeInfo[] at HFileSystem.java:428. A method with that exact 
signature no longer exists at runtime, which leads to the NoSuchMethodError 
above. The relevant HBase code is 
[here|https://github.com/apache/hbase/blob/7ebd4381261fefd78fc2acf258a95184f4147cee/hbase-server/src/main/java/org/apache/hadoop/hbase/fs/HFileSystem.java#L428].
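
To illustrate why a return-type change alone can produce a NoSuchMethodError, 
here is a minimal, self-contained sketch. The classes Info, InfoWithStorage and 
Block are hypothetical stand-ins for the old return type, the new return type 
and LocatedBlock; they are not the real Hadoop classes. The sketch assumes only 
that the JVM resolves a method by its name plus its exact descriptor, return 
type included, and uses a MethodHandles lookup to mimic that resolution.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class ReturnTypeLinkageDemo {
    // Hypothetical stand-ins for the old and new element types (not Hadoop classes).
    static class Info {}
    static class InfoWithStorage extends Info {}

    // Stand-in for the newer LocatedBlock: the getter now returns the subclass array.
    static class Block {
        public InfoWithStorage[] getLocations() {
            return new InfoWithStorage[0];
        }
    }

    // True if Block declares getLocations() with exactly this return type.
    // findVirtual matches on name and full method type, like bytecode linkage does.
    static boolean resolves(Class<?> returnType) {
        try {
            MethodHandles.lookup().findVirtual(
                Block.class, "getLocations", MethodType.methodType(returnType));
            return true;
        } catch (NoSuchMethodException | IllegalAccessException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Callers compiled against the old signature look for Info[] getLocations().
        System.out.println("Info[] getLocations() resolves: " + resolves(Info[].class));
        // Callers compiled against the new signature look for InfoWithStorage[] getLocations().
        System.out.println("InfoWithStorage[] getLocations() resolves: "
            + resolves(InfoWithStorage[].class));
    }
}
```

In this sketch the first lookup fails even though InfoWithStorage[] is 
assignable to Info[] in source code. That is consistent with recompiling the 
caller against the newer library being enough to make the error go away, while 
already-compiled binaries keep failing.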

A workaround we identified is to rebuild HBase with the patch available in the 
following branch. This appears to resolve the issue, at least for now.
[https://github.com/aplio/hbase/tree/monkeypatch/fix-serverClashProcedure-caused-by-hbase-3-dataNodeInfo-change]

 

The following issue helped us investigate and fix the problem:

https://issues.apache.org/jira/browse/HBASE-26198

 

If this bug is confirmed (that is, if my observations are accurate), I'd like 
to submit a PR to the HBase documentation stating that Hadoop 3.3.1 and later 
versions are not compatible with HBase (at least version 2.5.5).

> ServerCrashProcedure seems to fail when using Hadoop3.3.1+
> ----------------------------------------------------------
>
>                 Key: HBASE-28053
>                 URL: https://issues.apache.org/jira/browse/HBASE-28053
>             Project: HBase
>          Issue Type: Bug
>          Components: hadoop3, wal
>            Reporter: aplio
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
