Wei-Chiu Chuang created HDDS-11558:
--------------------------------------
Summary: HBase RegionServer crashes due to inconsistency caused by
Ozone client failover handling
Key: HDDS-11558
URL: https://issues.apache.org/jira/browse/HDDS-11558
Project: Apache Ozone
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Wei-Chiu Chuang
We found that HBase RegionServers crash after running for a few days. I was
able to reproduce the problem and confirm that Ozone client failover handling
can report a result that is inconsistent with the OM state, which is what
crashes the HBase RegionServer.
The RS crashes because a rename failed with an exception on the client even
though it actually succeeded on the OM side. The RS retried, but because the
rename had already happened, the retry returned -1. The RS does not expect
this result, so it crashed.
RS code
[https://github.com/apache/hbase/blob/52e9c0fb9c4fc0fdd42801359171356d77c74a90/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionFileSystem.java#L1093]
{code:java}
boolean rename(Path srcpath, Path dstPath) throws IOException {
  IOException lastIOE = null;
  int i = 0;
  do {
    try {
      return fs.rename(srcpath, dstPath);
    } catch (IOException ioe) {
      lastIOE = ioe;
      if (!fs.exists(srcpath) && fs.exists(dstPath)) return true; // successful move
      // dir is not there, retry after some time.
      try {
        sleepBeforeRetry("Rename Directory", i + 1);
      } catch (InterruptedException e) {
        throw (InterruptedIOException) new InterruptedIOException().initCause(e);
      }
    }
  } while (++i <= hdfsClientRetriesNumber);
  throw new IOException("Exception in rename", lastIOE);
}
{code}
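Note that the end-state check in the RS code above only runs in the catch
branch. The following is a minimal, self-contained sketch of why that may not
help here. It is a simulation, not Ozone or HBase code: MiniFs, FailoverFs and
renameCheckingEndState are hypothetical names, and the model assumes that the
client's internal failover retry reports failure as a plain false return value
rather than as an IOException.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RenameFailoverSketch {

    /** Hypothetical stand-in for the two FileSystem calls used by the RS code. */
    interface MiniFs {
        boolean rename(String src, String dst) throws IOException;
        boolean exists(String path);
    }

    /**
     * In-memory model of the scenario (an assumption, not Ozone behavior taken
     * from its source): attempt 1 is applied on the server but its response is
     * lost, then the client's internal retry finds the source gone and the
     * call returns false.
     */
    static final class FailoverFs implements MiniFs {
        private final Set<String> keys = new HashSet<>();

        FailoverFs(String... initialKeys) {
            keys.addAll(Arrays.asList(initialKeys));
        }

        @Override
        public boolean rename(String src, String dst) {
            // Attempt 1: applied on the server, result never reaches the caller.
            applyOnServer(src, dst);
            // Attempt 2: internal retry after failover; the source is already gone.
            return applyOnServer(src, dst);
        }

        private boolean applyOnServer(String src, String dst) {
            if (!keys.remove(src)) {
                return false;
            }
            keys.add(dst);
            return true;
        }

        @Override
        public boolean exists(String path) {
            return keys.contains(path);
        }
    }

    /** Applies the RS-style end-state check to a false return as well. */
    static boolean renameCheckingEndState(MiniFs fs, String src, String dst)
            throws IOException {
        if (fs.rename(src, dst)) {
            return true;
        }
        // Same condition the RS code uses in its catch branch.
        return !fs.exists(src) && fs.exists(dst);
    }

    public static void main(String[] args) throws IOException {
        String src = "test1/buck1/src";
        String dst = "test1/buck1/dst";

        FailoverFs plain = new FailoverFs(src);
        System.out.println("plain rename returned " + plain.rename(src, dst));

        FailoverFs checked = new FailoverFs(src);
        System.out.println("checked rename returned "
            + renameCheckingEndState(checked, src, dst));
    }
}
```

In this model the plain call reports failure although the key was moved, while
the variant that inspects the end state reports success. The end-state check
cannot tell whether the move was done by this caller or by someone else, so it
is only a sketch of a caller-side mitigation, not a fix for the failover
handling itself.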
Reproduction steps:
1. Suppose OM1 is the leader and OM2 and OM3 are followers.
2. Pause followers OM2 and OM3.
3. Issue the rename command:
hdfs dfs -touchz ofs://ozone1728456768/test1/buck1/src
hdfs dfs -mv ofs://ozone1728456768/test1/buck1/src ofs://ozone1728456768/test1/buck1/dst
4. Pause leader OM1.
5. Wait 5 seconds.
6. Resume followers OM2 and OM3.
7. Wait 5 seconds.
8. Resume leader OM1.
The mv command then fails on the client:
{noformat}
24/10/09 23:30:32 INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException): OM:om1546335780 is not the leader. Could not determine the leader node.
    at org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException.convertToOMNotLeaderException(OMNotLeaderException.java:93)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponseImpl(OzoneManagerRatisServer.java:497)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$2(OzoneManagerRatisServer.java:287)
    at org.apache.hadoop.ozone.util.MetricUtil.captureLatencyNs(MetricUtil.java:46)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponse(OzoneManagerRatisServer.java:285)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:265)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:254)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:228)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:162)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:153)
    at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:995)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:923)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1910)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2901)
, while invoking $Proxy11.submitRequest over nodeId=om1,nodeAddress=ccycloud-1.nightly7310-hi.root.comops.site:9862. Trying to failover immediately.
24/10/09 23:30:32 ERROR ozone.BasicRootedOzoneFileSystem: rename key failed: Unable to get file status: volume: test1 bucket: buck1 key: src. source:test1/buck1/src, destin:test1/buck1/dst
mv: `ofs://ozone1728456768/test1/buck1/src': Input/output error
{noformat}
Despite the client-side error, the rename succeeded on the OM:
hdfs dfs -ls ofs://ozone1728456768/test1/buck1/src
ls: `ofs://ozone1728456768/test1/buck1/src': No such file or directory
hdfs dfs -ls ofs://ozone1728456768/test1/buck1/dst
-rw-rw-rw- 3 hive hive 0 2024-10-09 23:29 ofs://ozone1728456768/test1/buck1/dst