[
https://issues.apache.org/jira/browse/HDDS-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siyao Meng updated HDDS-10430:
------------------------------
Description:
Found this potential bug in an HDDS-7593-derived branch (though it looks more general)
on [~weichiu]'s cluster.
{code}
24/02/26 20:39:13 INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException:
org.apache.hadoop.ipc.RemoteException(java.lang.IndexOutOfBoundsException):
Index: 2, Size: 3
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at org.apache.hadoop.hdds.scm.pipeline.Pipeline.getProtobufMessage(Pipeline.java:414)
at org.apache.hadoop.ozone.om.helpers.OmKeyLocationInfo.getProtobuf(OmKeyLocationInfo.java:123)
at org.apache.hadoop.ozone.om.helpers.OmKeyLocationInfoGroup.getProtobuf(OmKeyLocationInfoGroup.java:135)
at org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:652)
at org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:634)
at org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:602)
at org.apache.hadoop.ozone.om.helpers.KeyInfoWithVolumeContext.toProtobuf(KeyInfoWithVolumeContext.java:63)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getKeyInfo(OzoneManagerRequestHandler.java:653)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:337)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:250)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:196)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:156)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
{code}
thrown from {{nodes.get(i)}} in snippet:
https://github.com/apache/ozone/blob/e1d123149ecd7d6817f09a43707570513a71fbda/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/pipeline/Pipeline.java#L414
{code:java|title=Pipeline#getProtobufMessage}
// To save the message size on wire, only transfer the node order based on
// network topology
List<DatanodeDetails> nodes = nodesInOrder;
if (!nodes.isEmpty()) {
  for (int i = 0; i < nodes.size(); i++) {
    Iterator<DatanodeDetails> it = nodeStatus.keySet().iterator();
    for (int j = 0; j < nodeStatus.keySet().size(); j++) {
      if (it.next().equals(nodes.get(i))) {
        builder.addMemberOrders(j);
        break;
      }
    }
  }
}
{code}
*Suspected cause:*
{{nodes}} is only an alias for the {{nodesInOrder}} field, not a copy, so the list can
change between the {{i < nodes.size()}} check and {{nodes.get(i)}} because nothing
synchronizes access to it. Most likely at least one other thread is modifying the list
at the same time through {{setNodesInOrder()}} (cc [~sumitagrawl]):
{code}
public void setNodesInOrder(List<DatanodeDetails> nodes) {
  nodesInOrder.clear();
  if (null == nodes) {
    return;
  }
  nodesInOrder.addAll(nodes);
}
{code}
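To make the suspected interleaving concrete, here is a minimal single-threaded replay of it. This is only a sketch: {{NodesInOrderRaceSketch}} is a hypothetical stand-in for {{Pipeline}}, {{String}} replaces {{DatanodeDetails}}, and the "writer" step is inlined at the point where another thread would have to run. If the writer's {{addAll()}} also completes before the exception message is built, that might explain the odd-looking "Index: 2, Size: 3", since Java 8's {{ArrayList}} appears to read {{size}} again when formatting the message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Pipeline; only the nodesInOrder handling is modelled.
final class NodesInOrderRaceSketch {
  private final List<String> nodesInOrder =
      new ArrayList<>(Arrays.asList("dn1", "dn2", "dn3"));

  // Replays one possible interleaving on a single thread:
  // reader passes the bound check, writer clears the list, reader calls get(i).
  static boolean replayInterleaving() {
    NodesInOrderRaceSketch p = new NodesInOrderRaceSketch();
    List<String> nodes = p.nodesInOrder; // alias of the shared list, not a copy
    int i = 2;
    if (i < nodes.size()) {              // reader: loop condition holds (2 < 3)
      p.nodesInOrder.clear();            // writer: first statement of setNodesInOrder()
      try {
        nodes.get(i);                    // reader: the list is now empty
      } catch (IndexOutOfBoundsException e) {
        return true;                     // same exception type as in the stack trace
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println("IndexOutOfBoundsException reproduced: " + replayInterleaving());
  }
}
```

In a real run the two steps would come from different RPC handler threads, so the failure would be intermittent rather than deterministic as it is here.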
{{setNodesInOrder()}} is called from at least
[{{sortDatanodes()}}|https://github.com/apache/ozone/blob/c1d7b433d80baf21f32f48474db49406814dd6d2/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java#L1867]
and
[{{allocateBlock()}}|https://github.com/apache/ozone/blob/838cc2691b0652ade8de13c0c6156d4ca1b64751/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMBlockProtocolServer.java#L213].
The former is in turn called from many read paths, including {{lookupKey()}},
{{getOzoneFileStatus()}}, {{listStatus()}}, {{getKeyInfo()}}, etc.
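Since several read paths can reach {{setNodesInOrder()}} concurrently, one possible direction is to stop mutating the shared list in place and publish an immutable snapshot instead. The sketch below only illustrates that idea and is not the fix adopted in Ozone: {{SnapshotNodesSketch}}, {{memberOrders()}} and the use of {{String}} in place of {{DatanodeDetails}} are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for Pipeline; String replaces DatanodeDetails.
final class SnapshotNodesSketch {
  // Writers swap the reference; the list behind it is never mutated.
  private volatile List<String> nodesInOrder = Collections.emptyList();

  void setNodesInOrder(List<String> nodes) {
    nodesInOrder = (nodes == null)
        ? Collections.<String>emptyList()
        : Collections.unmodifiableList(new ArrayList<>(nodes));
  }

  // Reader: take one snapshot of the field and iterate only that snapshot,
  // so a concurrent setNodesInOrder() cannot shrink the list mid-loop.
  List<Integer> memberOrders(List<String> members) {
    List<String> nodes = nodesInOrder; // single volatile read
    List<Integer> orders = new ArrayList<>();
    for (String node : nodes) {
      int j = members.indexOf(node);   // position of the node among the members
      if (j >= 0) {
        orders.add(j);
      }
    }
    return orders;
  }

  public static void main(String[] args) {
    SnapshotNodesSketch p = new SnapshotNodesSketch();
    p.setNodesInOrder(Arrays.asList("dn2", "dn3", "dn1"));
    System.out.println(p.memberOrders(Arrays.asList("dn1", "dn2", "dn3")));
  }
}
```

The trade-off is one extra list copy per {{setNodesInOrder()}} call in exchange for lock-free readers; synchronizing both methods would be another option if the copy turns out to be too costly.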
[~weichiu] please add more relevant log details here.
was:
Found this potential bug in a HDDS-7593 derived branch (but looks more general)
on [~weichiu]'s cluster.
{code}
24/02/26 20:39:13 INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException:
org.apache.hadoop.ipc.RemoteException(java.lang.IndexOutOfBoundsException):
Index: 2, Size: 3
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at org.apache.hadoop.hdds.scm.pipeline.Pipeline.getProtobufMessage(Pipeline.java:414)
...
{code}
thrown from {{nodes.get(i)}} in snippet:
https://github.com/apache/ozone/blob/e1d123149ecd7d6817f09a43707570513a71fbda/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/pipeline/Pipeline.java#L414
{code:java|title=Pipeline#getProtobufMessage}
// To save the message size on wire, only transfer the node order based on
// network topology
List<DatanodeDetails> nodes = nodesInOrder;
if (!nodes.isEmpty()) {
for (int i = 0; i < nodes.size(); i++) {
Iterator<DatanodeDetails> it = nodeStatus.keySet().iterator();
for (int j = 0; j < nodeStatus.keySet().size(); j++) {
if (it.next().equals(nodes.get(i))) {
builder.addMemberOrders(j);
break;
}
}
}
{code}
*Suspected cause:*
{{nodes}} is changed during {{nodes.get(i)}} due to lack of synchronization.
This is most likely caused by (at least) one other thread modifying {{nodes}}
array at the same time using {{setNodesInOrder()}}: cc [~sumitagrawl]
{code}
public void setNodesInOrder(List<DatanodeDetails> nodes) {
nodesInOrder.clear();
if (null == nodes) {
return;
}
nodesInOrder.addAll(nodes);
}
{code}
{{setNodesInOrder()}} is at least called from
[{{sortDatanodes()}}|https://github.com/apache/ozone/blob/c1d7b433d80baf21f32f48474db49406814dd6d2/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java#L1867]
or
[{{allocateBlock()}}|https://github.com/apache/ozone/blob/838cc2691b0652ade8de13c0c6156d4ca1b64751/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMBlockProtocolServer.java#L213].
And the former is further widely called from {{lookupKey()}},
{{getOzoneFileStatus()}}, {{listStatus()}}, {{getKeyInfo()}}, etc.
[~weichiu] pls help add more relevant log details here.
> Potential data race with nodes array, causing race condition in
> Pipeline.getProtobufMessage
> -------------------------------------------------------------------------------------------
>
> Key: HDDS-10430
> URL: https://issues.apache.org/jira/browse/HDDS-10430
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Siyao Meng
> Priority: Major