[ 
https://issues.apache.org/jira/browse/HDDS-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821386#comment-17821386
 ] 

Wei-Chiu Chuang commented on HDDS-10430:
----------------------------------------

For some reason this was triggered in one of my clusters, and then I couldn't 
get rid of it. After that, every request to the OM threw this exception. It 
seems like an existing problem, but somehow it was very easy to reproduce in 
that cluster.


> Potential data race with nodes array causing race condition in 
> Pipeline.getProtobufMessage
> ------------------------------------------------------------------------------------------
>
>                 Key: HDDS-10430
>                 URL: https://issues.apache.org/jira/browse/HDDS-10430
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Siyao Meng
>            Priority: Major
>              Labels: pull-request-available
>
> Found this potential bug in a HDDS-7593 derived branch (but looks more 
> general) on [~weichiu]'s cluster.
> {code}
> 24/02/26 20:39:13 INFO retry.RetryInvocationHandler: 
> com.google.protobuf.ServiceException: 
> org.apache.hadoop.ipc.RemoteException(java.lang.IndexOutOfBoundsException): 
> Index: 2, Size: 3
>         at java.util.ArrayList.rangeCheck(ArrayList.java:657)
>         at java.util.ArrayList.get(ArrayList.java:433)
>         at 
> org.apache.hadoop.hdds.scm.pipeline.Pipeline.getProtobufMessage(Pipeline.java:414)
>         at 
> org.apache.hadoop.ozone.om.helpers.OmKeyLocationInfo.getProtobuf(OmKeyLocationInfo.java:123)
>         at 
> org.apache.hadoop.ozone.om.helpers.OmKeyLocationInfoGroup.getProtobuf(OmKeyLocationInfoGroup.java:135)
>         at 
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:652)
>         at 
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:634)
>         at 
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:602)
>         at 
> org.apache.hadoop.ozone.om.helpers.KeyInfoWithVolumeContext.toProtobuf(KeyInfoWithVolumeContext.java:63)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getKeyInfo(OzoneManagerRequestHandler.java:653)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:337)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:250)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:196)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:156)
>         at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
>         at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> {code}
> thrown from {{nodes.get(i)}} in snippet:
> https://github.com/apache/ozone/blob/e1d123149ecd7d6817f09a43707570513a71fbda/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/pipeline/Pipeline.java#L414
> {code:java|title=Pipeline#getProtobufMessage}
>     // To save the message size on wire, only transfer the node order based on
>     // network topology
>     List<DatanodeDetails> nodes = nodesInOrder;
>     if (!nodes.isEmpty()) {
>       for (int i = 0; i < nodes.size(); i++) {
>         Iterator<DatanodeDetails> it = nodeStatus.keySet().iterator();
>         for (int j = 0; j < nodeStatus.keySet().size(); j++) {
>           if (it.next().equals(nodes.get(i))) {
>             builder.addMemberOrders(j);
>             break;
>           }
>         }
>       }
> {code}
> *Suspected cause:*
> {{nodes}} (an alias of {{nodesInOrder}}) is modified between the 
> {{nodes.size()}} check and {{nodes.get(i)}} because access to it is not 
> synchronized. This is most likely caused by (at least) one other thread 
> modifying the {{nodesInOrder}} list at the same time through 
> {{setNodesInOrder()}}: cc [~sumitagrawl]
> {code}
>   public void setNodesInOrder(List<DatanodeDetails> nodes) {
>     nodesInOrder.clear();
>     if (null == nodes) {
>       return;
>     }
>     nodesInOrder.addAll(nodes);
>   }
> {code}
> {{setNodesInOrder()}} is called from at least 
> [{{sortDatanodes()}}|https://github.com/apache/ozone/blob/c1d7b433d80baf21f32f48474db49406814dd6d2/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java#L1867]
>  and 
> [{{allocateBlock()}}|https://github.com/apache/ozone/blob/838cc2691b0652ade8de13c0c6156d4ca1b64751/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMBlockProtocolServer.java#L213].
>  The former is in turn called from many places, such as {{lookupKey()}}, 
> {{getOzoneFileStatus()}}, {{listStatus()}}, and {{getKeyInfo()}}.
> [~weichiu] pls help add more relevant log details here.
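
The suspected interleaving described above can be replayed deterministically, and one possible way to avoid it is to publish an immutable snapshot instead of mutating the list in place. The sketch below is only an illustration under stated assumptions: {{NodesInOrderSketch}}, {{SnapshotHolder}}, {{replayUnsafeInterleaving()}} and {{stressSnapshotHolder()}} are hypothetical names, {{String}} stands in for {{DatanodeDetails}}, and this is not necessarily how the issue was or should be fixed in Ozone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; String stands in for DatanodeDetails.
final class NodesInOrderSketch {

  /**
   * Replays the suspected interleaving on a single thread: the writer's
   * clear() lands between the reader's size check and its get(i).
   * Returns true if the loop condition held and get(i) still threw.
   */
  static boolean replayUnsafeInterleaving() {
    List<String> nodesInOrder =
        new ArrayList<>(Arrays.asList("dn1", "dn2", "dn3"));
    int i = 2;
    boolean loopConditionHeld = i < nodesInOrder.size(); // reader: size check
    nodesInOrder.clear();               // writer: setNodesInOrder() begins
    try {
      nodesInOrder.get(i);              // reader: nodes.get(i)
      return false;
    } catch (IndexOutOfBoundsException e) {
      return loopConditionHeld;
    }
  }

  /** Assumed alternative: swap in an immutable copy, never mutate in place. */
  static final class SnapshotHolder {
    private volatile List<String> nodesInOrder = Collections.emptyList();

    void setNodesInOrder(List<String> nodes) {
      nodesInOrder = (nodes == null)
          ? Collections.<String>emptyList()
          : Collections.unmodifiableList(new ArrayList<>(nodes));
    }

    /** Reads the reference once, so size() and get(i) see the same list. */
    int countNodes() {
      List<String> nodes = nodesInOrder;
      int seen = 0;
      for (int i = 0; i < nodes.size(); i++) {
        if (nodes.get(i) != null) {
          seen++;
        }
      }
      return seen;
    }
  }

  /** Runs one writer and one reader concurrently; returns reader failures. */
  static int stressSnapshotHolder(int iterations) {
    SnapshotHolder holder = new SnapshotHolder();
    List<String> three = Arrays.asList("dn1", "dn2", "dn3");
    AtomicInteger failures = new AtomicInteger();
    Thread writer = new Thread(() -> {
      for (int n = 0; n < iterations; n++) {
        holder.setNodesInOrder(three);
      }
    });
    Thread reader = new Thread(() -> {
      for (int n = 0; n < iterations; n++) {
        try {
          int count = holder.countNodes();
          if (count != 0 && count != 3) {
            failures.incrementAndGet(); // saw a half-updated list
          }
        } catch (RuntimeException e) {
          failures.incrementAndGet();
        }
      }
    });
    writer.start();
    reader.start();
    try {
      writer.join();
      reader.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return failures.get();
  }

  public static void main(String[] args) {
    System.out.println("unsafe interleaving throws: "
        + replayUnsafeInterleaving());
    System.out.println("snapshot reader failures: "
        + stressSnapshotHolder(100_000));
  }
}
```

The "Index: 2, Size: 3" message in the trace would be consistent with this interleaving: in Java 8, {{ArrayList.rangeCheck}} builds the message by reading the size again after the check has already failed, so a concurrent {{addAll()}} can make the reported size larger than the index. The snapshot approach trades an extra list copy per {{setNodesInOrder()}} call for lock-free reads; synchronizing both methods on the same lock would be another option.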



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
