[ 
https://issues.apache.org/jira/browse/HDDS-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDDS-10430:
------------------------------
    Description: 
Found this potential bug in a branch derived from HDDS-7593 (though it looks 
more general) on [~weichiu]'s cluster.

{code}
24/02/26 20:39:13 INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(java.lang.IndexOutOfBoundsException): Index: 2, Size: 3
        at java.util.ArrayList.rangeCheck(ArrayList.java:657)
        at java.util.ArrayList.get(ArrayList.java:433)
        at org.apache.hadoop.hdds.scm.pipeline.Pipeline.getProtobufMessage(Pipeline.java:414)
...
{code}

The exception is thrown from {{nodes.get(i)}} in this snippet:

https://github.com/apache/ozone/blob/e1d123149ecd7d6817f09a43707570513a71fbda/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/pipeline/Pipeline.java#L414

{code:java|title=Pipeline#getProtobufMessage}
    // To save the message size on wire, only transfer the node order based on
    // network topology
    List<DatanodeDetails> nodes = nodesInOrder;
    if (!nodes.isEmpty()) {
      for (int i = 0; i < nodes.size(); i++) {
        Iterator<DatanodeDetails> it = nodeStatus.keySet().iterator();
        for (int j = 0; j < nodeStatus.keySet().size(); j++) {
          if (it.next().equals(nodes.get(i))) {
            builder.addMemberOrders(j);
            break;
          }
        }
      }
{code}

*Suspected cause:*

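{{nodes}} is only an alias of {{nodesInOrder}} (no copy is made), so the list 
can change between the {{i < nodes.size()}} check and {{nodes.get(i)}} because 
access is not synchronized. The message {{Index: 2, Size: 3}} is consistent 
with this, since index 2 is valid for a stable list of size 3. Most likely at 
least one other thread is modifying the list at the same time through 
{{setNodesInOrder()}}, which clears it before refilling it (cc [~sumitagrawl]):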

{code}
  public void setNodesInOrder(List<DatanodeDetails> nodes) {
    nodesInOrder.clear();
    if (null == nodes) {
      return;
    }
    nodesInOrder.addAll(nodes);
  }
{code}
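
To make the suspected interleaving concrete, here is a minimal single-threaded replay of it. This is only a sketch: {{AliasRaceDemo}} and its method are hypothetical names, strings stand in for {{DatanodeDetails}}, and the "writer" step is simulated inline rather than run on a second thread, so the outcome is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical demo class; it only mimics the shape of the Pipeline code. */
final class AliasRaceDemo {
  // Stand-in for Pipeline#nodesInOrder (strings instead of DatanodeDetails).
  private final List<String> nodesInOrder =
      new ArrayList<>(Arrays.asList("dn1", "dn2", "dn3"));

  /**
   * Replays "reader passes i < size(), writer clears, reader calls get(i)".
   * Returns true if the read failed with IndexOutOfBoundsException.
   */
  boolean readFailsAfterConcurrentClear() {
    List<String> nodes = nodesInOrder;  // alias of the shared list, not a copy
    int i = 2;
    if (i < nodes.size()) {             // reader: loop condition holds, size is 3
      nodesInOrder.clear();             // writer: first statement of setNodesInOrder()
      try {
        nodes.get(i);                   // reader: index 2 on a list that is now empty
      } catch (IndexOutOfBoundsException e) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(new AliasRaceDemo().readFailsAfterConcurrentClear());
  }
}
```

In the real code the two steps would run on different threads, so the failure would only show up occasionally, depending on timing.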

{{setNodesInOrder()}} is called from at least 
[{{sortDatanodes()}}|https://github.com/apache/ozone/blob/c1d7b433d80baf21f32f48474db49406814dd6d2/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java#L1867]
 and 
[{{allocateBlock()}}|https://github.com/apache/ozone/blob/838cc2691b0652ade8de13c0c6156d4ca1b64751/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMBlockProtocolServer.java#L213].
 The former is in turn called from many places: {{lookupKey()}}, 
{{getOzoneFileStatus()}}, {{listStatus()}}, {{getKeyInfo()}}, etc.
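
Since the setter is reachable from so many call paths, one possible direction is to stop mutating the shared list in place and publish an immutable snapshot instead. The sketch below is an assumption for discussion, not the actual {{Pipeline}} code or a committed fix: {{NodeOrderHolder}} is a hypothetical class, and it keeps the existing "null clears the order" behavior of the setter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder illustrating snapshot publication; not the real Pipeline. */
final class NodeOrderHolder<T> {
  // The referenced list is never mutated after publication; only the
  // reference is swapped, and volatile makes the swap visible to readers.
  private volatile List<T> nodesInOrder = Collections.emptyList();

  void setNodesInOrder(List<T> nodes) {
    if (nodes == null) {
      nodesInOrder = Collections.emptyList();
      return;
    }
    // Copy privately first, then publish with a single volatile write.
    nodesInOrder = Collections.unmodifiableList(new ArrayList<T>(nodes));
  }

  /** Returns the current snapshot; it stays stable for the caller's whole loop. */
  List<T> getNodesInOrder() {
    return nodesInOrder;
  }
}
```

A reader that does {{List<T> nodes = holder.getNodesInOrder();}} once and then loops over {{nodes}} would see either the old order or the new one, never a half-cleared list. The cost is one list copy per {{setNodesInOrder()}} call.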

[~weichiu] please help add more relevant log details here.


> Potential data race with nodes array, causing race condition in 
> Pipeline.getProtobufMessage
> -------------------------------------------------------------------------------------------
>
>                 Key: HDDS-10430
>                 URL: https://issues.apache.org/jira/browse/HDDS-10430
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Siyao Meng
>            Priority: Major


