I was doing some testing around Network Aware Reads, where the client should try to read from the datanode closest to it. After adding a few debug log messages and running the "ozone-topology" docker-compose environment, I suspect this feature does not work correctly. Please see https://issues.apache.org/jira/browse/HDDS-3794 and tell me if I am wrong / missing something here.
Aside from the above issue, in order to use this feature you must set "ozone.network.topology.aware.read=true". I have a few observations / ideas about this area I would like to run past everyone.

1. My understanding is that to read a key, we call KeyManagerImpl.lookupKey(). If you want the datanodes sorted, then we must make an additional RPC call to SCM for each block in the key:

https://github.com/apache/hadoop-ozone/blob/6cbe8d080a74b18a54fb5e5d2993ef83abca6ead/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java#L667
https://github.com/apache/hadoop-ozone/blob/6cbe8d080a74b18a54fb5e5d2993ef83abca6ead/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java#L2178

To get the pipeline locations, we already have a batch call:

https://github.com/apache/hadoop-ozone/blob/6cbe8d080a74b18a54fb5e5d2993ef83abca6ead/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java#L711

I was wondering - should we also have a batch call to sort all the pipelines in a similar way? Or, possibly better, could we pass a flag to the getContainerWithPipelineBatch() call to have it sort the datanodes in each pipeline, rather than having to go back to SCM again to sort them? (I have appended a rough sketch of the API shape I have in mind below my signature.)

2. If we don't have topology.aware.read enabled, then the datanode list is shuffled, so the client always reads from a random one.

a) Would it make sense to have the client check whether its IP / hostname matches one of the datanodes and, if so, prefer to read from the local one? We don't need the full topology feature for that to work. (A rough sketch of this idea is also appended below.)

I also have a small doubt about this code in ContainerProtocolCalls.java:

{code}
  public static ContainerProtos.ReadChunkResponseProto readChunk(
      XceiverClientSpi xceiverClient, ChunkInfo chunk, BlockID blockID,
      List<CheckedBiFunction> validators) throws IOException {
    ReadChunkRequestProto.Builder readChunkRequest =
        ReadChunkRequestProto.newBuilder()
            .setBlockID(blockID.getDatanodeBlockIDProtobuf())
            .setChunkData(chunk);
    String id = xceiverClient.getPipeline().getClosestNode().getUuidString();
    ContainerCommandRequestProto.Builder builder =
        ContainerCommandRequestProto.newBuilder().setCmdType(Type.ReadChunk)
            .setContainerID(blockID.getContainerID())
            .setDatanodeUuid(id).setReadChunk(readChunkRequest);
    String encodedToken = getEncodedBlockToken(new Text(blockID.
        getContainerBlockID().toString()));
    if (encodedToken != null) {
      builder.setEncodedToken(encodedToken);
    }
    ContainerCommandRequestProto request = builder.build();
    ContainerCommandResponseProto reply =
        xceiverClient.sendCommand(request, validators);
    return reply.getReadChunk();
  }
{code}

Why do we need to call `.setDatanodeUuid(id)` with the result of `xceiverClient.getPipeline().getClosestNode().getUuidString()`? This is possibly not the node the call gets sent to, as the DN is picked in XceiverClientGrpc after shuffling the list. I tried removing the field, but then it complains about a missing mandatory field.

HDDS-1749 introduced the shuffle logic in XceiverClientGrpc. My suspicion is that the DN UUID set here is not important, as it does not appear to be used in the readChunk flow, and it is possible this call site was missed when HDDS-1749 was added. I don't think this code does any harm, but it may be misleading when debugging problems at some point. Maybe it is not needed in the call at all and we could remove it?

Thanks, Stephen.
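P.S. Purely to illustrate point 1, this is roughly the API shape I have in mind - a batch lookup that also sorts each pipeline for the caller. The interface name and the extra clientMachine parameter are hypothetical (today the batch call only takes the container IDs); this is just a sketch of the idea, not a proposal for the exact signature:

{code}
import java.io.IOException;
import java.util.List;

/**
 * Illustrative sketch only (not the current SCM protocol): a batch call that
 * returns container/pipeline info with the datanodes in each pipeline already
 * sorted by network distance to the supplied client address, so OM would not
 * need a separate per-block sort RPC back to SCM.
 */
public interface SortedPipelineBatchLookup {

  /** Stand-in for the real ContainerWithPipeline result type. */
  class ContainerWithPipeline { }

  List<ContainerWithPipeline> getContainerWithPipelineBatch(
      List<Long> containerIDs, String clientMachine) throws IOException;
}
{code}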
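P.P.S. And a minimal, self-contained sketch of the local-read preference from 2a, just to show the idea. It works on plain host strings here and only checks the single local address; in the real client it would look at the DatanodeDetails in the Pipeline, and the class / method names are made up:

{code}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Illustration only: order a list of datanode addresses so that a node
 * matching the local host (if any) is tried first, while the remaining
 * nodes keep the existing shuffled order.
 */
public final class LocalReadPreference {

  private LocalReadPreference() { }

  public static List<String> preferLocalNode(List<String> datanodeHosts) {
    List<String> ordered = new ArrayList<>(datanodeHosts);
    Collections.shuffle(ordered);

    String localAddress;
    try {
      // Simplified: multi-homed hosts would need to check all local addresses.
      localAddress = InetAddress.getLocalHost().getHostAddress();
    } catch (UnknownHostException e) {
      return ordered; // cannot determine local address, keep shuffled order
    }

    for (int i = 0; i < ordered.size(); i++) {
      if (resolvesTo(ordered.get(i), localAddress)) {
        // Move the local node to the front; the rest stay shuffled.
        ordered.add(0, ordered.remove(i));
        break;
      }
    }
    return ordered;
  }

  private static boolean resolvesTo(String host, String localAddress) {
    try {
      return InetAddress.getByName(host).getHostAddress().equals(localAddress);
    } catch (UnknownHostException e) {
      return false;
    }
  }
}
{code}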