I was doing some testing around Network Aware Reads, where the client should try to read from the datanode closest to it. After adding a few debug log messages and running the "ozone-topology" docker-compose environment, I suspect this feature does not work correctly. Please see https://issues.apache.org/jira/browse/HDDS-3794 and tell me if I am wrong / missing something here.
Aside from the above issue, in order to use this feature you must set "ozone.network.topology.aware.read=true". I have a few observations / ideas about this area I would like to run past everyone.

1. My understanding is that to read a key, we call KeyManagerImpl.lookupKey(). If you want the datanodes sorted, then we must make an additional RPC call to SCM for each block in the key:

https://github.com/apache/hadoop-ozone/blob/6cbe8d080a74b18a54fb5e5d2993ef83abca6ead/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java#L667
https://github.com/apache/hadoop-ozone/blob/6cbe8d080a74b18a54fb5e5d2993ef83abca6ead/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java#L2178

To get the pipeline locations, we already have a batch call:

https://github.com/apache/hadoop-ozone/blob/6cbe8d080a74b18a54fb5e5d2993ef83abca6ead/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java#L711

I was wondering - should we also have a batch call to sort all the pipelines in a similar way? Or, possibly better, could we pass a flag to the getContainerWithPipelineBatch() call to have it sort the datanodes in each pipeline, rather than having to go back to SCM again to sort them? (I have appended a rough sketch of the API shape I have in mind below my signature.)

2. If we don't have topology.aware.read enabled, then the datanode list is shuffled, so the client always reads from a random one.

a) Would it make sense to have the client check whether its IP / hostname matches one of the datanodes and, if so, prefer to read from the local one? We don't need the full topology feature for that to work. (A rough sketch of this idea is also appended below.)

I also have a small doubt about this code in ContainerProtocolCalls.java:

{code}
  public static ContainerProtos.ReadChunkResponseProto readChunk(
      XceiverClientSpi xceiverClient, ChunkInfo chunk, BlockID blockID,
      List<CheckedBiFunction> validators) throws IOException {
    ReadChunkRequestProto.Builder readChunkRequest =
        ReadChunkRequestProto.newBuilder()
            .setBlockID(blockID.getDatanodeBlockIDProtobuf())
            .setChunkData(chunk);
    String id = xceiverClient.getPipeline().getClosestNode().getUuidString();
    ContainerCommandRequestProto.Builder builder =
        ContainerCommandRequestProto.newBuilder().setCmdType(Type.ReadChunk)
            .setContainerID(blockID.getContainerID())
            .setDatanodeUuid(id).setReadChunk(readChunkRequest);
    String encodedToken = getEncodedBlockToken(new Text(blockID.
        getContainerBlockID().toString()));
    if (encodedToken != null) {
      builder.setEncodedToken(encodedToken);
    }
    ContainerCommandRequestProto request = builder.build();
    ContainerCommandResponseProto reply =
        xceiverClient.sendCommand(request, validators);
    return reply.getReadChunk();
  }
{code}

Why do we need to call `.setDatanodeUuid(id)` with the result of `xceiverClient.getPipeline().getClosestNode().getUuidString()`? This is possibly not the node the call gets sent to, as the DN is picked in XceiverClientGrpc after shuffling the list. I tried removing the field, but then it complains about a missing mandatory field.

HDDS-1749 introduced the shuffle logic in XceiverClientGrpc. My suspicion is that the DN UUID set here is not important, as it does not appear to be used in the readChunk flow, and it is possible this call site was missed when HDDS-1749 was added. I don't think this code does any harm, but it may be misleading when debugging problems at some point. Maybe it is not needed in the call at all and we could remove it?

Thanks, Stephen.
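P.S. Purely to illustrate point 1, this is roughly the API shape I have in mind - a batch lookup that also sorts each pipeline for the caller. The interface name and the extra clientMachine parameter are hypothetical (today the batch call only takes the container IDs); this is just a sketch of the idea, not a proposal for the exact signature:

{code}
import java.io.IOException;
import java.util.List;

/**
 * Illustrative sketch only (not the current SCM protocol): a batch call that
 * returns container/pipeline info with the datanodes in each pipeline already
 * sorted by network distance to the supplied client address, so OM would not
 * need a separate per-block sort RPC back to SCM.
 */
public interface SortedPipelineBatchLookup {

  /** Stand-in for the real ContainerWithPipeline result type. */
  class ContainerWithPipeline { }

  List<ContainerWithPipeline> getContainerWithPipelineBatch(
      List<Long> containerIDs, String clientMachine) throws IOException;
}
{code}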
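P.P.S. And a minimal, self-contained sketch of the local-read preference from 2a, just to show the idea. It works on plain host strings here and only checks the single local address; in the real client it would look at the DatanodeDetails in the Pipeline, and the class / method names are made up:

{code}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Illustration only: order a list of datanode addresses so that a node
 * matching the local host (if any) is tried first, while the remaining
 * nodes keep the existing shuffled order.
 */
public final class LocalReadPreference {

  private LocalReadPreference() { }

  public static List<String> preferLocalNode(List<String> datanodeHosts) {
    List<String> ordered = new ArrayList<>(datanodeHosts);
    Collections.shuffle(ordered);

    String localAddress;
    try {
      // Simplified: multi-homed hosts would need to check all local addresses.
      localAddress = InetAddress.getLocalHost().getHostAddress();
    } catch (UnknownHostException e) {
      return ordered; // cannot determine local address, keep shuffled order
    }

    for (int i = 0; i < ordered.size(); i++) {
      if (resolvesTo(ordered.get(i), localAddress)) {
        // Move the local node to the front; the rest stay shuffled.
        ordered.add(0, ordered.remove(i));
        break;
      }
    }
    return ordered;
  }

  private static boolean resolvesTo(String host, String localAddress) {
    try {
      return InetAddress.getByName(host).getHostAddress().equals(localAddress);
    } catch (UnknownHostException e) {
      return false;
    }
  }
}
{code}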