Stephen O'Donnell created HDDS-6093:
---------------------------------------

             Summary: Improve error handling if a container not found during 
replication
                 Key: HDDS-6093
                 URL: https://issues.apache.org/jira/browse/HDDS-6093
             Project: Apache Ozone
          Issue Type: Improvement
          Components: Ozone Datanode
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell


When a data receives a request to download / copy a container, if the container 
does not exist in the ContainerMap on the datanode the caller does not get a 
useful error message. For example, the caller gets a stack trace like:

{code}
2021-12-08 12:46:50,537 ERROR 
org.apache.hadoop.ozone.container.replication.GrpcReplicationClient: Download 
of container 10009 was unsuccessful
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
        at 
org.apache.ratis.thirdparty.io.grpc.Status.asRuntimeException(Status.java:533)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:453)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

To make things worse, on the source datanode, the role log does not get 
anything, and instead we get this stack trace in the stderr output:

{code}
Dec 08, 2021 12:46:50 PM 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor run
SEVERE: Exception while executing runnable 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@62026ae8
java.lang.NullPointerException: Container is not found 10009
        at 
com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897)
        at 
org.apache.hadoop.ozone.container.replication.OnDemandContainerReplicationSource.copyData(OnDemandContainerReplicationSource.java:56)
        at 
org.apache.hadoop.ozone.container.replication.GrpcReplicationService.download(GrpcReplicationService.java:56)
        at 
org.apache.hadoop.hdds.protocol.datanode.proto.IntraDatanodeProtocolServiceGrpc$MethodHandlers.invoke(IntraDatanodeProtocolServiceGrpc.java:219)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
        at 
org.apache.ratis.thirdparty.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at 
org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at 
org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:818)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

The reason, is that a NullPointerException is thrown in 
OnDemandContainerReplicationSource and this is not caught by the caller, 
causing the exception to bubble up to Thread.run(), where it lands in stderr.

The solution is to explicity handle the null container and throw an IOException 
which will be handed and set the response status correctly:

{code}
  public void download(CopyContainerRequestProto request,
      StreamObserver<CopyContainerResponseProto> responseObserver) {
    long containerID = request.getContainerID();
    LOG.info("Streaming container data ({}) to other datanode", containerID);
    try {
      GrpcOutputStream outputStream =
          new GrpcOutputStream(responseObserver, containerID, BUFFER_SIZE);
      source.copyData(containerID, outputStream);
    } catch (IOException e) {
      LOG.error("Error streaming container {}", containerID, e);
      responseObserver.onError(e);
    }
  }
{code}




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to