[ 
https://issues.apache.org/jira/browse/HDDS-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen O'Donnell updated HDDS-6093:
------------------------------------
    Description: 
When a datanode receives a request to download / copy a container, if the 
container does not exist in the ContainerMap on the datanode the caller does 
not get a useful error message. For example, the caller gets a stack trace like:

{code}
2021-12-08 12:46:50,537 ERROR 
org.apache.hadoop.ozone.container.replication.GrpcReplicationClient: Download 
of container 10009 was unsuccessful
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
        at 
org.apache.ratis.thirdparty.io.grpc.Status.asRuntimeException(Status.java:533)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:453)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

To make things worse, on the source datanode, the role log does not get 
anything, and instead we get this stack trace in the stderr output:

{code}
Dec 08, 2021 12:46:50 PM 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor run
SEVERE: Exception while executing runnable 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@62026ae8
java.lang.NullPointerException: Container is not found 10009
        at 
com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897)
        at 
org.apache.hadoop.ozone.container.replication.OnDemandContainerReplicationSource.copyData(OnDemandContainerReplicationSource.java:56)
        at 
org.apache.hadoop.ozone.container.replication.GrpcReplicationService.download(GrpcReplicationService.java:56)
        at 
org.apache.hadoop.hdds.protocol.datanode.proto.IntraDatanodeProtocolServiceGrpc$MethodHandlers.invoke(IntraDatanodeProtocolServiceGrpc.java:219)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
        at 
org.apache.ratis.thirdparty.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at 
org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at 
org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:818)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

The reason, is that a NullPointerException is thrown in 
OnDemandContainerReplicationSource and this is not caught by the caller, 
causing the exception to bubble up to Thread.run(), where it lands in stderr.

The solution is to explicity handle the null container and throw an IOException 
which will be handed and set the response status correctly:

{code}
  public void download(CopyContainerRequestProto request,
      StreamObserver<CopyContainerResponseProto> responseObserver) {
    long containerID = request.getContainerID();
    LOG.info("Streaming container data ({}) to other datanode", containerID);
    try {
      GrpcOutputStream outputStream =
          new GrpcOutputStream(responseObserver, containerID, BUFFER_SIZE);
      source.copyData(containerID, outputStream);
    } catch (IOException e) {
      LOG.error("Error streaming container {}", containerID, e);
      responseObserver.onError(e);
    }
  }
{code}


  was:
When a data receives a request to download / copy a container, if the container 
does not exist in the ContainerMap on the datanode the caller does not get a 
useful error message. For example, the caller gets a stack trace like:

{code}
2021-12-08 12:46:50,537 ERROR 
org.apache.hadoop.ozone.container.replication.GrpcReplicationClient: Download 
of container 10009 was unsuccessful
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
        at 
org.apache.ratis.thirdparty.io.grpc.Status.asRuntimeException(Status.java:533)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:453)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

To make things worse, on the source datanode, the role log does not get 
anything, and instead we get this stack trace in the stderr output:

{code}
Dec 08, 2021 12:46:50 PM 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor run
SEVERE: Exception while executing runnable 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@62026ae8
java.lang.NullPointerException: Container is not found 10009
        at 
com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897)
        at 
org.apache.hadoop.ozone.container.replication.OnDemandContainerReplicationSource.copyData(OnDemandContainerReplicationSource.java:56)
        at 
org.apache.hadoop.ozone.container.replication.GrpcReplicationService.download(GrpcReplicationService.java:56)
        at 
org.apache.hadoop.hdds.protocol.datanode.proto.IntraDatanodeProtocolServiceGrpc$MethodHandlers.invoke(IntraDatanodeProtocolServiceGrpc.java:219)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
        at 
org.apache.ratis.thirdparty.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at 
org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at 
org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:818)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

The reason, is that a NullPointerException is thrown in 
OnDemandContainerReplicationSource and this is not caught by the caller, 
causing the exception to bubble up to Thread.run(), where it lands in stderr.

The solution is to explicity handle the null container and throw an IOException 
which will be handed and set the response status correctly:

{code}
  public void download(CopyContainerRequestProto request,
      StreamObserver<CopyContainerResponseProto> responseObserver) {
    long containerID = request.getContainerID();
    LOG.info("Streaming container data ({}) to other datanode", containerID);
    try {
      GrpcOutputStream outputStream =
          new GrpcOutputStream(responseObserver, containerID, BUFFER_SIZE);
      source.copyData(containerID, outputStream);
    } catch (IOException e) {
      LOG.error("Error streaming container {}", containerID, e);
      responseObserver.onError(e);
    }
  }
{code}



> Improve error handling if a container not found during replication
> ------------------------------------------------------------------
>
>                 Key: HDDS-6093
>                 URL: https://issues.apache.org/jira/browse/HDDS-6093
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Datanode
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>
> When a datanode receives a request to download / copy a container, if the 
> container does not exist in the ContainerMap on the datanode the caller does 
> not get a useful error message. For example, the caller gets a stack trace 
> like:
> {code}
> 2021-12-08 12:46:50,537 ERROR 
> org.apache.hadoop.ozone.container.replication.GrpcReplicationClient: Download 
> of container 10009 was unsuccessful
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
>       at 
> org.apache.ratis.thirdparty.io.grpc.Status.asRuntimeException(Status.java:533)
>       at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:453)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> To make things worse, on the source datanode, the role log does not get 
> anything, and instead we get this stack trace in the stderr output:
> {code}
> Dec 08, 2021 12:46:50 PM 
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor run
> SEVERE: Exception while executing runnable 
> org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@62026ae8
> java.lang.NullPointerException: Container is not found 10009
>       at 
> com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897)
>       at 
> org.apache.hadoop.ozone.container.replication.OnDemandContainerReplicationSource.copyData(OnDemandContainerReplicationSource.java:56)
>       at 
> org.apache.hadoop.ozone.container.replication.GrpcReplicationService.download(GrpcReplicationService.java:56)
>       at 
> org.apache.hadoop.hdds.protocol.datanode.proto.IntraDatanodeProtocolServiceGrpc$MethodHandlers.invoke(IntraDatanodeProtocolServiceGrpc.java:219)
>       at 
> org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
>       at 
> org.apache.ratis.thirdparty.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
>       at 
> org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
>       at 
> org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:818)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> The reason, is that a NullPointerException is thrown in 
> OnDemandContainerReplicationSource and this is not caught by the caller, 
> causing the exception to bubble up to Thread.run(), where it lands in stderr.
> The solution is to explicity handle the null container and throw an 
> IOException which will be handed and set the response status correctly:
> {code}
>   public void download(CopyContainerRequestProto request,
>       StreamObserver<CopyContainerResponseProto> responseObserver) {
>     long containerID = request.getContainerID();
>     LOG.info("Streaming container data ({}) to other datanode", containerID);
>     try {
>       GrpcOutputStream outputStream =
>           new GrpcOutputStream(responseObserver, containerID, BUFFER_SIZE);
>       source.copyData(containerID, outputStream);
>     } catch (IOException e) {
>       LOG.error("Error streaming container {}", containerID, e);
>       responseObserver.onError(e);
>     }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to