[
https://issues.apache.org/jira/browse/HDDS-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephen O'Donnell resolved HDDS-6093.
-------------------------------------
Resolution: Duplicate
> Improve error handling if a container not found during replication
> ------------------------------------------------------------------
>
> Key: HDDS-6093
> URL: https://issues.apache.org/jira/browse/HDDS-6093
> Project: Apache Ozone
> Issue Type: Improvement
> Components: Ozone Datanode
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
>
> When a datanode receives a request to download / copy a container, if the
> container does not exist in the ContainerMap on the datanode the caller does
> not get a useful error message. For example, the caller gets a stack trace
> like:
> {code}
> 2021-12-08 12:46:50,537 ERROR
> org.apache.hadoop.ozone.container.replication.GrpcReplicationClient: Download
> of container 10009 was unsuccessful
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
> at
> org.apache.ratis.thirdparty.io.grpc.Status.asRuntimeException(Status.java:533)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:453)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> To make things worse, on the source datanode, the role log does not get
> anything, and instead we get this stack trace in the stderr output:
> {code}
> Dec 08, 2021 12:46:50 PM
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor run
> SEVERE: Exception while executing runnable
> org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@62026ae8
> java.lang.NullPointerException: Container is not found 10009
> at
> com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897)
> at
> org.apache.hadoop.ozone.container.replication.OnDemandContainerReplicationSource.copyData(OnDemandContainerReplicationSource.java:56)
> at
> org.apache.hadoop.ozone.container.replication.GrpcReplicationService.download(GrpcReplicationService.java:56)
> at
> org.apache.hadoop.hdds.protocol.datanode.proto.IntraDatanodeProtocolServiceGrpc$MethodHandlers.invoke(IntraDatanodeProtocolServiceGrpc.java:219)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
> at
> org.apache.ratis.thirdparty.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
> at
> org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
> at
> org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:818)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> The reason, is that a NullPointerException is thrown in
> OnDemandContainerReplicationSource and this is not caught by the caller,
> causing the exception to bubble up to Thread.run(), where it lands in stderr.
> The solution is to explicity handle the null container and throw an
> IOException which will be handed and set the response status correctly:
> {code}
> public void download(CopyContainerRequestProto request,
> StreamObserver<CopyContainerResponseProto> responseObserver) {
> long containerID = request.getContainerID();
> LOG.info("Streaming container data ({}) to other datanode", containerID);
> try {
> GrpcOutputStream outputStream =
> new GrpcOutputStream(responseObserver, containerID, BUFFER_SIZE);
> source.copyData(containerID, outputStream);
> } catch (IOException e) {
> LOG.error("Error streaming container {}", containerID, e);
> responseObserver.onError(e);
> }
> }
> {code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]