Stephen O'Donnell created HDDS-6093:
---------------------------------------
Summary: Improve error handling if a container not found during
replication
Key: HDDS-6093
URL: https://issues.apache.org/jira/browse/HDDS-6093
Project: Apache Ozone
Issue Type: Improvement
Components: Ozone Datanode
Reporter: Stephen O'Donnell
Assignee: Stephen O'Donnell
When a datanode receives a request to download / copy a container, if the
container does not exist in the ContainerMap on the datanode, the caller does
not get a useful error message. For example, the caller gets a stack trace like:
{code}
2021-12-08 12:46:50,537 ERROR org.apache.hadoop.ozone.container.replication.GrpcReplicationClient: Download of container 10009 was unsuccessful
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
    at org.apache.ratis.thirdparty.io.grpc.Status.asRuntimeException(Status.java:533)
    at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:453)
    at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
    at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
    at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
    at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
    at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
    at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
    at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}
To make things worse, on the source datanode nothing is written to the role
log, and instead we get this stack trace in the stderr output:
{code}
Dec 08, 2021 12:46:50 PM org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor run
SEVERE: Exception while executing runnable org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@62026ae8
java.lang.NullPointerException: Container is not found 10009
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897)
    at org.apache.hadoop.ozone.container.replication.OnDemandContainerReplicationSource.copyData(OnDemandContainerReplicationSource.java:56)
    at org.apache.hadoop.ozone.container.replication.GrpcReplicationService.download(GrpcReplicationService.java:56)
    at org.apache.hadoop.hdds.protocol.datanode.proto.IntraDatanodeProtocolServiceGrpc$MethodHandlers.invoke(IntraDatanodeProtocolServiceGrpc.java:219)
    at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
    at org.apache.ratis.thirdparty.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
    at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
    at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
    at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
    at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:818)
    at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}
The reason is that a NullPointerException is thrown in
OnDemandContainerReplicationSource and this is not caught by the caller,
causing the exception to bubble up to Thread.run(), where it lands in stderr.
The solution is to explicitly handle the null container and throw an
IOException, which will be handled and set the response status correctly:
{code}
public void download(CopyContainerRequestProto request,
    StreamObserver<CopyContainerResponseProto> responseObserver) {
  long containerID = request.getContainerID();
  LOG.info("Streaming container data ({}) to other datanode", containerID);
  try {
    GrpcOutputStream outputStream =
        new GrpcOutputStream(responseObserver, containerID, BUFFER_SIZE);
    source.copyData(containerID, outputStream);
  } catch (IOException e) {
    LOG.error("Error streaming container {}", containerID, e);
    responseObserver.onError(e);
  }
}
{code}
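The null check itself might look roughly like the sketch below. This is not the actual patch: ContainerLookupSketch, its byte[]-valued map, and the String-returning download() are hypothetical stand-ins that use only the JDK, so the example runs without Ozone or gRPC on the classpath. It only illustrates the idea of replacing the NullPointerException with an IOException that the existing catch block can see.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for OnDemandContainerReplicationSource; not Ozone code.
class ContainerLookupSketch {
  // Stand-in for the datanode's container map: container ID -> container bytes.
  private final Map<Long, byte[]> containers = new ConcurrentHashMap<>();

  void addContainer(long containerId, byte[] data) {
    containers.put(containerId, data);
  }

  // Explicit null handling: a missing container becomes a checked IOException
  // rather than a NullPointerException from a precondition check.
  void copyData(long containerId, OutputStream destination) throws IOException {
    byte[] data = containers.get(containerId);
    if (data == null) {
      throw new IOException("Container " + containerId + " is not found");
    }
    destination.write(data);
  }

  // Mirrors the shape of download(): the IOException is caught and reported
  // to the caller instead of escaping the handler.
  String download(long containerId) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
      copyData(containerId, out);
      return "OK: " + out.size() + " bytes";
    } catch (IOException e) {
      return "ERROR: " + e.getMessage();
    }
  }
}
```

With this shape, a request for a missing container takes the catch (IOException e) path, so the failure can be logged on the source datanode and passed to the caller through the response rather than surfacing only in stderr.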