[
https://issues.apache.org/jira/browse/HDDS-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephen O'Donnell updated HDDS-6093:
------------------------------------
Description:
When a datanode receives a request to download / copy a container, if the
container does not exist in the ContainerMap on the datanode the caller does
not get a useful error message. For example, the caller gets a stack trace like:
{code}
2021-12-08 12:46:50,537 ERROR org.apache.hadoop.ozone.container.replication.GrpcReplicationClient: Download of container 10009 was unsuccessful
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
        at org.apache.ratis.thirdparty.io.grpc.Status.asRuntimeException(Status.java:533)
        at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:453)
        at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
        at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
        at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
        at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
        at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
        at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
        at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}
To make things worse, on the source datanode nothing is written to the role
log; instead, this stack trace appears in the stderr output:
{code}
Dec 08, 2021 12:46:50 PM org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor run
SEVERE: Exception while executing runnable org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@62026ae8
java.lang.NullPointerException: Container is not found 10009
        at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897)
        at org.apache.hadoop.ozone.container.replication.OnDemandContainerReplicationSource.copyData(OnDemandContainerReplicationSource.java:56)
        at org.apache.hadoop.ozone.container.replication.GrpcReplicationService.download(GrpcReplicationService.java:56)
        at org.apache.hadoop.hdds.protocol.datanode.proto.IntraDatanodeProtocolServiceGrpc$MethodHandlers.invoke(IntraDatanodeProtocolServiceGrpc.java:219)
        at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
        at org.apache.ratis.thirdparty.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
        at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
        at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:818)
        at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}
The reason is that a NullPointerException is thrown in
OnDemandContainerReplicationSource and is not caught by the caller,
causing the exception to bubble up to Thread.run(), where it lands in stderr.
The solution is to explicitly handle the null container and throw an
IOException, which will be handled by the caller and set the response status
correctly:
{code}
public void download(CopyContainerRequestProto request,
    StreamObserver<CopyContainerResponseProto> responseObserver) {
  long containerID = request.getContainerID();
  LOG.info("Streaming container data ({}) to other datanode", containerID);
  try {
    GrpcOutputStream outputStream =
        new GrpcOutputStream(responseObserver, containerID, BUFFER_SIZE);
    source.copyData(containerID, outputStream);
  } catch (IOException e) {
    // Only IOException is handled here; the unchecked NullPointerException
    // raised for a missing container bypasses this block.
    LOG.error("Error streaming container {}", containerID, e);
    responseObserver.onError(e);
  }
}
{code}
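As a rough illustration of the proposed change (a sketch only, not the actual patch), the lookup in copyData could replace the Preconditions.checkNotNull call with an explicit null check that throws a checked IOException, so that the catch block in download() logs the error and calls responseObserver.onError(). The class ContainerSourceSketch, its in-memory map, and the addContainer helper below are hypothetical stand-ins for OnDemandContainerReplicationSource and the ContainerMap:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the replication source; not the real Ozone class. */
class ContainerSourceSketch {
  // Stand-in for the datanode's ContainerMap: container ID -> container bytes.
  private final Map<Long, byte[]> containers = new ConcurrentHashMap<>();

  /** Hypothetical helper used only to populate the sketch. */
  void addContainer(long containerId, byte[] data) {
    containers.put(containerId, data);
  }

  /**
   * Copies the container data to the destination stream. A missing container
   * is reported as a checked IOException instead of a NullPointerException,
   * so a caller that catches IOException can log it and report the failure.
   */
  void copyData(long containerId, OutputStream destination) throws IOException {
    byte[] data = containers.get(containerId);
    if (data == null) {
      throw new IOException("Container " + containerId + " is not found");
    }
    destination.write(data);
  }
}
```

With a check of this kind, the failure would pass through the existing catch (IOException e) block in download() rather than escaping to the gRPC executor thread.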
> Improve error handling if a container not found during replication
> ------------------------------------------------------------------
>
> Key: HDDS-6093
> URL: https://issues.apache.org/jira/browse/HDDS-6093
> Project: Apache Ozone
> Issue Type: Improvement
> Components: Ozone Datanode
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major