adoroszlai opened a new pull request, #4334:
URL: https://github.com/apache/ozone/pull/4334
## What changes were proposed in this pull request?
`KeyValueHandler` in general does not handle unchecked exceptions. The
exception reaches gRPC, which closes the connection abruptly. As a result,
client only gets some generic exception (`StatusRuntimeException: UNKNOWN`).
Steps to reproduce:
1. Change `BlockManagerImpl.persistPutBlock` to throw some NPE
2. Start cluster with 5 nodes
3. Create EC key: `ozone freon ockg -n1 -t1 --replication-type EC
--replication rs-3-2-1024k -s100`
Result in datanode log:
```
2023-03-01 17:54:22,957 [ChunkReader-1] ERROR server.GrpcXceiverService: Got
exception when processing ContainerCommandRequestProto cmdType: PutBlock
traceID: ""
containerID: 1
datanodeUuid: "1c284e9e-d128-4c95-bded-7a1d75eecd30"
putBlock {
blockData {
blockID {
containerID: 1
localID: 111677748019200001
blockCommitSequenceId: 0
replicaIndex: 2
}
metadata {
key: "TYPE"
value: "KEY"
}
metadata {
key: "blockGroupLen"
value: "100"
}
}
eof: true
}
java.lang.NullPointerException: testing
at
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.persistPutBlock(BlockManagerImpl.java:153)
at
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.putBlock(BlockManagerImpl.java:103)
at
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handlePutBlock(KeyValueHandler.java:539)
at
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:243)
at
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:219)
at
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:320)
at
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:171)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:170)
at
org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:57)
at
org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:50)
at
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
at
org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
at
org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49)
at
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:332)
at
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:315)
at
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:834)
at
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
```
(Also `IllegalStateException: call already closed` before HDDS-8019.)
Result in client:
```
2023-03-01 17:54:22,966 [pool-4-thread-1] WARN io.KeyOutputStream: Put block
failed: F F F F F
2023-03-01 17:54:22,966 [pool-4-thread-1] WARN io.KeyOutputStream: Failure
for replica index: 1, DatanodeDetails:
ef68201e-a372-4f68-8fde-10a7fbe96505(ozone_datanode_4.ozone_default/172.20.0.10)
java.io.IOException: Unexpected Storage Container Exception:
java.util.concurrent.CompletionException:
java.util.concurrent.CompletionException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
at
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.setIoException(BlockOutputStream.java:632)
at
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.lambda$executePutBlock$2(ECBlockOutputStream.java:252)
at
java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
at
java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
at
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
at
org.apache.hadoop.hdds.scm.XceiverClientGrpc$1.onError(XceiverClientGrpc.java:538)
at
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
at
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
at
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
at
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
at
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.CompletionException:
java.util.concurrent.CompletionException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
at
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.lambda$executePutBlock$2(ECBlockOutputStream.java:251)
... 15 more
Caused by: java.util.concurrent.CompletionException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
at
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
at
java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
at
java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632)
... 13 more
Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNKNOWN
at
org.apache.ratis.thirdparty.io.grpc.Status.asRuntimeException(Status.java:535)
... 10 more
...
```
This PR proposes to change datanode to catch unchecked exceptions and return
an error response, just like it does for checked exceptions
(`StorageContainerException`, `IOException`).
This allows the client to close the connection normally, getting rid of the
`call already completed` error.
It also provides more info at the client about the cause of the error, so we
know what to look for in the server logs.
At the server-side, `HddsDispatcher.dispatchRequest` method can continue to
execute its post-process steps (mark the container unhealthy in this case):
https://github.com/apache/ozone/blob/3d648466ed53e784b0fd34399cadaaf53894f10b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java#L320-L398
https://issues.apache.org/jira/browse/HDDS-8057
## How was this patch tested?
Same repro steps.
Result in datanode:
```
2023-03-01 20:14:20,539 [ChunkReader-4] INFO keyvalue.KeyValueHandler:
Operation: PutBlock , Trace ID: , Message: java.lang.NullPointerException:
testing , Result: CONTAINER_INTERNAL_ERROR , StorageContainerException Occurred.
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
java.lang.NullPointerException: testing
at
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:223)
at
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:320)
at
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:171)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:170)
at
org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:57)
at
org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:50)
at
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
at
org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
at
org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49)
at
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:332)
at
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:315)
at
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:834)
at
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException: testing
at
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.persistPutBlock(BlockManagerImpl.java:153)
at
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.putBlock(BlockManagerImpl.java:103)
at
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handlePutBlock(KeyValueHandler.java:545)
at
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:249)
at
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:220)
... 17 more
2023-03-01 20:14:20,544 [ChunkReader-4] WARN keyvalue.KeyValueContainer:
Moving container
/data/hdds/hdds/CID-ba6d0413-04c1-4b2c-bdf5-b31ac39204fb/current/containerDir0/1
to state UNHEALTHY from state:OPEN
2023-03-01 20:14:20,544 [ChunkReader-4] INFO impl.HddsDispatcher: Marked
Container UNHEALTHY, ContainerID: 1
```
Result in client:
```
2023-03-01 20:14:20,712 [pool-4-thread-1] WARN io.KeyOutputStream: Put block
failed: F F F F F
2023-03-01 20:14:20,713 [pool-4-thread-1] WARN io.KeyOutputStream: Failure
for replica index: 1, DatanodeDetails:
a4e41040-b1d2-4c09-b9d8-636d518bc9b7(ozone_datanode_5.ozone_default/172.28.0.10)
java.io.IOException: Unexpected Storage Container Exception:
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
java.lang.NullPointerException: testing
at
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.setIoException(BlockOutputStream.java:632)
at
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.validateResponse(ECBlockOutputStream.java:303)
at
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.lambda$executePutBlock$1(ECBlockOutputStream.java:234)
at
java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:642)
at
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by:
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
java.lang.NullPointerException: testing
at
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.validateContainerResponse(ContainerProtocolCalls.java:631)
at
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.validateResponse(ECBlockOutputStream.java:301)
... 6 more
...
```
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4307627728
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]