adoroszlai opened a new pull request, #4334:
URL: https://github.com/apache/ozone/pull/4334

   ## What changes were proposed in this pull request?
   
   `KeyValueHandler` in general does not handle unchecked exceptions.  The 
exception reaches gRPC, which closes the connection abruptly.  As a result, 
client only gets some generic exception (`StatusRuntimeException: UNKNOWN`).
   
   Steps to reproduce:
   
   1. Change `BlockManagerImpl.persistPutBlock` to throw some NPE
   2. Start cluster with 5 nodes
   3. Create EC key: `ozone freon ockg -n1 -t1 --replication-type EC 
--replication rs-3-2-1024k -s100`
   
   Result in datanode log:
   
   ```
   2023-03-01 17:54:22,957 [ChunkReader-1] ERROR server.GrpcXceiverService: Got 
exception when processing ContainerCommandRequestProto cmdType: PutBlock
   traceID: ""
   containerID: 1
   datanodeUuid: "1c284e9e-d128-4c95-bded-7a1d75eecd30"
   putBlock {
     blockData {
       blockID {
         containerID: 1
         localID: 111677748019200001
         blockCommitSequenceId: 0
         replicaIndex: 2
       }
       metadata {
         key: "TYPE"
         value: "KEY"
       }
       metadata {
         key: "blockGroupLen"
         value: "100"
       }
     }
     eof: true
   }
   
   java.lang.NullPointerException: testing
        at 
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.persistPutBlock(BlockManagerImpl.java:153)
        at 
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.putBlock(BlockManagerImpl.java:103)
        at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handlePutBlock(KeyValueHandler.java:539)
        at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:243)
        at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:219)
        at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:320)
        at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:171)
        at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
        at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:170)
        at 
org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:57)
        at 
org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:50)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
        at 
org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
        at 
org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:332)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:315)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:834)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
   ```
   
   (Also `IllegalStateException: call already closed` before HDDS-8019.)
   
   Result in client:
   
   ```
   2023-03-01 17:54:22,966 [pool-4-thread-1] WARN io.KeyOutputStream: Put block 
failed: F F F F F
   2023-03-01 17:54:22,966 [pool-4-thread-1] WARN io.KeyOutputStream: Failure 
for replica index: 1, DatanodeDetails: 
ef68201e-a372-4f68-8fde-10a7fbe96505(ozone_datanode_4.ozone_default/172.20.0.10)
   java.io.IOException: Unexpected Storage Container Exception: 
java.util.concurrent.CompletionException: 
java.util.concurrent.CompletionException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
        at 
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.setIoException(BlockOutputStream.java:632)
        at 
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.lambda$executePutBlock$2(ECBlockOutputStream.java:252)
        at 
java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
        at 
java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
        at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
        at 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
        at 
org.apache.hadoop.hdds.scm.XceiverClientGrpc$1.onError(XceiverClientGrpc.java:538)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
   Caused by: java.util.concurrent.CompletionException: 
java.util.concurrent.CompletionException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
        at 
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.lambda$executePutBlock$2(ECBlockOutputStream.java:251)
        ... 15 more
   Caused by: java.util.concurrent.CompletionException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
        at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
        at 
java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
        at 
java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632)
        ... 13 more
   Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNKNOWN
        at 
org.apache.ratis.thirdparty.io.grpc.Status.asRuntimeException(Status.java:535)
        ... 10 more
   ...
   ```
   
   This PR proposes to change datanode to catch unchecked exceptions and return 
an error response, just like it does for checked exceptions 
(`StorageContainerException`, `IOException`).
   
   This allows the client to close the connection normally, getting rid of the 
`call already completed` error.
   
   It also provides more info at the client about the cause of the error, so we 
know what to look for in the server logs.
   
   At the server-side, `HddsDispatcher.dispatchRequest` method can continue to 
execute its post-process steps (mark the container unhealthy in this case):
   
   
https://github.com/apache/ozone/blob/3d648466ed53e784b0fd34399cadaaf53894f10b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java#L320-L398
   
   https://issues.apache.org/jira/browse/HDDS-8057
   
   ## How was this patch tested?
   
   Same repro steps.
   
   Result in datanode:
   
   ```
   2023-03-01 20:14:20,539 [ChunkReader-4] INFO keyvalue.KeyValueHandler: 
Operation: PutBlock , Trace ID:  , Message: java.lang.NullPointerException: 
testing , Result: CONTAINER_INTERNAL_ERROR , StorageContainerException Occurred.
   
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: 
java.lang.NullPointerException: testing
        at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:223)
        at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:320)
        at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:171)
        at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
        at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:170)
        at 
org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:57)
        at 
org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:50)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
        at 
org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
        at 
org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:332)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:315)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:834)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
   Caused by: java.lang.NullPointerException: testing
        at 
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.persistPutBlock(BlockManagerImpl.java:153)
        at 
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.putBlock(BlockManagerImpl.java:103)
        at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handlePutBlock(KeyValueHandler.java:545)
        at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:249)
        at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:220)
        ... 17 more
   2023-03-01 20:14:20,544 [ChunkReader-4] WARN keyvalue.KeyValueContainer: 
Moving container 
/data/hdds/hdds/CID-ba6d0413-04c1-4b2c-bdf5-b31ac39204fb/current/containerDir0/1
 to state UNHEALTHY from state:OPEN
   2023-03-01 20:14:20,544 [ChunkReader-4] INFO impl.HddsDispatcher: Marked 
Container UNHEALTHY, ContainerID: 1
   ```
   
   Result in client:
   
   ```
   2023-03-01 20:14:20,712 [pool-4-thread-1] WARN io.KeyOutputStream: Put block 
failed: F F F F F
   2023-03-01 20:14:20,713 [pool-4-thread-1] WARN io.KeyOutputStream: Failure 
for replica index: 1, DatanodeDetails: 
a4e41040-b1d2-4c09-b9d8-636d518bc9b7(ozone_datanode_5.ozone_default/172.28.0.10)
   java.io.IOException: Unexpected Storage Container Exception: 
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: 
java.lang.NullPointerException: testing
        at 
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.setIoException(BlockOutputStream.java:632)
        at 
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.validateResponse(ECBlockOutputStream.java:303)
        at 
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.lambda$executePutBlock$1(ECBlockOutputStream.java:234)
        at 
java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:642)
        at 
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
   Caused by: 
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: 
java.lang.NullPointerException: testing
        at 
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.validateContainerResponse(ContainerProtocolCalls.java:631)
        at 
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.validateResponse(ECBlockOutputStream.java:301)
        ... 6 more
   ...
   ```
   
   https://github.com/adoroszlai/hadoop-ozone/actions/runs/4307627728


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to