[ 
https://issues.apache.org/jira/browse/HDDS-14964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun Sarin updated HDDS-14964:
------------------------------
    Priority: Critical  (was: Major)

> Write failures with ContainerNotOpenException during large file (100GB) 
> concurrent parallel writes at high cluster utilization (~90%)
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-14964
>                 URL: https://issues.apache.org/jira/browse/HDDS-14964
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Arun Sarin
>            Priority: Critical
>
> During a large-scale deletion test, the cluster was being filled with 100GB 
> files nested across various directory structures. Writes were progressing 
> successfully until ~90% disk utilization was reached on approximately 80% of 
> the DataNodes. At that point, all large file writes started failing with 
> {{ContainerNotOpenException}}.
> Key observation: Small file writes on the same cluster succeeded without 
> issue during the same time window. Only concurrent writes of large files 
> (100GB) were failing. This suggests the problem is not that the cluster has 
> no writable containers, but that the client, while in the middle of writing 
> a large file, encounters a container that has transitioned to {{CLOSED}} 
> state and cannot recover or get a new container allocated.
> *Root cause hypothesis:* When writing a 100GB file, many chunks are sent 
> sequentially to the same container. At high cluster utilization, a container 
> can transition to {{CLOSED}} state mid-write (e.g., due to being full or the 
> SCM closing it). The client hits {{ContainerNotOpenException}} on a 
> {{WriteChunk}} call. While retry/failover logic exists, under high disk 
> pressure the client is unable to get a new OPEN container allocated in time, 
> causing the write to fail entirely.
> *Error stack:*
> {code:java}
> java.util.concurrent.CompletionException: 
> org.apache.ratis.protocol.exceptions.StateMachineException: 
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException 
> from Server 098be32c-a26e-4fe9-a491-c7d17b9bb04b@group-D9554F414C68: 
> Container 20677 in CLOSED state
>       at 
> org.apache.ratis.client.impl.RaftClientImpl.handleRaftException(RaftClientImpl.java:373)
>       at 
> org.apache.ratis.client.impl.OrderedAsync.lambda$send$3(OrderedAsync.java:175)
>       at 
> java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
>       at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
>       at 
> java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
>       at 
> org.apache.ratis.client.impl.OrderedAsync$PendingOrderedRequest.setReply(OrderedAsync.java:105)
>       at 
> org.apache.ratis.client.impl.OrderedAsync$PendingOrderedRequest.setReply(OrderedAsync.java:66)
>       at 
> org.apache.ratis.util.SlidingWindow$RequestMap.setReply(SlidingWindow.java:147)
>       at 
> org.apache.ratis.util.SlidingWindow$Client.receiveReply(SlidingWindow.java:351)
>       at 
> org.apache.ratis.client.impl.OrderedAsync.lambda$sendRequestWithRetry$5(OrderedAsync.java:210)
>       at 
> java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:718)
>       at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
>       at 
> java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
>       at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.lambda$onNext$0(GrpcClientProtocolClient.java:325)
>       at java.base/java.util.Optional.ifPresent(Optional.java:178)
>       at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.handleReplyFuture(GrpcClientProtocolClient.java:381)
>       at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:325)
>       at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:308)
>       at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:551)
>       at 
> org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onMessage(ForwardingClientCallListener.java:33)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:661)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:648)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>       at java.base/java.lang.Thread.run(Thread.java:840)
> Caused by: org.apache.ratis.protocol.exceptions.StateMachineException: 
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException 
> from Server 098be32c-a26e-4fe9-a491-c7d17b9bb04b@group-D9554F414C68: 
> Container 20677 in CLOSED state
>       at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.validateContainerCommand(HddsDispatcher.java:581)
>       at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.startTransaction(ContainerStateMachine.java:488)
>       at 
> org.apache.ratis.server.impl.RaftServerImpl.writeAsyncImpl(RaftServerImpl.java:987)
>       at 
> org.apache.ratis.server.impl.RaftServerImpl.writeAsync(RaftServerImpl.java:960)
>       at 
> org.apache.ratis.server.impl.RaftServerImpl.replyFuture(RaftServerImpl.java:953)
>       at 
> org.apache.ratis.server.impl.RaftServerImpl.submitClientRequestAsync(RaftServerImpl.java:930)
>       at 
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitClientRequestAsync$11(RaftServerImpl.java:919)
>       at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
>       at 
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitClientRequestAsync$12(RaftServerImpl.java:919)
>       at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
>       ... 3 more
> Caused by: 
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException:
>  Container 20677 in CLOSED state
>       at 
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>  Method)
>       at 
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
>       at 
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>       at 
> java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
>       at 
> java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481)
>       at 
> org.apache.ratis.util.ReflectionUtils.instantiateException(ReflectionUtils.java:265)
>       at 
> org.apache.ratis.client.impl.ClientProtoUtils.toStateMachineException(ClientProtoUtils.java:455)
>       at 
> org.apache.ratis.client.impl.ClientProtoUtils.toStateMachineException(ClientProtoUtils.java:441)
>       at 
> org.apache.ratis.client.impl.ClientProtoUtils.toRaftClientReply(ClientProtoUtils.java:408)
>       at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:313)
>       at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:308)
>       at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:551)
>       at 
> org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onMessage(ForwardingClientCallListener.java:33)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:661)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:648)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
>       at 
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
>       ... 3 more
> 26/03/02 23:09:43 ERROR impl.OrderedAsync: Failed to send request, 
> message=cmdType: WriteChunk
> traceID: ""
> containerID: 20677
> datanodeUuid: "c15cc65e-f4eb-43f0-b27c-6b6a3fbc5a86"
> writeChunk {
>   blockID {
>     containerID: 20677
>     localID: 117883640217932227
>     blockCommitSequenceId: 686720
>     replicaIndex: 0
>   }
>   chunkData {
>     chunkName: "117883640217932227_chunk_53"
>     offset: 218103808
>     len: 4194304
>     checksumData {
>       type: CRC32
>       bytesPerChecksum: 1048576
>       checksums: "\216\301\375\321"
>       checksums: "\300\2022\n"
>       checksums: "\334\002L0"
>       checksums: "\335m=\322"
>     }
>   }
> } {code}
> *Steps to Reproduce:*
>  # Set up an Apache Ozone cluster with sufficient DataNodes (e.g., 10+).
>  # Write 100GB files concurrently across various directory structures using 
> Freon (or similar tool) targeting petabyte-scale data.
>  # Continue writing until approximately 80–90% disk utilization is reached on 
> the majority of DataNodes.
>  # Observe that large file writes begin failing with 
> {{ContainerNotOpenException}}, while small file writes on the same 
> cluster continue to succeed.
> *Expected Behavior:*
> Large file writes should either:
>  * Successfully obtain a new OPEN container when the current container 
> transitions to CLOSED mid-write and continue writing, OR
>  * Fail early with a clear, user-facing error (e.g., "Insufficient cluster 
> space") before the write is initiated, rather than failing mid-stream after 
> chunks have already been sent.
> *Actual Behavior:*
> Large file (100GB) writes fail mid-stream with:
> java.util.concurrent.CompletionException: 
> org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException
> from Server 098be32c-a26e-4fe9-a491-c7d17b9bb04b@group-D9554F414C68: 
> Container 20677 in CLOSED state
> at 
> org.apache.ratis.client.impl.RaftClientImpl.handleRaftException(RaftClientImpl.java:373)
> at 
> org.apache.ratis.client.impl.OrderedAsync.lambda$send$3(OrderedAsync.java:175)
> ...
> Caused by: 
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException:
>  Container 20677 in CLOSED state
> at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.validateContainerCommand(HddsDispatcher.java:581)
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.startTransaction(ContainerStateMachine.java:488)
> ...
>  
> Failed {{WriteChunk}} request details:
>  * Container ID: 20677
>  * Block local ID: 117883640217932227
>  * Chunk: {{117883640217932227_chunk_53}}, offset: 218103808, len: 4194304
> *Proposed Improvement:*
> Similar to how quota validation is performed before a key {{PUT}} (checking 
> namespace/space quota against available capacity), a pre-write cluster-level 
> space check should be introduced before initiating a large file write. 
> Specifically:
>  * Before allocating block pipelines for a write, check whether the cluster 
> has sufficient OPEN/allocable containers to accommodate the expected write 
> size.
>  * If the cluster is approaching full capacity and cannot allocate the 
> required containers, fail fast with a clear error (e.g., 
> {{ClusterStorageFullException}} or similar) rather than allowing the write to 
> proceed and fail mid-stream with a confusing 
> {{ContainerNotOpenException}}.
> This would provide better UX, avoid partial writes/orphaned blocks, and align 
> with the existing quota-check pattern already present at the OM layer.
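
The fail-fast check proposed above could look roughly like the sketch below. It is only an illustration under stated assumptions, not Ozone code: the class `PreWriteSpaceCheck`, its methods, and the per-node free-space list are hypothetical, `ClusterStorageFullException` is just the name suggested in the proposal, and the 5 GiB container size and replication factor 3 are assumed values for the example. The capacity estimate is deliberately optimistic because it ignores placement constraints such as replicas needing distinct nodes.

```java
import java.util.List;

/** Hypothetical sketch of a pre-write space check; none of these names exist in Apache Ozone. */
public class PreWriteSpaceCheck {

    /** Hypothetical exception, named after the suggestion in the proposal. */
    public static class ClusterStorageFullException extends Exception {
        public ClusterStorageFullException(String message) {
            super(message);
        }
    }

    /** Containers required to hold writeBytes (ceiling division). */
    public static long containersNeeded(long writeBytes, long containerBytes) {
        if (writeBytes < 0 || containerBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        return (writeBytes + containerBytes - 1) / containerBytes;
    }

    /**
     * Optimistic upper bound on new containers the cluster could host:
     * count whole-container replica slots per node, then divide by the
     * replication factor. Placement rules are ignored (assumption).
     */
    public static long allocatableContainers(List<Long> freeBytesPerNode,
                                             long containerBytes,
                                             int replication) {
        long replicaSlots = 0;
        for (long free : freeBytesPerNode) {
            replicaSlots += Math.max(0L, free) / containerBytes;
        }
        return replicaSlots / replication;
    }

    /** Rejects the write up front if the estimate says it cannot fit. */
    public static void check(long writeBytes,
                             List<Long> freeBytesPerNode,
                             long containerBytes,
                             int replication) throws ClusterStorageFullException {
        long needed = containersNeeded(writeBytes, containerBytes);
        long available = allocatableContainers(freeBytesPerNode, containerBytes, replication);
        if (needed > available) {
            throw new ClusterStorageFullException(
                "Insufficient cluster space: need " + needed
                    + " containers, at most " + available + " allocatable");
        }
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        // Assumed free space on four nearly full DataNodes.
        List<Long> free = List.of(12 * gib, 7 * gib, 30 * gib, 4 * gib);
        try {
            check(100 * gib, free, 5 * gib, 3);
            System.out.println("write admitted");
        } catch (ClusterStorageFullException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the assumed numbers, a 100 GiB write needs 20 containers while the four nodes offer at most 3, so the write is rejected before any chunk is sent. A check like this can only be advisory, since free space can change between the check and the write, so the mid-write recovery path would still be needed.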


