[
https://issues.apache.org/jira/browse/HDDS-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850133#comment-17850133
]
Wei-Chiu Chuang commented on HDDS-10740:
----------------------------------------
According to [~ashishkr] the problem arises because HBase aborted before Ozone
timeout and triggered failure recovery at Ozone level.
hdds.ratis.raft.client.rpc.watch.request.timeout default value is 180 seconds.
Updating it to a smaller number like 10 seconds helped overcome the issue. IMO
10 seconds is too low to tolerate temporary unavailability. The HDFS client
write acknowledgement timeout is 60 seconds by default so we should be good as
long as we are consistent with HDFS.
> [Hbase-Ozone] HMaster down due to "ContainerNotOpenException: Container in
> CLOSED state"
> ----------------------------------------------------------------------------------------
>
> Key: HDDS-10740
> URL: https://issues.apache.org/jira/browse/HDDS-10740
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode, SCM
> Reporter: Pratyush Bhatt
> Priority: Major
>
> HMaster abruptly crashes down, checked the logs, just before the crash logs
> like this are there:
> {code:java}
> java.util.concurrent.CompletionException:
> org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException
> from Server 50954181-a303-4e2f-aca5-c70f235191f1@group-9CCD951DCB08:
> Container 2061 in CLOSED state {code}
> Full related log:
> {code:java}
> 2024-04-22 08:05:26,709 ERROR org.apache.ratis.client.impl.OrderedAsync:
> Failed to send request, message=cmdType: WriteChunk
> traceID: ""
> containerID: 2061
> datanodeUuid: "a3535b74-fc72-443e-b66d-cb0da825c469"
> writeChunk {
> blockID {
> containerID: 2061
> localID: 113750153625619822
> blockCommitSequenceId: 18190201
> replicaIndex: 0
> }
> chunkData {
> chunkName: "113750153625619822_chunk_2699"
> offset: 1497659
> len: 98
> checksumData {
> type: CRC32
> bytesPerChecksum: 16384
> checksums: "U\321\246\212"
> }
> }
> }
> encodedToken:
> "VwoFaGJhc2USJWNvbklEOiAyMDYxIGxvY0lEOiAxMTM3NTAxNTM2MjU2MTk4MjIY5ILNz_AxKAEoAjCAgICAAToWCLud9efY8qvz5QEQlLemrcCwiumPASCvEZ-NqRpFuh6-H1ottQt1_14NiKrfTck8ZuC5FzTX6xBIRERTX0JMT0NLX1RPS0VOLGNvbklEOiAyMDYxIGxvY0lEOiAxMTM3NTAxNTM2MjU2MTk4MjIAAAAAAAAA"
> , data.size=98
> java.util.concurrent.CompletionException:
> org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException
> from Server 50954181-a303-4e2f-aca5-c70f235191f1@group-9CCD951DCB08:
> Container 2061 in CLOSED state
> at
> org.apache.ratis.client.impl.RaftClientImpl.handleRaftException(RaftClientImpl.java:374)
> at
> org.apache.ratis.client.impl.OrderedAsync.lambda$send$3(OrderedAsync.java:173)
> at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> at
> org.apache.ratis.client.impl.OrderedAsync$PendingOrderedRequest.setReply(OrderedAsync.java:99)
> at
> org.apache.ratis.client.impl.OrderedAsync$PendingOrderedRequest.setReply(OrderedAsync.java:60)
> at
> org.apache.ratis.util.SlidingWindow$RequestMap.setReply(SlidingWindow.java:144)
> at
> org.apache.ratis.util.SlidingWindow$Client.receiveReply(SlidingWindow.java:348)
> at
> org.apache.ratis.client.impl.OrderedAsync.lambda$sendRequest$8(OrderedAsync.java:243)
> at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.lambda$onNext$0(GrpcClientProtocolClient.java:322)
> at java.util.Optional.ifPresent(Optional.java:159)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.handleReplyFuture(GrpcClientProtocolClient.java:378)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.access$100(GrpcClientProtocolClient.java:300)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:322)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:305)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:468)
> at
> org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onMessage(ForwardingClientCallListener.java:33)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.DelayedClientCall$DelayedListener.onMessage(DelayedClientCall.java:473)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:660)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:647)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException
> from Server 50954181-a303-4e2f-aca5-c70f235191f1@group-9CCD951DCB08:
> Container 2061 in CLOSED state
> at
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.validateContainerCommand(HddsDispatcher.java:560)
> at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.startTransaction(ContainerStateMachine.java:415)
> at
> org.apache.ratis.server.impl.RaftServerImpl.writeAsync(RaftServerImpl.java:941)
> at
> org.apache.ratis.server.impl.RaftServerImpl.replyFuture(RaftServerImpl.java:919)
> at
> org.apache.ratis.server.impl.RaftServerImpl.submitClientRequestAsync(RaftServerImpl.java:896)
> at
> org.apache.ratis.server.impl.RaftServerImpl.lambda$null$11(RaftServerImpl.java:885)
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:117)
> at
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitClientRequestAsync$12(RaftServerImpl.java:885)
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
> ... 3 more
> Caused by:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException:
> Container 2061 in CLOSED state
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at
> org.apache.ratis.util.ReflectionUtils.instantiateException(ReflectionUtils.java:259)
> at
> org.apache.ratis.client.impl.ClientProtoUtils.toStateMachineException(ClientProtoUtils.java:449)
> at
> org.apache.ratis.client.impl.ClientProtoUtils.toStateMachineException(ClientProtoUtils.java:435)
> at
> org.apache.ratis.client.impl.ClientProtoUtils.toRaftClientReply(ClientProtoUtils.java:402)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:310)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:305)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:468)
> at
> org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onMessage(ForwardingClientCallListener.java:33)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.DelayedClientCall$DelayedListener.onMessage(DelayedClientCall.java:473)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:660)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:647)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
> ... 3 more
> 2024-04-22 08:05:26,832 ERROR org.apache.ratis.client.impl.OrderedAsync:
> Failed to send request, message=cmdType: PutBlock
> traceID: ""
> containerID: 2061
> datanodeUuid: "a3535b74-fc72-443e-b66d-cb0da825c469"
> putBlock {
> blockData {
> blockID {
> containerID: 2061
> localID: 113750153625619822
> blockCommitSequenceId: 0
> }
> metadata {
> key: "TYPE"
> value: "KEY"
> }
> metadata {
> key: "incremental"
> }
> chunks {
> chunkName: "113750153625619822_chunk_0"
> offset: 0
> len: 1497757
> checksumData {
> type: CRC32
> bytesPerChecksum: 16384
> checksums: ".M]\274"
> checksums: "\341f@\350"
> checksums: "3\215@\243"
> checksums: "\027\220|\226"
> checksums: "xE8B"
> checksums: ",\300\263\233"
> checksums: "#\314\246x"
> checksums: "\313\220\211\362"
> checksums: "\337P6\004"
> checksums: "\351\334(\032"
> checksums: "l\315\005["
> checksums: "P\311\212\245"
> checksums: "\355R\361\235"
> checksums: "\256\341\206\304"
> checksums: "x\304\353\322"
> checksums: "q\257\337\027"
> checksums: "e\253\304\241"
> checksums: "Fy`6"
> checksums: "\351A\221\351"
> checksums: "\270\243\366T"
> checksums: "\246\264aN"
> checksums: "V`\033\003"
> checksums: " $F\214"
> .
> .
> .
> checksums: "D5\360\350"
> checksums: "\360w\314X"
> checksums: "\350\025\003\263"
> checksums: "\347\310\334\215"
> }
> }
> }
> eof: false
> }
> encodedToken:
> "VwoFaGJhc2USJWNvbklEOiAyMDYxIGxvY0lEOiAxMTM3NTAxNTM2MjU2MTk4MjIY5ILNz_AxKAEoAjCAgICAAToWCLud9efY8qvz5QEQlLemrcCwiumPASCvEZ-NqRpFuh6-H1ottQt1_14NiKrfTck8ZuC5FzTX6xBIRERTX0JMT0NLX1RPS0VOLGNvbklEOiAyMDYxIGxvY0lEOiAxMTM3NTAxNTM2MjU2MTk4MjIAAAAAAAAA"
> , data.size=0
> java.util.concurrent.CompletionException:
> org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException
> from Server 50954181-a303-4e2f-aca5-c70f235191f1@group-9CCD951DCB08:
> Container 2061 in CLOSED state
> at
> org.apache.ratis.client.impl.RaftClientImpl.handleRaftException(RaftClientImpl.java:374)
> at
> org.apache.ratis.client.impl.OrderedAsync.lambda$send$3(OrderedAsync.java:173)
> at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> at
> org.apache.ratis.client.impl.OrderedAsync$PendingOrderedRequest.setReply(OrderedAsync.java:99)
> at
> org.apache.ratis.client.impl.OrderedAsync$PendingOrderedRequest.setReply(OrderedAsync.java:60)
> at
> org.apache.ratis.util.SlidingWindow$RequestMap.setReply(SlidingWindow.java:144)
> at
> org.apache.ratis.util.SlidingWindow$Client.receiveReply(SlidingWindow.java:348)
> at
> org.apache.ratis.client.impl.OrderedAsync.lambda$sendRequest$8(OrderedAsync.java:243)
> at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.lambda$onNext$0(GrpcClientProtocolClient.java:322)
> at java.util.Optional.ifPresent(Optional.java:159)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.handleReplyFuture(GrpcClientProtocolClient.java:378)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.access$100(GrpcClientProtocolClient.java:300)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:322)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:305)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:468)
> at
> org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onMessage(ForwardingClientCallListener.java:33)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.DelayedClientCall$DelayedListener.onMessage(DelayedClientCall.java:473)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:660)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:647)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException
> from Server 50954181-a303-4e2f-aca5-c70f235191f1@group-9CCD951DCB08:
> Container 2061 in CLOSED state
> at
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.validateContainerCommand(HddsDispatcher.java:560)
> at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.startTransaction(ContainerStateMachine.java:415)
> at
> org.apache.ratis.server.impl.RaftServerImpl.writeAsync(RaftServerImpl.java:941)
> at
> org.apache.ratis.server.impl.RaftServerImpl.replyFuture(RaftServerImpl.java:919)
> at
> org.apache.ratis.server.impl.RaftServerImpl.submitClientRequestAsync(RaftServerImpl.java:896)
> at
> org.apache.ratis.server.impl.RaftServerImpl.lambda$null$11(RaftServerImpl.java:885)
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:117)
> at
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitClientRequestAsync$12(RaftServerImpl.java:885)
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
> ... 3 more
> Caused by:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException:
> Container 2061 in CLOSED state
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at
> org.apache.ratis.util.ReflectionUtils.instantiateException(ReflectionUtils.java:259)
> at
> org.apache.ratis.client.impl.ClientProtoUtils.toStateMachineException(ClientProtoUtils.java:449)
> at
> org.apache.ratis.client.impl.ClientProtoUtils.toStateMachineException(ClientProtoUtils.java:435)
> at
> org.apache.ratis.client.impl.ClientProtoUtils.toRaftClientReply(ClientProtoUtils.java:402)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:310)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:305)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:468)
> at
> org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onMessage(ForwardingClientCallListener.java:33)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.DelayedClientCall$DelayedListener.onMessage(DelayedClientCall.java:473)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:660)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:647)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
> ... 3 more
> 2024-04-22 08:05:30,039 WARN org.apache.ratis.grpc.GrpcUtil: Timed out
> gracefully shutting down connection:
> ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=191,
> target=10.140.52.141:9858}}. {code}
> And then the Master goes down:
> {code:java}
> 2024-04-22 08:08:02,026 ERROR org.apache.hadoop.hbase.master.HMaster: *****
> ABORTING master ccycloud-7.ozn-hb973chf3oz.xyz,22001,1713770648404: Log
> rolling failed *****
> java.lang.RuntimeException
> at
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALMetadata(AsyncProtobufLogWriter.java:217)
> at
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:223)
> at
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:164)
> at
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:116)
> at
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:726)
> at
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:129)
> at
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:886)
> at
> org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:304)
> at
> org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:211)
> 2024-04-22 08:08:02,034 INFO org.apache.ranger.plugin.util.PolicyRefresher:
> PolicyRefresher(serviceName=cm_hbase).run(): interrupted! Exiting thread
> java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
> at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at
> org.apache.ranger.plugin.util.PolicyRefresher.run(PolicyRefresher.java:208)
> 2024-04-22 08:08:02,037 INFO
> org.apache.ranger.audit.provider.AuditProviderFactory: ==>
> JVMShutdownHook.run() {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]