[
https://issues.apache.org/jira/browse/HDDS-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854921#comment-17854921
]
Wei-Chiu Chuang commented on HDDS-10978:
----------------------------------------
Yes, we have now deployed an updated configuration:
hdds.ratis.client.multilinear.random.retry.policy = 5s, 5
and that seems to avoid the problem.
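
For a rough sense of what this setting implies, here is a minimal sketch that adds up the nominal sleep time of such a policy string. It assumes the value is read as comma-separated (sleep time, number of retries) pairs, which is what the property name and the "5s, 5" value suggest. The helper names parse_policy and nominal_sleep_ms are hypothetical and not part of Ozone or Ratis. The real client presumably randomizes each sleep, and per-request timeouts come on top, so treat the result as an estimate only.

```python
import re

# Milliseconds per unit suffix. The set of accepted suffixes is an
# assumption made for this sketch, not taken from Ozone or Ratis.
_UNITS_MS = {"ms": 1, "s": 1000, "m": 60_000, "min": 60_000}


def parse_policy(spec):
    """Parse 'sleep, retries, sleep, retries, ...' into (sleep_ms, retries) pairs."""
    tokens = [t.strip() for t in spec.split(",") if t.strip()]
    if len(tokens) % 2 != 0:
        raise ValueError("expected (sleep, retries) pairs, got: %r" % spec)
    pairs = []
    for sleep, retries in zip(tokens[0::2], tokens[1::2]):
        match = re.fullmatch(r"(\d+)\s*(ms|s|min|m)", sleep)
        if match is None:
            raise ValueError("unrecognized sleep time: %r" % sleep)
        sleep_ms = int(match.group(1)) * _UNITS_MS[match.group(2)]
        pairs.append((sleep_ms, int(retries)))
    return pairs


def nominal_sleep_ms(spec):
    """Sum of sleep * retries over all pairs, ignoring any randomization."""
    return sum(sleep_ms * retries for sleep_ms, retries in parse_policy(spec))


if __name__ == "__main__":
    budget = nominal_sleep_ms("5s, 5")
    wal_sync_timeout_ms = 300000  # from the WALSyncTimeoutIOException in the description
    print(budget, budget < wal_sync_timeout_ms)
```

Under these assumptions, "5s, 5" gives a nominal 25000 ms of retry sleep, well below the 300000 ms WAL sync timeout reported in the description. One plausible reading is that the client now gives up on the failed pipeline before HBase aborts the region server.
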
> [Hbase Ozone] Regionserver down due to DN crash in Pipeline
> -----------------------------------------------------------
>
> Key: HDDS-10978
> URL: https://issues.apache.org/jira/browse/HDDS-10978
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Pratyush Bhatt
> Priority: Major
>
> Two of the DNs (DN-1, DN-2) were abruptly shut down in the cluster; the RS
> then went down with "3 way commit failed on pipeline".
> This cluster has:
>
> {code:java}
> ozone.client.stream.putblock.piggybacking = true
> ozone.client.incremental.chunk.list = true
> hdds.ratis.raft.client.rpc.watch.request.timeout=10s{code}
> The RS keeps retrying until the timeout, i.e. 5 minutes.
> Related logs:
> {code:java}
> 2024-06-03 17:00:48,583 WARN org.apache.hadoop.hdds.scm.XceiverClientRatis: 3
> way commit failed on pipeline Pipeline[ Id:
> 36a0af6d-5955-41a5-ba03-57de0b16d45f, Nodes:
> 882ad4eb-04f9-418e-9ea6-0802b19beade(DN-1/10.17.207.46)88b461a6-88b9-4635-8184-ede38ec14eef(DN-2/10.17.207.42)0d46c800-57c7-4a0d-b746-7e901ab77fa6(DN-3/10.17.207.49),
> ReplicationConfig: RATIS/THREE, State:OPEN,
> leaderId:882ad4eb-04f9-418e-9ea6-0802b19beade,
> CreationTimestamp2024-06-02T22:50:59.940-07:00[America/Los_Angeles]]
> java.util.concurrent.ExecutionException:
> org.apache.ratis.protocol.exceptions.TimeoutIOException:
> client-555D064DF2EB->882ad4eb-04f9-418e-9ea6-0802b19beade request #3498690
> timeout 10s
> at
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> at
> org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:280)
> at
> org.apache.hadoop.hdds.scm.storage.AbstractCommitWatcher.watchForCommit(AbstractCommitWatcher.java:142)
> at
> org.apache.hadoop.hdds.scm.storage.AbstractCommitWatcher.watchOnLastIndex(AbstractCommitWatcher.java:120)
> at
> org.apache.hadoop.hdds.scm.storage.RatisBlockOutputStream.sendWatchForCommit(RatisBlockOutputStream.java:107)
> at
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:457)
> at
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlushInternal(BlockOutputStream.java:637)
> at
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:590)
> at
> org.apache.hadoop.hdds.scm.storage.RatisBlockOutputStream.hsync(RatisBlockOutputStream.java:139)
> at
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.hsync(BlockOutputStreamEntry.java:163)
> at
> org.apache.hadoop.ozone.client.io.KeyOutputStream.handleStreamAction(KeyOutputStream.java:524)
> at
> org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:487)
> at
> org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:457)
> at
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
> at
> org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
> at
> org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
> at
> org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:80)
> at
> org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:75)
> at
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:84)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:669)
> Caused by: org.apache.ratis.protocol.exceptions.TimeoutIOException:
> client-555D064DF2EB->882ad4eb-04f9-418e-9ea6-0802b19beade request #3498690
> timeout 10s
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.lambda$timeoutCheck$5(GrpcClientProtocolClient.java:374)
> at java.util.Optional.ifPresent(Optional.java:159)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.handleReplyFuture(GrpcClientProtocolClient.java:378)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.timeoutCheck(GrpcClientProtocolClient.java:373)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.lambda$onNext$1(GrpcClientProtocolClient.java:362)
> at
> org.apache.ratis.util.TimeoutTimer.lambda$onTimeout$2(TimeoutTimer.java:101)
> at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:38)
> at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:78)
> at org.apache.ratis.util.TimeoutTimer$Task.run(TimeoutTimer.java:55)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> 2024-06-03 17:00:48,585 INFO org.apache.hadoop.hdds.scm.XceiverClientRatis:
> Could not commit index 3637839 on pipeline Pipeline[ Id:
> 36a0af6d-5955-41a5-ba03-57de0b16d45f, Nodes:
> 882ad4eb-04f9-418e-9ea6-0802b19beade(DN-1/10.17.207.46)88b461a6-88b9-4635-8184-ede38ec14eef(DN-2/10.17.207.42)0d46c800-57c7-4a0d-b746-7e901ab77fa6(DN-3/10.17.207.49),
> ReplicationConfig: RATIS/THREE, State:OPEN,
> leaderId:882ad4eb-04f9-418e-9ea6-0802b19beade,
> CreationTimestamp2024-06-02T22:50:59.940-07:00[America/Los_Angeles]] to all
> the nodes. Server 88b461a6-88b9-4635-8184-ede38ec14eef has failed. Committed
> by majority. {code}
> After all retries are exhausted (after 5 minutes), it aborts:
> {code:java}
> 2024-06-03 17:06:28,955 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: ***** ABORTING region
> server RS-8,22101,1717394799821: WAL sync timeout,forcing server shutdown
> *****
> org.apache.hadoop.hbase.regionserver.wal.WALSyncTimeoutIOException:
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync
> result after 300000 ms for txid=1277245, WAL system stuck?
> at
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:848)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:809)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:858)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.sync(HRegion.java:8900)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(HRegion.java:8467)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4521)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4445)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4375)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4851)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4845)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4841)
> at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:3153)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.put(RSRpcServices.java:3006)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2969)
> at
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45947)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:387)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:139)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
> Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to
> get sync result after 300000 ms for txid=1277245, WAL system stuck?
> at
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:171)
> at
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:844)
> {code}
> cc: [~ashishk] [~weichiu] [~Sammi]