[ 
https://issues.apache.org/jira/browse/HDDS-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854921#comment-17854921
 ] 

Wei-Chiu Chuang commented on HDDS-10978:
----------------------------------------

Yes, we now deploy an updated configuration:

hdds.ratis.client.multilinear.random.retry.policy = 5s, 5

and that seems to avoid the problem.
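
For readers unfamiliar with the setting, here is a rough sketch of why a short policy might help. It assumes the value is read as comma-separated (sleep duration, retry count) pairs, so "5s, 5" would mean up to 5 retries with a nominal 5 s sleep each. The class and method names below are hypothetical and not part of Ozone or Ratis. The real policy randomizes each sleep, and per-request timeouts add to the wall-clock time, so the figure is only a nominal estimate. The 300000 ms value is the WAL sync timeout shown in the quoted HBase log.

```java
public class RetryBudget {

    // Parses a duration such as "5s" into seconds.
    // Assumption: only the "s" suffix is handled in this sketch.
    public static long parseSeconds(String token) {
        String t = token.trim();
        if (!t.endsWith("s") || t.endsWith("ms")) {
            throw new IllegalArgumentException("unsupported duration: " + token);
        }
        return Long.parseLong(t.substring(0, t.length() - 1));
    }

    // Sums sleep * retries over every (sleep, retries) pair in the policy string.
    // Assumption: the policy is a flat list of such pairs, e.g. "5s, 5".
    public static long nominalSleepSeconds(String policy) {
        String[] parts = policy.split(",");
        if (parts.length % 2 != 0) {
            throw new IllegalArgumentException("expected (sleep, retries) pairs: " + policy);
        }
        long total = 0;
        for (int i = 0; i < parts.length; i += 2) {
            long sleep = parseSeconds(parts[i]);
            int retries = Integer.parseInt(parts[i + 1].trim());
            total += sleep * retries;
        }
        return total;
    }

    public static void main(String[] args) {
        // WAL sync timeout taken from the HBase log in the issue description.
        long walSyncTimeoutSeconds = 300_000 / 1000;
        long budget = nominalSleepSeconds("5s, 5");
        System.out.println("nominal retry sleep: " + budget + "s");
        System.out.println("below WAL sync timeout: " + (budget < walSyncTimeoutSeconds));
    }
}
```

Under these assumptions the client would give up on the failed pipeline well before the region server's 5-minute WAL sync timeout, which is consistent with the behaviour reported above.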

> [Hbase Ozone] Regionserver down due to DN crash in Pipeline
> -----------------------------------------------------------
>
>                 Key: HDDS-10978
>                 URL: https://issues.apache.org/jira/browse/HDDS-10978
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Pratyush Bhatt
>            Priority: Major
>
> Two of the DNs (DN-1, DN-2) were abruptly shut down in the cluster; the RS 
> then went down with "3 way commit failed on pipeline".
> This cluster has:
>  
> {code:java}
> ozone.client.stream.putblock.piggybacking = true 
> ozone.client.incremental.chunk.list = true
> hdds.ratis.raft.client.rpc.watch.request.timeout=10s{code}
> The RS keeps retrying until the timeout, i.e. 5 minutes.
> Related logs:
> {code:java}
> 2024-06-03 17:00:48,583 WARN org.apache.hadoop.hdds.scm.XceiverClientRatis: 3 
> way commit failed on pipeline Pipeline[ Id: 
> 36a0af6d-5955-41a5-ba03-57de0b16d45f, Nodes: 
> 882ad4eb-04f9-418e-9ea6-0802b19beade(DN-1/10.17.207.46)88b461a6-88b9-4635-8184-ede38ec14eef(DN-2/10.17.207.42)0d46c800-57c7-4a0d-b746-7e901ab77fa6(DN-3/10.17.207.49),
>  ReplicationConfig: RATIS/THREE, State:OPEN, 
> leaderId:882ad4eb-04f9-418e-9ea6-0802b19beade, 
> CreationTimestamp2024-06-02T22:50:59.940-07:00[America/Los_Angeles]]
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.protocol.exceptions.TimeoutIOException: 
> client-555D064DF2EB->882ad4eb-04f9-418e-9ea6-0802b19beade request #3498690 
> timeout 10s
>         at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>         at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>         at 
> org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:280)
>         at 
> org.apache.hadoop.hdds.scm.storage.AbstractCommitWatcher.watchForCommit(AbstractCommitWatcher.java:142)
>         at 
> org.apache.hadoop.hdds.scm.storage.AbstractCommitWatcher.watchOnLastIndex(AbstractCommitWatcher.java:120)
>         at 
> org.apache.hadoop.hdds.scm.storage.RatisBlockOutputStream.sendWatchForCommit(RatisBlockOutputStream.java:107)
>         at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:457)
>         at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlushInternal(BlockOutputStream.java:637)
>         at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:590)
>         at 
> org.apache.hadoop.hdds.scm.storage.RatisBlockOutputStream.hsync(RatisBlockOutputStream.java:139)
>         at 
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.hsync(BlockOutputStreamEntry.java:163)
>         at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.handleStreamAction(KeyOutputStream.java:524)
>         at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:487)
>         at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:457)
>         at 
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
>         at 
> org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
>         at 
> org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
>         at 
> org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:80)
>         at 
> org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:75)
>         at 
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:84)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:669)
> Caused by: org.apache.ratis.protocol.exceptions.TimeoutIOException: 
> client-555D064DF2EB->882ad4eb-04f9-418e-9ea6-0802b19beade request #3498690 
> timeout 10s
>         at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.lambda$timeoutCheck$5(GrpcClientProtocolClient.java:374)
>         at java.util.Optional.ifPresent(Optional.java:159)
>         at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.handleReplyFuture(GrpcClientProtocolClient.java:378)
>         at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.timeoutCheck(GrpcClientProtocolClient.java:373)
>         at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.lambda$onNext$1(GrpcClientProtocolClient.java:362)
>         at 
> org.apache.ratis.util.TimeoutTimer.lambda$onTimeout$2(TimeoutTimer.java:101)
>         at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:38)
>         at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:78)
>         at org.apache.ratis.util.TimeoutTimer$Task.run(TimeoutTimer.java:55)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
> 2024-06-03 17:00:48,585 INFO org.apache.hadoop.hdds.scm.XceiverClientRatis: 
> Could not commit index 3637839 on pipeline Pipeline[ Id: 
> 36a0af6d-5955-41a5-ba03-57de0b16d45f, Nodes: 
> 882ad4eb-04f9-418e-9ea6-0802b19beade(DN-1/10.17.207.46)88b461a6-88b9-4635-8184-ede38ec14eef(DN-2/10.17.207.42)0d46c800-57c7-4a0d-b746-7e901ab77fa6(DN-3/10.17.207.49),
>  ReplicationConfig: RATIS/THREE, State:OPEN, 
> leaderId:882ad4eb-04f9-418e-9ea6-0802b19beade, 
> CreationTimestamp2024-06-02T22:50:59.940-07:00[America/Los_Angeles]] to all 
> the nodes. Server 88b461a6-88b9-4635-8184-ede38ec14eef has failed. Committed 
> by majority. {code}
> After all retries are exhausted (after 5 minutes), the RS aborts:
> {code:java}
> 2024-06-03 17:06:28,955 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: ***** ABORTING region 
> server RS-8,22101,1717394799821: WAL sync timeout,forcing server shutdown 
> *****
> org.apache.hadoop.hbase.regionserver.wal.WALSyncTimeoutIOException: 
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 300000 ms for txid=1277245, WAL system stuck?
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:848)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:809)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:858)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.sync(HRegion.java:8900)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(HRegion.java:8467)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4521)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4445)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4375)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4851)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4845)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4841)
>         at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:3153)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.put(RSRpcServices.java:3006)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2969)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45947)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:387)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:139)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
> Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 300000 ms for txid=1277245, WAL system stuck?
>         at 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:171)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:844)
>  {code}
> cc: [~ashishk] [~weichiu] [~Sammi] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
