Pratyush Bhatt created HDDS-10700:
-------------------------------------

             Summary: [Hbase Ozone] Both HMaster down with "BlockOutputStream: 
Failed to commit BlockId"
                 Key: HDDS-10700
                 URL: https://issues.apache.org/jira/browse/HDDS-10700
             Project: Apache Ozone
          Issue Type: Bug
          Components: OM, SCM
            Reporter: Pratyush Bhatt


_DNs down in the cluster:_
vc0122.xyz, vc0130.xyz

Both HMasters aborts because of {_}3 way commit failed on the pipeline{_}, even 
though only one of the node in the pipeline is down.
Can see below that its trying to commit into a pipeline with nodes: vc0124.xyz, 
vc0123.xyz and vc0122.xyz, and only one of these i.e. vc0122.xyz is down. 

Error logs just before HMaster aborts:
{code:java}
2024-04-15 07:08:40,002 WARN org.apache.hadoop.hdds.scm.XceiverClientRatis: 3 
way commit failed on pipeline Pipeline[ Id: 
c3ac49c1-02c0-4d6c-aa2b-968ec9316380, Nodes: 
f7f57f95-7297-4a70-bc9a-bb2e8be573fc(vc0124.xyz/10.17.207.34)0246b7fe-2072-43f6-b2a3-aa6ca64cd94a(vc0123.xyz/10.17.207.33)0efdc82f-f19e-4bba-bc6b-ee61f973a24a(vc0122.xyz/10.17.207.32),
 ReplicationConfig: RATIS/THREE, State:OPEN, 
leaderId:0246b7fe-2072-43f6-b2a3-aa6ca64cd94a, 
CreationTimestamp2024-04-15T00:20:05.717-07:00[America/Los_Angeles]]
java.util.concurrent.ExecutionException: 
org.apache.ratis.protocol.exceptions.TimeoutIOException: 
client-188734B8DEB9->0246b7fe-2072-43f6-b2a3-aa6ca64cd94a request #41258 
timeout 180s
        at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
        at 
org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:279)
        at 
org.apache.hadoop.hdds.scm.storage.AbstractCommitWatcher.watchForCommit(AbstractCommitWatcher.java:142)
        at 
org.apache.hadoop.hdds.scm.storage.AbstractCommitWatcher.watchOnLastIndex(AbstractCommitWatcher.java:120)
        at 
org.apache.hadoop.hdds.scm.storage.RatisBlockOutputStream.sendWatchForCommit(RatisBlockOutputStream.java:107)
        at 
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:454)
        at 
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlushInternal(BlockOutputStream.java:634)
        at 
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:587)
        at 
org.apache.hadoop.hdds.scm.storage.RatisBlockOutputStream.hsync(RatisBlockOutputStream.java:139)
        at 
org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.hsync(BlockOutputStreamEntry.java:163)
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.handleStreamAction(KeyOutputStream.java:524)
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:487)
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:457)
        at 
org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
        at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:70)
        at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:65)
        at 
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
        at 
org.apache.hadoop.hbase.io.asyncfs.WrapperAsyncFSOutput.flush0(WrapperAsyncFSOutput.java:92)
        at 
org.apache.hadoop.hbase.io.asyncfs.WrapperAsyncFSOutput.lambda$flush$0(WrapperAsyncFSOutput.java:113)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ratis.protocol.exceptions.TimeoutIOException: 
client-188734B8DEB9->0246b7fe-2072-43f6-b2a3-aa6ca64cd94a request #41258 
timeout 180s
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.lambda$timeoutCheck$5(GrpcClientProtocolClient.java:374)
        at java.util.Optional.ifPresent(Optional.java:159)
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.handleReplyFuture(GrpcClientProtocolClient.java:378)
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.timeoutCheck(GrpcClientProtocolClient.java:373)
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.lambda$onNext$1(GrpcClientProtocolClient.java:362)
        at 
org.apache.ratis.util.TimeoutTimer.lambda$onTimeout$2(TimeoutTimer.java:101)
        at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:38)
        at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:78)
        at org.apache.ratis.util.TimeoutTimer$Task.run(TimeoutTimer.java:55)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
2024-04-15 07:08:40,018 INFO org.apache.hadoop.hdds.scm.XceiverClientRatis: 
Could not commit index 66131 on pipeline Pipeline[ Id: 
c3ac49c1-02c0-4d6c-aa2b-968ec9316380, Nodes: 
f7f57f95-7297-4a70-bc9a-bb2e8be573fc(vc0124.xyz/10.17.207.34)0246b7fe-2072-43f6-b2a3-aa6ca64cd94a(vc0123.xyz/10.17.207.33)0efdc82f-f19e-4bba-bc6b-ee61f973a24a(vc0122.xyz/10.17.207.32),
 ReplicationConfig: RATIS/THREE, State:OPEN, 
leaderId:0246b7fe-2072-43f6-b2a3-aa6ca64cd94a, 
CreationTimestamp2024-04-15T00:20:05.717-07:00[America/Los_Angeles]] to all the 
nodes. Server 0efdc82f-f19e-4bba-bc6b-ee61f973a24a has failed. Committed by 
majority.
2024-04-15 07:08:40,019 WARN 
org.apache.hadoop.hdds.scm.storage.BlockOutputStream: Failed to commit BlockId 
conID: 2018 locID: 113750153625631788 bcsId: 66131 on Pipeline[ Id: 
c3ac49c1-02c0-4d6c-aa2b-968ec9316380, Nodes: 
f7f57f95-7297-4a70-bc9a-bb2e8be573fc(vc0124.xyz/10.17.207.34)0246b7fe-2072-43f6-b2a3-aa6ca64cd94a(vc0123.xyz/10.17.207.33)0efdc82f-f19e-4bba-bc6b-ee61f973a24a(vc0122.xyz/10.17.207.32),
 ReplicationConfig: RATIS/THREE, State:OPEN, 
leaderId:0246b7fe-2072-43f6-b2a3-aa6ca64cd94a, 
CreationTimestamp2024-04-15T00:20:05.717-07:00[America/Los_Angeles]]. Failed 
nodes: [0efdc82f-f19e-4bba-bc6b-ee61f973a24a(null/null)]
2024-04-15 07:08:40,156 WARN org.apache.hadoop.ozone.client.io.KeyOutputStream: 
Encountered exception java.io.IOException: Unexpected Storage Container 
Exception: java.util.concurrent.CompletionException: 
java.util.concurrent.CompletionException: java.lang.NullPointerException on the 
pipeline Pipeline[ Id: c3ac49c1-02c0-4d6c-aa2b-968ec9316380, Nodes: 
f7f57f95-7297-4a70-bc9a-bb2e8be573fc(vc0124.xyz/10.17.207.34)0246b7fe-2072-43f6-b2a3-aa6ca64cd94a(vc0123.xyz/10.17.207.33)0efdc82f-f19e-4bba-bc6b-ee61f973a24a(vc0122.xyz/10.17.207.32),
 ReplicationConfig: RATIS/THREE, State:OPEN, 
leaderId:0246b7fe-2072-43f6-b2a3-aa6ca64cd94a, 
CreationTimestamp2024-04-15T00:20:05.717-07:00[America/Los_Angeles]]. The last 
committed block length is 83, uncommitted data length is 0 retry count 0
2024-04-15 07:08:40,169 ERROR org.apache.hadoop.hbase.wal.AbstractWALRoller: 
Log rolling failed
java.lang.RuntimeException
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALMetadata(AsyncProtobufLogWriter.java:217)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:223)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:164)
        at 
org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:116)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:726)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:129)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:886)
        at 
org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:304)
        at 
org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:211)
2024-04-15 07:08:40,234 ERROR org.apache.ratis.client.impl.OrderedAsync: 
client-546F8A3A61A4: Failed* 
RaftClientRequest:client-546F8A3A61A4->0efdc82f-f19e-4bba-bc6b-ee61f973a24a@group-534ED93B4EC4,
 cid=41261, seq=2768, RW, cmdType: WriteChunk
traceID: ""
containerID: 2008
datanodeUuid: "0efdc82f-f19e-4bba-bc6b-ee61f973a24a"
writeChunk {
  blockID {
    containerID: 2008
    localID: 113750153625630686
    blockCommitSequenceId: 256444
    replicaIndex: 0
  }
  chunkData {
    chunkName: "113750153625630686_chunk_1384"
    offset: 546232
    len: 8
    checksumData {
      type: CRC32
      bytesPerChecksum: 16384
      checksums: "[)E|"
    }
  }
}
encodedToken: 
"VwoFaGJhc2USJWNvbklEOiAyMDA4IGxvY0lEOiAxMTM3NTAxNTM2MjU2MzA2ODYYx9Gxue4xKAEoAjCAgICAAToWCIiEkcP56OT6iQEQ2c26m9bckYmpASBh2nTMHaIRg0MRIosCu98kApXmFzav-uYGKEQlKMWiIRBIRERTX0JMT0NLX1RPS0VOLGNvbklEOiAyMDA4IGxvY0lEOiAxMTM3NTAxNTM2MjU2MzA2ODYAAAAAAAAA"
, data.size=8
java.util.concurrent.CompletionException: 
org.apache.ratis.protocol.exceptions.AlreadyClosedException: 
client-546F8A3A61A4->0efdc82f-f19e-4bba-bc6b-ee61f973a24a is closed.
        at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
        at 
java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
        at 
java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
        at 
org.apache.ratis.client.impl.OrderedAsync.sendRequest(OrderedAsync.java:239)
        at 
org.apache.ratis.client.impl.OrderedAsync.sendRequestWithRetry(OrderedAsync.java:203)
        at 
org.apache.ratis.util.SlidingWindow$Client.sendOrDelayRequest(SlidingWindow.java:303)
        at 
org.apache.ratis.util.SlidingWindow$Client.submitNewRequest(SlidingWindow.java:291)
        at org.apache.ratis.client.impl.OrderedAsync.send(OrderedAsync.java:171)
        at org.apache.ratis.client.impl.AsyncImpl.send(AsyncImpl.java:41)
        at org.apache.ratis.client.impl.AsyncImpl.send(AsyncImpl.java:46)
        at 
org.apache.hadoop.hdds.scm.XceiverClientRatis.lambda$sendRequestAsync$2(XceiverClientRatis.java:237)
        at 
org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:159)
        at 
org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
        at 
org.apache.hadoop.hdds.scm.XceiverClientRatis.sendRequestAsync(XceiverClientRatis.java:222)
        at 
org.apache.hadoop.hdds.scm.XceiverClientRatis.sendCommandAsync(XceiverClientRatis.java:325)
        at 
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.writeChunkAsync(ContainerProtocolCalls.java:441)
        at 
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.writeChunkToContainer(BlockOutputStream.java:800)
        at 
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.writeChunk(BlockOutputStream.java:573)
        at 
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlushInternal(BlockOutputStream.java:620)
        at 
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:587)
        at 
org.apache.hadoop.hdds.scm.storage.RatisBlockOutputStream.hsync(RatisBlockOutputStream.java:139)
        at 
org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.hsync(BlockOutputStreamEntry.java:163)
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.handleStreamAction(KeyOutputStream.java:524)
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:487)
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:457)
        at 
org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
        at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:70)
        at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:65)
        at 
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
        at 
org.apache.hadoop.hbase.io.asyncfs.WrapperAsyncFSOutput.flush0(WrapperAsyncFSOutput.java:92)
        at 
org.apache.hadoop.hbase.io.asyncfs.WrapperAsyncFSOutput.lambda$flush$0(WrapperAsyncFSOutput.java:113)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ratis.protocol.exceptions.AlreadyClosedException: 
client-546F8A3A61A4->0efdc82f-f19e-4bba-bc6b-ee61f973a24a is closed. {code}
And just after this error HMaster aborts, exactly same pattern with the second 
HMaster too.
{code:java}
2024-04-15 07:08:45,184 ERROR org.apache.hadoop.hbase.master.HMaster: ***** 
ABORTING master vc0122.xyz,22001,1713187869079: Log rolling failed *****
java.lang.RuntimeException
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALMetadata(AsyncProtobufLogWriter.java:217)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:223)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:164)
        at 
org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:116)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:726)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:129)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:886)
        at 
org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:304)
        at 
org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:211)
2024-04-15 07:08:45,188 INFO org.apache.ranger.plugin.util.PolicyRefresher: 
PolicyRefresher(serviceName=cm_hbase).run(): interrupted! Exiting thread
java.lang.InterruptedException
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
        at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at 
org.apache.ranger.plugin.util.PolicyRefresher.run(PolicyRefresher.java:208) 
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to