Pratyush Bhatt created HDDS-10700:
-------------------------------------
Summary: [Hbase Ozone] Both HMaster down with "BlockOutputStream: Failed to commit BlockId"
Key: HDDS-10700
URL: https://issues.apache.org/jira/browse/HDDS-10700
Project: Apache Ozone
Issue Type: Bug
Components: OM, SCM
Reporter: Pratyush Bhatt
_DNs down in the cluster:_ vc0122.xyz, vc0130.xyz
Both HMasters abort because of a {_}3 way commit failed on the pipeline{_} error, even
though only one of the nodes in the pipeline is down.
As seen below, the client is trying to commit to a pipeline with nodes vc0124.xyz,
vc0123.xyz, and vc0122.xyz, of which only vc0122.xyz is down.
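For context on the {_}3 way commit{_} message: on flush the Ozone client watches the Ratis commit index, first at ALL_COMMITTED and, if that times out, again at MAJORITY_COMMITTED. Below is a minimal sketch of that fallback, assuming the Ratis async {{watch()}} API; the class and method shapes are illustrative, not the actual {{XceiverClientRatis}} source:
{code:java}
import java.util.concurrent.ExecutionException;

import org.apache.ratis.client.RaftClient;
import org.apache.ratis.proto.RaftProtos.ReplicationLevel;
import org.apache.ratis.protocol.RaftClientReply;

// Illustrative sketch of the fallback performed by
// org.apache.hadoop.hdds.scm.XceiverClientRatis#watchForCommit (not the
// actual source). `client` is a Ratis RaftClient for the RATIS/THREE
// pipeline from the logs below.
final class CommitWatchSketch {
  static RaftClientReply watchForCommit(RaftClient client, long index)
      throws Exception {
    try {
      // Wait until every replica in the pipeline has applied `index`.
      return client.async().watch(index, ReplicationLevel.ALL_COMMITTED).get();
    } catch (ExecutionException e) {
      // With one datanode down (vc0122.xyz here), the ALL_COMMITTED watch
      // times out (TimeoutIOException after 180s) and the client logs
      // "3 way commit failed on pipeline ...". It then degrades to a
      // majority watch, which succeeds: "Committed by majority."
      return client.async()
          .watch(index, ReplicationLevel.MAJORITY_COMMITTED).get();
    }
  }
}
{code}
In the logs below the majority watch does succeed ({_}Committed by majority{_}), yet the write path still fails afterwards with a NullPointerException while handling the failed node, which is reported as {{(null/null)}}.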
Error logs just before the HMaster aborts:
{code:java}
2024-04-15 07:08:40,002 WARN org.apache.hadoop.hdds.scm.XceiverClientRatis: 3 way commit failed on pipeline Pipeline[ Id: c3ac49c1-02c0-4d6c-aa2b-968ec9316380, Nodes: f7f57f95-7297-4a70-bc9a-bb2e8be573fc(vc0124.xyz/10.17.207.34)0246b7fe-2072-43f6-b2a3-aa6ca64cd94a(vc0123.xyz/10.17.207.33)0efdc82f-f19e-4bba-bc6b-ee61f973a24a(vc0122.xyz/10.17.207.32), ReplicationConfig: RATIS/THREE, State:OPEN, leaderId:0246b7fe-2072-43f6-b2a3-aa6ca64cd94a, CreationTimestamp2024-04-15T00:20:05.717-07:00[America/Los_Angeles]]
java.util.concurrent.ExecutionException: org.apache.ratis.protocol.exceptions.TimeoutIOException: client-188734B8DEB9->0246b7fe-2072-43f6-b2a3-aa6ca64cd94a request #41258 timeout 180s
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
    at org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:279)
    at org.apache.hadoop.hdds.scm.storage.AbstractCommitWatcher.watchForCommit(AbstractCommitWatcher.java:142)
    at org.apache.hadoop.hdds.scm.storage.AbstractCommitWatcher.watchOnLastIndex(AbstractCommitWatcher.java:120)
    at org.apache.hadoop.hdds.scm.storage.RatisBlockOutputStream.sendWatchForCommit(RatisBlockOutputStream.java:107)
    at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:454)
    at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlushInternal(BlockOutputStream.java:634)
    at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:587)
    at org.apache.hadoop.hdds.scm.storage.RatisBlockOutputStream.hsync(RatisBlockOutputStream.java:139)
    at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.hsync(BlockOutputStreamEntry.java:163)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleStreamAction(KeyOutputStream.java:524)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:487)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:457)
    at org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
    at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:70)
    at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:65)
    at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
    at org.apache.hadoop.hbase.io.asyncfs.WrapperAsyncFSOutput.flush0(WrapperAsyncFSOutput.java:92)
    at org.apache.hadoop.hbase.io.asyncfs.WrapperAsyncFSOutput.lambda$flush$0(WrapperAsyncFSOutput.java:113)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ratis.protocol.exceptions.TimeoutIOException: client-188734B8DEB9->0246b7fe-2072-43f6-b2a3-aa6ca64cd94a request #41258 timeout 180s
    at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.lambda$timeoutCheck$5(GrpcClientProtocolClient.java:374)
    at java.util.Optional.ifPresent(Optional.java:159)
    at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.handleReplyFuture(GrpcClientProtocolClient.java:378)
    at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.timeoutCheck(GrpcClientProtocolClient.java:373)
    at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.lambda$onNext$1(GrpcClientProtocolClient.java:362)
    at org.apache.ratis.util.TimeoutTimer.lambda$onTimeout$2(TimeoutTimer.java:101)
    at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:38)
    at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:78)
    at org.apache.ratis.util.TimeoutTimer$Task.run(TimeoutTimer.java:55)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
2024-04-15 07:08:40,018 INFO org.apache.hadoop.hdds.scm.XceiverClientRatis: Could not commit index 66131 on pipeline Pipeline[ Id: c3ac49c1-02c0-4d6c-aa2b-968ec9316380, Nodes: f7f57f95-7297-4a70-bc9a-bb2e8be573fc(vc0124.xyz/10.17.207.34)0246b7fe-2072-43f6-b2a3-aa6ca64cd94a(vc0123.xyz/10.17.207.33)0efdc82f-f19e-4bba-bc6b-ee61f973a24a(vc0122.xyz/10.17.207.32), ReplicationConfig: RATIS/THREE, State:OPEN, leaderId:0246b7fe-2072-43f6-b2a3-aa6ca64cd94a, CreationTimestamp2024-04-15T00:20:05.717-07:00[America/Los_Angeles]] to all the nodes. Server 0efdc82f-f19e-4bba-bc6b-ee61f973a24a has failed. Committed by majority.
2024-04-15 07:08:40,019 WARN org.apache.hadoop.hdds.scm.storage.BlockOutputStream: Failed to commit BlockId conID: 2018 locID: 113750153625631788 bcsId: 66131 on Pipeline[ Id: c3ac49c1-02c0-4d6c-aa2b-968ec9316380, Nodes: f7f57f95-7297-4a70-bc9a-bb2e8be573fc(vc0124.xyz/10.17.207.34)0246b7fe-2072-43f6-b2a3-aa6ca64cd94a(vc0123.xyz/10.17.207.33)0efdc82f-f19e-4bba-bc6b-ee61f973a24a(vc0122.xyz/10.17.207.32), ReplicationConfig: RATIS/THREE, State:OPEN, leaderId:0246b7fe-2072-43f6-b2a3-aa6ca64cd94a, CreationTimestamp2024-04-15T00:20:05.717-07:00[America/Los_Angeles]]. Failed nodes: [0efdc82f-f19e-4bba-bc6b-ee61f973a24a(null/null)]
2024-04-15 07:08:40,156 WARN org.apache.hadoop.ozone.client.io.KeyOutputStream: Encountered exception java.io.IOException: Unexpected Storage Container Exception: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NullPointerException on the pipeline Pipeline[ Id: c3ac49c1-02c0-4d6c-aa2b-968ec9316380, Nodes: f7f57f95-7297-4a70-bc9a-bb2e8be573fc(vc0124.xyz/10.17.207.34)0246b7fe-2072-43f6-b2a3-aa6ca64cd94a(vc0123.xyz/10.17.207.33)0efdc82f-f19e-4bba-bc6b-ee61f973a24a(vc0122.xyz/10.17.207.32), ReplicationConfig: RATIS/THREE, State:OPEN, leaderId:0246b7fe-2072-43f6-b2a3-aa6ca64cd94a, CreationTimestamp2024-04-15T00:20:05.717-07:00[America/Los_Angeles]]. The last committed block length is 83, uncommitted data length is 0 retry count 0
2024-04-15 07:08:40,169 ERROR org.apache.hadoop.hbase.wal.AbstractWALRoller: Log rolling failed
java.lang.RuntimeException
    at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALMetadata(AsyncProtobufLogWriter.java:217)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:223)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:164)
    at org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:116)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:726)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:129)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:886)
    at org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:304)
    at org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:211)
2024-04-15 07:08:40,234 ERROR org.apache.ratis.client.impl.OrderedAsync: client-546F8A3A61A4: Failed* RaftClientRequest:client-546F8A3A61A4->0efdc82f-f19e-4bba-bc6b-ee61f973a24a@group-534ED93B4EC4, cid=41261, seq=2768, RW, cmdType: WriteChunk
traceID: ""
containerID: 2008
datanodeUuid: "0efdc82f-f19e-4bba-bc6b-ee61f973a24a"
writeChunk {
  blockID {
    containerID: 2008
    localID: 113750153625630686
    blockCommitSequenceId: 256444
    replicaIndex: 0
  }
  chunkData {
    chunkName: "113750153625630686_chunk_1384"
    offset: 546232
    len: 8
    checksumData {
      type: CRC32
      bytesPerChecksum: 16384
      checksums: "[)E|"
    }
  }
}
encodedToken: "VwoFaGJhc2USJWNvbklEOiAyMDA4IGxvY0lEOiAxMTM3NTAxNTM2MjU2MzA2ODYYx9Gxue4xKAEoAjCAgICAAToWCIiEkcP56OT6iQEQ2c26m9bckYmpASBh2nTMHaIRg0MRIosCu98kApXmFzav-uYGKEQlKMWiIRBIRERTX0JMT0NLX1RPS0VOLGNvbklEOiAyMDA4IGxvY0lEOiAxMTM3NTAxNTM2MjU2MzA2ODYAAAAAAAAA", data.size=8
java.util.concurrent.CompletionException: org.apache.ratis.protocol.exceptions.AlreadyClosedException: client-546F8A3A61A4->0efdc82f-f19e-4bba-bc6b-ee61f973a24a is closed.
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
    at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
    at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
    at org.apache.ratis.client.impl.OrderedAsync.sendRequest(OrderedAsync.java:239)
    at org.apache.ratis.client.impl.OrderedAsync.sendRequestWithRetry(OrderedAsync.java:203)
    at org.apache.ratis.util.SlidingWindow$Client.sendOrDelayRequest(SlidingWindow.java:303)
    at org.apache.ratis.util.SlidingWindow$Client.submitNewRequest(SlidingWindow.java:291)
    at org.apache.ratis.client.impl.OrderedAsync.send(OrderedAsync.java:171)
    at org.apache.ratis.client.impl.AsyncImpl.send(AsyncImpl.java:41)
    at org.apache.ratis.client.impl.AsyncImpl.send(AsyncImpl.java:46)
    at org.apache.hadoop.hdds.scm.XceiverClientRatis.lambda$sendRequestAsync$2(XceiverClientRatis.java:237)
    at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:159)
    at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
    at org.apache.hadoop.hdds.scm.XceiverClientRatis.sendRequestAsync(XceiverClientRatis.java:222)
    at org.apache.hadoop.hdds.scm.XceiverClientRatis.sendCommandAsync(XceiverClientRatis.java:325)
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.writeChunkAsync(ContainerProtocolCalls.java:441)
    at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.writeChunkToContainer(BlockOutputStream.java:800)
    at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.writeChunk(BlockOutputStream.java:573)
    at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlushInternal(BlockOutputStream.java:620)
    at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:587)
    at org.apache.hadoop.hdds.scm.storage.RatisBlockOutputStream.hsync(RatisBlockOutputStream.java:139)
    at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.hsync(BlockOutputStreamEntry.java:163)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleStreamAction(KeyOutputStream.java:524)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:487)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:457)
    at org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
    at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:70)
    at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:65)
    at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
    at org.apache.hadoop.hbase.io.asyncfs.WrapperAsyncFSOutput.flush0(WrapperAsyncFSOutput.java:92)
    at org.apache.hadoop.hbase.io.asyncfs.WrapperAsyncFSOutput.lambda$flush$0(WrapperAsyncFSOutput.java:113)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ratis.protocol.exceptions.AlreadyClosedException: client-546F8A3A61A4->0efdc82f-f19e-4bba-bc6b-ee61f973a24a is closed.
{code}
Just after this error the HMaster aborts; the second HMaster fails with exactly the
same pattern.
{code:java}
2024-04-15 07:08:45,184 ERROR org.apache.hadoop.hbase.master.HMaster: ***** ABORTING master vc0122.xyz,22001,1713187869079: Log rolling failed *****
java.lang.RuntimeException
    at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALMetadata(AsyncProtobufLogWriter.java:217)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:223)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:164)
    at org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:116)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:726)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:129)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:886)
    at org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:304)
    at org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:211)
2024-04-15 07:08:45,188 INFO org.apache.ranger.plugin.util.PolicyRefresher: PolicyRefresher(serviceName=cm_hbase).run(): interrupted! Exiting thread
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.ranger.plugin.util.PolicyRefresher.run(PolicyRefresher.java:208)
{code}
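The abort itself is expected HBase behavior: the WAL roller treats any exception escaping a roll attempt as fatal and aborts the owning server. A simplified sketch of that path, assuming HBase's {{Abortable}} interface; {{WalRollerSketch}} and {{rollWal()}} are illustrative stand-ins, not the actual {{AbstractWALRoller}} source:
{code:java}
import org.apache.hadoop.hbase.Abortable;

// Simplified, illustrative sketch of the abort path in
// org.apache.hadoop.hbase.wal.AbstractWALRoller: an exception thrown
// while rolling the WAL aborts the owning server.
final class WalRollerSketch implements Runnable {
  private final Abortable server; // the HMaster in this incident

  WalRollerSketch(Abortable server) {
    this.server = server;
  }

  @Override
  public void run() {
    try {
      rollWal(); // in HBase: AbstractFSWAL.rollWriter() -> createWriterInstance()
    } catch (Exception e) {
      // Produces the log pair seen above: "Log rolling failed" from the
      // roller, then "***** ABORTING master ... Log rolling failed *****".
      server.abort("Log rolling failed", e);
    }
  }

  private void rollWal() throws Exception {
    // Stand-in for the real roll; in this incident the new writer's
    // writeWALMetadata() threw RuntimeException because the underlying
    // Ozone stream could not be flushed.
  }
}
{code}
So any persistent flush failure on the WAL stream, like the commit failure above, translates directly into a master abort even though the pipeline still had a healthy majority.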