Pratyush Bhatt created HDDS-11219:
-------------------------------------
Summary: [HBase Replication] RS and Master Nodes down with
"Waiting for one of pipelines to be OPEN failed"
Key: HDDS-11219
URL: https://issues.apache.org/jira/browse/HDDS-11219
Project: Apache Ozone
Issue Type: Bug
Components: SCM
Reporter: Pratyush Bhatt
*Scenario:* Bidirectional HBase replication, with HBase on Ozone on both
clusters.
After running for almost a day and transferring approximately 100 GB of data,
all RS and Master nodes of Cluster 2 went down.
The following error appeared in most of the stack traces of the failed roles.
Sample snippet from one of the RS:
{code:java}
java.io.IOException: INTERNAL_ERROR
org.apache.hadoop.ozone.om.exceptions.OMException: Unable to allocate a
container to the block of size: 268435456, replicationConfig: RATIS/THREE.
Waiting for one of pipelines to be OPEN failed. Pipeline
d12aa22f-4439-4321-98cc-e245280b88dd,ae3ea458-ab25-4fb3-a380-941bee9c1fdb,06c3b21c-9721-49e5-9b24-04a3836036d3,3b07bee5-d3e7-4cfd-ad8a-5f336f6cf53c,416f70bc-8f6d-4f56-827f-e672d56507b2
is not ready in 60000 ms
at
org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:241)
at
org.apache.hadoop.ozone.client.io.KeyOutputStream.handleRetry(KeyOutputStream.java:413)
at
org.apache.hadoop.ozone.client.io.KeyOutputStream.handleException(KeyOutputStream.java:358)
at
org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:496)
at
org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:461)
at
org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
at
org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
at
org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
at
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:80)
at
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:75)
at
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
at
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:84)
at
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:669)
Caused by: INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException:
Unable to allocate a container to the block of size: 268435456,
replicationConfig: RATIS/THREE. Waiting for one of pipelines to be OPEN failed.
Pipeline
d12aa22f-4439-4321-98cc-e245280b88dd,ae3ea458-ab25-4fb3-a380-941bee9c1fdb,06c3b21c-9721-49e5-9b24-04a3836036d3,3b07bee5-d3e7-4cfd-ad8a-5f336f6cf53c,416f70bc-8f6d-4f56-827f-e672d56507b2
is not ready in 60000 ms
at
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:755)
at
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleSubmitRequestAndSCMSafeModeRetry(OzoneManagerProtocolClientSideTranslatorPB.java:2328)
at
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:791)
at
org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateNewBlock(BlockOutputStreamEntryPool.java:303)
at
org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateBlockIfNeeded(BlockOutputStreamEntryPool.java:397)
at
org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:220)
... 12 more
2024-07-19 19:08:18,913 ERROR
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for
region newtableloadtest,09999999,1721297943018.441fd24db6169f1d4c5ad7112b27d3b8.
org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append
sequenceId=298963, requesting roll of WAL
at
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1208)
at
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1081)
at
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:982)
at
com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
at
com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: : Stream is closed! Key:
hbase/WALs/ccycloud-2.ozn-hbaserepl2.xyz,22101,1721293279189/ccycloud-2.ozn-hbaserepl2.xyz%2C22101%2C1721293279189.ccycloud-2.ozn-hbaserepl2.xyz%2C22101%2C1721293279189.regiongroup-0.1721415945748
at
org.apache.hadoop.ozone.client.io.KeyOutputStream.checkNotClosed(KeyOutputStream.java:736)
at
org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:200)
at
org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:94)
at
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.lambda$write$1(OzoneFSOutputStream.java:58)
at
org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
at
org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
at
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:54)
{code}
On the SCM leader, logs like the following were seen:
{code:java}
2024-07-19 19:07:52,242 ERROR [IPC Server handler 81 on
9863]-org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider:
Unable to allocate a block for the size: 268435456, repConfig: RATIS/THREE
2024-07-19 19:08:01,783 INFO
[node3-EventQueue-PipelineReportForPipelineReportHandler]-org.apache.hadoop.hdds.scm.pipeline.PipelineReportHandler:
Reported pipeline PipelineID=770772b8-ea18-4ca4-a5f7-76ceb53a8c01 is not found
2024-07-19 19:08:01,784 INFO [IPC Server handler 99 on
9860]-org.apache.hadoop.ipc.Server: IPC Server handler 99 on 9860, call
Call#3336 Retry#0
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocol.submitRequest
from 10.140.86.142:60838
org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException:
PipelineID=770772b8-ea18-4ca4-a5f7-76ceb53a8c01 not found
at
org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:151)
at
org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerImpl.getPipeline(PipelineStateManagerImpl.java:138)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeLocal(SCMHAInvocationHandler.java:92)
at
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:75)
at com.sun.proxy.$Proxy25.getPipeline(Unknown Source)
at
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.getPipeline(PipelineManagerImpl.java:335)
at
org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.getPipeline(SCMClientProtocolServer.java:761)
at
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.getPipeline(StorageContainerLocationProtocolServerSideTranslatorPB.java:960)
at
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:607)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
at
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:232)
at
org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
{code}
cc: [~weichiu] [~sammichen] [~ashishk]