Pratyush Bhatt created HDDS-11219:
-------------------------------------

             Summary: [HBase Replication] RS and Master Nodes down with "Waiting for one of pipelines to be OPEN failed"
                 Key: HDDS-11219
                 URL: https://issues.apache.org/jira/browse/HDDS-11219
             Project: Apache Ozone
          Issue Type: Bug
          Components: SCM
            Reporter: Pratyush Bhatt


*Scenario:* Bidirectional HBase replication, with HBase on Ozone on both 
clusters.

After running for almost a day and transferring approximately 100 GB of data, 
all RS and Master nodes of Cluster 2 went down.
The error below appeared in most of the stack traces of the failed roles. 
Sample snippet from one of the RS:
{code:java}
java.io.IOException: INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: Unable to allocate a container to the block of size: 268435456, replicationConfig: RATIS/THREE. Waiting for one of pipelines to be OPEN failed. Pipeline d12aa22f-4439-4321-98cc-e245280b88dd,ae3ea458-ab25-4fb3-a380-941bee9c1fdb,06c3b21c-9721-49e5-9b24-04a3836036d3,3b07bee5-d3e7-4cfd-ad8a-5f336f6cf53c,416f70bc-8f6d-4f56-827f-e672d56507b2 is not ready in 60000 ms
        at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:241)
        at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleRetry(KeyOutputStream.java:413)
        at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleException(KeyOutputStream.java:358)
        at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:496)
        at org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:461)
        at org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
        at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:80)
        at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:75)
        at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:84)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:669)
Caused by: INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: Unable to allocate a container to the block of size: 268435456, replicationConfig: RATIS/THREE. Waiting for one of pipelines to be OPEN failed. Pipeline d12aa22f-4439-4321-98cc-e245280b88dd,ae3ea458-ab25-4fb3-a380-941bee9c1fdb,06c3b21c-9721-49e5-9b24-04a3836036d3,3b07bee5-d3e7-4cfd-ad8a-5f336f6cf53c,416f70bc-8f6d-4f56-827f-e672d56507b2 is not ready in 60000 ms
        at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:755)
        at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleSubmitRequestAndSCMSafeModeRetry(OzoneManagerProtocolClientSideTranslatorPB.java:2328)
        at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:791)
        at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateNewBlock(BlockOutputStreamEntryPool.java:303)
        at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateBlockIfNeeded(BlockOutputStreamEntryPool.java:397)
        at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:220)
        ... 12 more
2024-07-19 19:08:18,913 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region newtableloadtest,09999999,1721297943018.441fd24db6169f1d4c5ad7112b27d3b8.
org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=298963, requesting roll of WAL
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1208)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1081)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:982)
        at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: : Stream is closed! Key: hbase/WALs/ccycloud-2.ozn-hbaserepl2.xyz,22101,1721293279189/ccycloud-2.ozn-hbaserepl2.xyz%2C22101%2C1721293279189.ccycloud-2.ozn-hbaserepl2.xyz%2C22101%2C1721293279189.regiongroup-0.1721415945748
        at org.apache.hadoop.ozone.client.io.KeyOutputStream.checkNotClosed(KeyOutputStream.java:736)
        at org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:200)
        at org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:94)
        at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.lambda$write$1(OzoneFSOutputStream.java:58)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
        at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:54)
 {code}
On the SCM leader, logs like the following were seen:
{code:java}
2024-07-19 19:07:52,242 ERROR [IPC Server handler 81 on 9863]-org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider: Unable to allocate a block for the size: 268435456, repConfig: RATIS/THREE
2024-07-19 19:08:01,783 INFO [node3-EventQueue-PipelineReportForPipelineReportHandler]-org.apache.hadoop.hdds.scm.pipeline.PipelineReportHandler: Reported pipeline PipelineID=770772b8-ea18-4ca4-a5f7-76ceb53a8c01 is not found
2024-07-19 19:08:01,784 INFO [IPC Server handler 99 on 9860]-org.apache.hadoop.ipc.Server: IPC Server handler 99 on 9860, call Call#3336 Retry#0 org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocol.submitRequest from 10.140.86.142:60838
org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: PipelineID=770772b8-ea18-4ca4-a5f7-76ceb53a8c01 not found
        at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:151)
        at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerImpl.getPipeline(PipelineStateManagerImpl.java:138)
        at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeLocal(SCMHAInvocationHandler.java:92)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:75)
        at com.sun.proxy.$Proxy25.getPipeline(Unknown Source)
        at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.getPipeline(PipelineManagerImpl.java:335)
        at org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.getPipeline(SCMClientProtocolServer.java:761)
        at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.getPipeline(StorageContainerLocationProtocolServerSideTranslatorPB.java:960)
        at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:607)
        at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
        at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:232)
        at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
{code}
cc: [~weichiu] [~sammichen] [~ashishk] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
