Pratyush Bhatt created HDDS-11374:
-------------------------------------

             Summary: RegionServer Crashed with "java.io.InterruptedIOExceptio"
                 Key: HDDS-11374
                 URL: https://issues.apache.org/jira/browse/HDDS-11374
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Pratyush Bhatt


Ran on a setup with HDDS-9844 and the fix for KeyoutputStream HDDS-11239
ozone.client.key.write.concurrency is set to -1 for unlimited concurrency.

Triggered Bulkload and LoadTestTool on the cluster with Hbase on Ozone.

The RegionServer Crashed abruptly.

Logs just before the crash:
{code:java}
2024-08-27 05:17:27,551 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
Error syncing, request close of WAL
java.io.InterruptedIOException
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:602)
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:550)
        at 
org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
        at 
org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
        at 
org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
        at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:80)
        at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:75)
        at 
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
        at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:84)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:669)
2024-08-27 05:17:27,552 WARN org.apache.hadoop.ozone.client.io.KeyOutputStream: 
Encountered exception java.io.IOException: Unexpected Storage Container 
Exception: java.lang.InterruptedException on the pipeline Pipeline[ Id: 
bdab774b-98bf-4f51-8986-ad27889b186a, Nodes: 
c02f4835-7103-4c1c-94b7-65f7d0ce02bd(vc0133.xyz/10.17.207.43) ReplicaIndex: 
0319a1376-3f35-44a4-85e7-c364b9cb03f9(vc0132.xyz/10.17.207.42) ReplicaIndex: 
080eebcdd-5257-4a30-98ce-c53943ba1016(vc0136.xyz/10.17.207.46) ReplicaIndex: 0, 
ReplicationConfig: RATIS/THREE, State:OPEN, 
leaderId:319a1376-3f35-44a4-85e7-c364b9cb03f9, 
CreationTimestamp2024-08-27T03:26:53.789-07:00[America/Los_Angeles]]. The last 
committed block length is 2362474, uncommitted data length is 0 retry count 0 
{code}

Logs during the crash:
{code:java}
2024-08-27 05:17:27,554 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
Error syncing, request close of WAL
java.io.InterruptedIOException: Interrupted waiting to send RPC request to 
server
        at org.apache.hadoop.ipc.Client.call(Client.java:1517)
        at org.apache.hadoop.ipc.Client.call(Client.java:1459)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy25.submitRequest(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
        at com.sun.proxy.$Proxy25.submitRequest(Unknown Source)
        at 
org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)
        at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:343)
        at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:862)
        at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.hsyncKey(OzoneManagerProtocolClientSideTranslatorPB.java:807)
        at 
org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.lambda$hsyncKey$1(BlockOutputStreamEntryPool.java:359)
        at 
org.apache.hadoop.ozone.util.MetricUtil.captureLatencyNs(MetricUtil.java:78)
        at 
org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.hsyncKey(BlockOutputStreamEntryPool.java:358)
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.lambda$null$3(KeyOutputStream.java:556)
        at 
org.apache.hadoop.ozone.util.MetricUtil.captureLatencyNs(MetricUtil.java:78)
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.lambda$hsync$4(KeyOutputStream.java:555)
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.doInWriteLock(KeyOutputStream.java:260)
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:552)
        at 
org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
        at 
org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
        at 
org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
        at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:80)
        at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:75)
        at 
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
        at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:84)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:669)
Caused by: java.lang.InterruptedException
        at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
        at java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at 
org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1220)
        at org.apache.hadoop.ipc.Client.call(Client.java:1512)
        ... 33 more
2024-08-27 05:17:27,554 INFO 
org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry: 
BlockOutputStreamEntry:hbase/WALs/vc0139.xyz,22101,1724755172412/vc0139.xyz%2C22101%2C1724755172412.vc0139.xyz%2C22101%2C1724755172412.regiongroup-3.1724761021452
 conID: 3070 locID: 113750153625747794 bcsId: 0 replicaIndex: null : Done 
waiting for retry handling.
2024-08-27 05:17:27,555 INFO 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Slow sync cost: 332 ms, 
current pipeline: []
2024-08-27 05:17:27,686 ERROR 
org.apache.hadoop.hbase.regionserver.HRegionServer: ***** ABORTING region 
server vc0139.xyz,22101,1724755172412: Failed log close in log roller *****
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: 
ofs://ozone1724523753/hbasevol322708/hbasebuck/hbase/WALs/vc0139.xyz,22101,1724755172412/vc0139.xyz%2C22101%2C1724755172412.vc0139.xyz%2C22101%2C1724755172412.regiongroup-4.1724761011264,
 unflushedEntries=1
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.doReplaceWriter(FSHLog.java:428)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.doReplaceWriter(FSHLog.java:71)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.replaceWriter(AbstractFSWAL.java:830)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:889)
        at 
org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:304)
        at 
org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:211)
Caused by: 
org.apache.hadoop.hbase.regionserver.wal.FailedSyncBeforeLogCloseException: 
org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append 
sequenceId=25700, requesting roll of WAL
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.checkIfSyncFailed(FSHLog.java:908)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:924)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.doReplaceWriter(FSHLog.java:368)
        ... 5 more
Caused by: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append 
sequenceId=25700, requesting roll of WAL
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1208)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1081)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:982)
        at 
com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
        at 
com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: : Stream is closed! Key: 
hbase/WALs/vc0139.xyz,22101,1724755172412/vc0139.xyz%2C22101%2C1724755172412.vc0139.xyz%2C22101%2C1724755172412.regiongroup-4.1724761011264
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.checkNotClosed(KeyOutputStream.java:856)
        at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:236)
        at 
org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:94)
        at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.lambda$write$1(OzoneFSOutputStream.java:58)
        at 
org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
        at 
org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
        at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:54)
        at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at 
org.apache.hbase.thirdparty.com.google.protobuf.CodedOutputStream$OutputStreamEncoder.doFlush(CodedOutputStream.java:3041)
        at 
org.apache.hbase.thirdparty.com.google.protobuf.CodedOutputStream$OutputStreamEncoder.flushIfNotAvailable(CodedOutputStream.java:3036)
        at 
org.apache.hbase.thirdparty.com.google.protobuf.CodedOutputStream$OutputStreamEncoder.writeUInt64(CodedOutputStream.java:2726)
        at 
org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALKey.writeTo(WALProtos.java:1887)
        at 
org.apache.hbase.thirdparty.com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:95)
        at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:55)
 {code}
cc: [~duongnguyen] [~weichiu] [~ashishkr] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to