[ https://issues.apache.org/jira/browse/HDDS-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pratyush Bhatt updated HDDS-11374:
----------------------------------
Summary: RegionServer Crashed with "java.io.InterruptedIOException" (was: RegionServer Crashed with "java.io.InterruptedIOExceptio")
> RegionServer Crashed with "java.io.InterruptedIOException"
> ----------------------------------------------------------
>
> Key: HDDS-11374
> URL: https://issues.apache.org/jira/browse/HDDS-11374
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Pratyush Bhatt
> Priority: Major
>
> Ran on a setup with HDDS-9844 and the KeyOutputStream fix from HDDS-11239.
> ozone.client.key.write.concurrency is set to -1 for unlimited concurrency.
> Triggered Bulkload and LoadTestTool on the cluster running HBase on Ozone.
> The RegionServer crashed abruptly.
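The concurrency setting above would normally be supplied as a client-side property; a sketch of how it might look (the property name is from the report, the ozone-site.xml placement is the usual Ozone convention and is an assumption here):

```xml
<!-- Sketch only: client setting described in the report. -->
<property>
  <name>ozone.client.key.write.concurrency</name>
  <!-- -1 lifts the per-key concurrent-write limit -->
  <value>-1</value>
</property>
```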
> Logs just before the crash:
> {code:java}
> 2024-08-27 05:17:27,551 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> java.io.InterruptedIOException
>     at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:602)
>     at org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:550)
>     at org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
>     at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
>     at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
>     at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:80)
>     at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:75)
>     at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
>     at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:84)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:669)
> 2024-08-27 05:17:27,552 WARN org.apache.hadoop.ozone.client.io.KeyOutputStream: Encountered exception
> java.io.IOException: Unexpected Storage Container Exception: java.lang.InterruptedException on the pipeline Pipeline[ Id: bdab774b-98bf-4f51-8986-ad27889b186a, Nodes: c02f4835-7103-4c1c-94b7-65f7d0ce02bd(vc0133.xyz/10.17.207.43) ReplicaIndex: 0, 319a1376-3f35-44a4-85e7-c364b9cb03f9(vc0132.xyz/10.17.207.42) ReplicaIndex: 0, 80eebcdd-5257-4a30-98ce-c53943ba1016(vc0136.xyz/10.17.207.46) ReplicaIndex: 0, ReplicationConfig: RATIS/THREE, State:OPEN, leaderId:319a1376-3f35-44a4-85e7-c364b9cb03f9, CreationTimestamp2024-08-27T03:26:53.789-07:00[America/Los_Angeles]]. The last committed block length is 2362474, uncommitted data length is 0 retry count 0
> {code}
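The trace above shows a thread interrupted while blocked inside KeyOutputStream.hsync(), surfacing as java.io.InterruptedIOException. A minimal, self-contained sketch of that translation pattern (hypothetical names, not Ozone or Hadoop code): a blocking Future.get() is interrupted and rethrown as InterruptedIOException, the same idiom Hadoop's IPC client uses when the RPC send is interrupted.

```java
import java.io.InterruptedIOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InterruptTranslation {
    // Sketch: block on a pending operation; if interrupted, restore the
    // interrupt flag and rethrow as InterruptedIOException.
    static void blockingSync(Future<?> pending) throws InterruptedIOException {
        try {
            pending.get();  // blocks, like the RPC send in the trace
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve interrupt status
            InterruptedIOException iioe = new InterruptedIOException(
                "Interrupted waiting to send RPC request to server");
            iioe.initCause(e);
            throw iioe;
        } catch (java.util.concurrent.ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // A task that never completes in time, so get() stays blocked.
        Future<?> neverDone = pool.submit(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        });
        Thread syncer = Thread.currentThread();
        new Thread(() -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) { }
            syncer.interrupt();  // mimic the WAL roller interrupting the sync thread
        }).start();
        boolean sawIioe = false;
        try {
            blockingSync(neverDone);
        } catch (InterruptedIOException e) {
            sawIioe = true;
        }
        System.out.println("caught InterruptedIOException: " + sawIioe);
        pool.shutdownNow();
    }
}
```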
> Logs during the crash:
> {code:java}
> 2024-08-27 05:17:27,554 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> java.io.InterruptedIOException: Interrupted waiting to send RPC request to server
>     at org.apache.hadoop.ipc.Client.call(Client.java:1517)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1459)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>     at com.sun.proxy.$Proxy25.submitRequest(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>     at com.sun.proxy.$Proxy25.submitRequest(Unknown Source)
>     at org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)
>     at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:343)
>     at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:862)
>     at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.hsyncKey(OzoneManagerProtocolClientSideTranslatorPB.java:807)
>     at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.lambda$hsyncKey$1(BlockOutputStreamEntryPool.java:359)
>     at org.apache.hadoop.ozone.util.MetricUtil.captureLatencyNs(MetricUtil.java:78)
>     at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.hsyncKey(BlockOutputStreamEntryPool.java:358)
>     at org.apache.hadoop.ozone.client.io.KeyOutputStream.lambda$null$3(KeyOutputStream.java:556)
>     at org.apache.hadoop.ozone.util.MetricUtil.captureLatencyNs(MetricUtil.java:78)
>     at org.apache.hadoop.ozone.client.io.KeyOutputStream.lambda$hsync$4(KeyOutputStream.java:555)
>     at org.apache.hadoop.ozone.client.io.KeyOutputStream.doInWriteLock(KeyOutputStream.java:260)
>     at org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:552)
>     at org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
>     at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
>     at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
>     at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:80)
>     at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:75)
>     at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
>     at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:84)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:669)
> Caused by: java.lang.InterruptedException
>     at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:191)
>     at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1220)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1512)
>     ... 33 more
> 2024-08-27 05:17:27,554 INFO org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry: BlockOutputStreamEntry:hbase/WALs/vc0139.xyz,22101,1724755172412/vc0139.xyz%2C22101%2C1724755172412.vc0139.xyz%2C22101%2C1724755172412.regiongroup-3.1724761021452 conID: 3070 locID: 113750153625747794 bcsId: 0 replicaIndex: null : Done waiting for retry handling.
> 2024-08-27 05:17:27,555 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Slow sync cost: 332 ms, current pipeline: []
> 2024-08-27 05:17:27,686 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ***** ABORTING region server vc0139.xyz,22101,1724755172412: Failed log close in log roller *****
> org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: ofs://ozone1724523753/hbasevol322708/hbasebuck/hbase/WALs/vc0139.xyz,22101,1724755172412/vc0139.xyz%2C22101%2C1724755172412.vc0139.xyz%2C22101%2C1724755172412.regiongroup-4.1724761011264, unflushedEntries=1
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.doReplaceWriter(FSHLog.java:428)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.doReplaceWriter(FSHLog.java:71)
>     at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.replaceWriter(AbstractFSWAL.java:830)
>     at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:889)
>     at org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:304)
>     at org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:211)
> Caused by: org.apache.hadoop.hbase.regionserver.wal.FailedSyncBeforeLogCloseException: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=25700, requesting roll of WAL
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.checkIfSyncFailed(FSHLog.java:908)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:924)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.doReplaceWriter(FSHLog.java:368)
>     ... 5 more
> Caused by: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=25700, requesting roll of WAL
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1208)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1081)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:982)
>     at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
>     at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: : Stream is closed! Key: hbase/WALs/vc0139.xyz,22101,1724755172412/vc0139.xyz%2C22101%2C1724755172412.vc0139.xyz%2C22101%2C1724755172412.regiongroup-4.1724761011264
>     at org.apache.hadoop.ozone.client.io.KeyOutputStream.checkNotClosed(KeyOutputStream.java:856)
>     at org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:236)
>     at org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:94)
>     at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.lambda$write$1(OzoneFSOutputStream.java:58)
>     at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
>     at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
>     at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:54)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
>     at java.io.DataOutputStream.write(DataOutputStream.java:107)
>     at org.apache.hbase.thirdparty.com.google.protobuf.CodedOutputStream$OutputStreamEncoder.doFlush(CodedOutputStream.java:3041)
>     at org.apache.hbase.thirdparty.com.google.protobuf.CodedOutputStream$OutputStreamEncoder.flushIfNotAvailable(CodedOutputStream.java:3036)
>     at org.apache.hbase.thirdparty.com.google.protobuf.CodedOutputStream$OutputStreamEncoder.writeUInt64(CodedOutputStream.java:2726)
>     at org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALKey.writeTo(WALProtos.java:1887)
>     at org.apache.hbase.thirdparty.com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:95)
>     at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:55)
> {code}
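The innermost cause ("Stream is closed!") is a closed-stream guard: once the failed sync closes the key stream, every later WAL append fails the precondition check before reaching the write path. A hedged sketch of that pattern (hypothetical class, not the real KeyOutputStream.checkNotClosed):

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical illustration of a closed-stream guard; in the trace the real
// check is KeyOutputStream.checkNotClosed (KeyOutputStream.java:856).
public class ClosedStreamDemo {
    static class GuardedStream extends OutputStream {
        private volatile boolean closed;
        private final String key;
        GuardedStream(String key) { this.key = key; }

        private void checkNotClosed() throws IOException {
            if (closed) {
                throw new IOException("Stream is closed! Key: " + key);
            }
        }

        @Override public void write(int b) throws IOException {
            checkNotClosed();  // fail fast before touching the write path
            // real write path omitted
        }

        @Override public void close() { closed = true; }
    }

    public static void main(String[] args) throws IOException {
        GuardedStream s = new GuardedStream("hbase/WALs/example-wal");
        s.write(1);            // succeeds while open
        s.close();             // e.g. closed after a failed sync
        boolean rejected = false;
        try {
            s.write(2);        // a later append hits the guard
        } catch (IOException e) {
            rejected = e.getMessage().startsWith("Stream is closed!");
        }
        System.out.println("closed write rejected: " + rejected);
    }
}
```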
> cc: [~duongnguyen] [~weichiu] [~ashishkr]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]