[ 
https://issues.apache.org/jira/browse/HDDS-11391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDDS-11391:
--------------------------------------

    Assignee: Wei-Chiu Chuang

> Frequent Ozone DN Crashes During OM + DN Decommission with Freon
> ----------------------------------------------------------------
>
>                 Key: HDDS-11391
>                 URL: https://issues.apache.org/jira/browse/HDDS-11391
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>            Reporter: Pratyush Bhatt
>            Assignee: Wei-Chiu Chuang
>            Priority: Major
>
> *Scenario:*
> Decommission an OM and a DN, then trigger a write workload using Freon.
> *Observations:*
> The Freon workload reported failures:
> {code:java}
> 24/08/29 14:45:03 INFO freon.BaseFreonGenerator: Total execution time (sec): 
> 827
> 24/08/29 14:45:03 INFO freon.BaseFreonGenerator: Failures: 28
> 24/08/29 14:45:03 INFO freon.BaseFreonGenerator: Successful executions: 65006 
> {code}
> On checking the cluster, many other DNs were down besides the decommissioned DN.
> In the DN logs (DN-7, for example), there were multiple types of exceptions 
> just before the crash:
> 1. "Too many open files" errors:
> {code:java}
> java.util.concurrent.CompletionException: 
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
>  com.google.common.util.concurrent.UncheckedExecutionException: 
> java.io.UncheckedIOException: java.io.FileNotFoundException: 
> /hadoop-ozone/datanode/data/hdds/CID-6355ce15-c6ce-4589-88d1-28a070f1a673/current/containerDir3/2003/chunks/113750153625666987.block
>  (Too many open files)
>         at 
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
>         at 
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
>         at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:870)
>         at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>         at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>         at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog$Task.failed(SegmentedRaftLog.java:111)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$WriteLog.failed(SegmentedRaftLogWorker.java:521)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:319)
>         at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: 
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
>  com.google.common.util.concurrent.UncheckedExecutionException: 
> java.io.UncheckedIOException: java.io.FileNotFoundException: 
> /hadoop-ozone/datanode/data/hdds/CID-6355ce15-c6ce-4589-88d1-28a070f1a673/current/containerDir3/2003/chunks/113750153625666987.block
>  (Too many open files)
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$writeStateMachineData$4(ContainerStateMachine.java:589)
>         at 
> java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:642)
>         at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>         at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705)
>         ... 3 more {code}
> 2. Failed to take a snapshot:
> {code:java}
> org.apache.ratis.protocol.exceptions.StateMachineException: Failed to take 
> snapshot  for group-BC75686652BA as the stateMachine is unhealthy. The last 
> applied index is at (t:4, i:30263)
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.takeSnapshot(ContainerStateMachine.java:356)
>         at 
> org.apache.ratis.server.impl.StateMachineUpdater.takeSnapshot(StateMachineUpdater.java:286)
>         at 
> org.apache.ratis.server.impl.StateMachineUpdater.checkAndTakeSnapshot(StateMachineUpdater.java:278)
>         at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:197)
>         at java.base/java.lang.Thread.run(Thread.java:834) {code}
> 3. {{java.io.IOException: Bad address}} errors:
> {code:java}
> 2024-08-29 14:37:00,986 ERROR 
> [4fd3eede-f977-451e-b925-73d3075b3118-impl-thread5]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream:
>  Failed to flush SegmentedRaftLogOutputStream(log_inprogress_47)
> java.io.IOException: Bad address
>         at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>         at 
> java.base/sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62)
>         at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
>         at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:58)
>         at 
> java.base/sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:280)
>         at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.writeToChannel(BufferedWriteChannel.java:115)
>         at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.flushBuffer(BufferedWriteChannel.java:169)
>         at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.flush(BufferedWriteChannel.java:136)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.flush(SegmentedRaftLogOutputStream.java:127)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.close(SegmentedRaftLogOutputStream.java:115)
>         at org.apache.ratis.util.IOUtils.cleanup(IOUtils.java:183)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.close(SegmentedRaftLogWorker.java:247)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.close(SegmentedRaftLog.java:535)
>         at 
> org.apache.ratis.server.impl.ServerState.close(ServerState.java:434)
>         at 
> org.apache.ratis.server.impl.RaftServerImpl.lambda$close$1(RaftServerImpl.java:523)
>         at 
> org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$7(LifeCycle.java:306)
>         at 
> org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:326)
>         at 
> org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:304)
>         at 
> org.apache.ratis.server.impl.RaftServerImpl.close(RaftServerImpl.java:500)
>         at 
> org.apache.ratis.server.impl.RaftServerProxy$ImplMap.close(RaftServerProxy.java:137)
>         at 
> org.apache.ratis.server.impl.RaftServerProxy$ImplMap.lambda$close$0(RaftServerProxy.java:124)
>         at 
> org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:203)
>         at 
> org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:182)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834) {code}
> Finally, the DN crashes:
> {code:java}
> 2024-08-29 14:37:03,045 ERROR 
> [4fd3eede-f977-451e-b925-73d3075b3118-VolumeCheckResultHandlerThread-1]-org.apache.hadoop.ozone.HddsDatanodeService:
>  Stopping HttpServer is failed.
> java.lang.InterruptedException
>         at 
> java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1367)
>         at 
> java.base/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:278)
>         at 
> org.eclipse.jetty.server.AbstractConnector.doStop(AbstractConnector.java:373)
>         at 
> org.eclipse.jetty.server.AbstractNetworkConnector.doStop(AbstractNetworkConnector.java:88)
>         at 
> org.eclipse.jetty.server.ServerConnector.doStop(ServerConnector.java:246)
>         at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:94)
>         at org.eclipse.jetty.server.Server.doStop(Server.java:459)
>         at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:94)
>         at 
> org.apache.hadoop.hdds.server.http.HttpServer2.stop(HttpServer2.java:1363)
>         at 
> org.apache.hadoop.hdds.server.http.BaseHttpServer.stop(BaseHttpServer.java:339)
>         at 
> org.apache.hadoop.ozone.HddsDatanodeService.stop(HddsDatanodeService.java:543)
>         at 
> org.apache.hadoop.ozone.HddsDatanodeService.terminateDatanode(HddsDatanodeService.java:521)
>         at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.handleFatalVolumeFailures(DatanodeStateMachine.java:382)
>         at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.handleVolumeFailures(MutableVolumeSet.java:251)
>         at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.lambda$checkVolumeAsync$0(MutableVolumeSet.java:279)
>         at 
> org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker$ResultHandler.invokeCallback(StorageVolumeChecker.java:387)
>         at 
> org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker$ResultHandler.cleanup(StorageVolumeChecker.java:380)
>         at 
> org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker$ResultHandler.onSuccess(StorageVolumeChecker.java:354)
>         at 
> org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker$ResultHandler.onSuccess(StorageVolumeChecker.java:298)
>         at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1133)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> 2024-08-29 14:37:03,045 INFO 
> [4fd3eede-f977-451e-b925-73d3075b3118-VolumeCheckResultHandlerThread-1]-org.apache.hadoop.ozone.HddsDatanodeClientProtocolServer:
>  Stopping the RPC server for Client Protocol
> 2024-08-29 14:37:03,046 INFO 
> [4fd3eede-f977-451e-b925-73d3075b3118-VolumeCheckResultHandlerThread-1]-org.apache.hadoop.ipc.Server:
>  Stopping server on 19864
> 2024-08-29 14:37:03,048 INFO [IPC Server listener on 
> 19864]-org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 19864
> 2024-08-29 14:37:03,048 INFO [IPC Server 
> Responder]-org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
> 2024-08-29 14:37:03,050 INFO 
> [4fd3eede-f977-451e-b925-73d3075b3118-VolumeCheckResultHandlerThread-1]-org.apache.hadoop.util.ExitUtil:
>  Exiting with status 1: ExitException
> 2024-08-29 14:37:03,052 INFO 
> [shutdown-hook-0]-org.apache.hadoop.ozone.HddsDatanodeService: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down HddsDatanodeService at DN-7/10.140.118.4
> ************************************************************/ {code}
> cc: [~weichiu] [~ashishkr] 
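
A hedged first-triage sketch for error type 1 (not from the report): check whether the DN process is approaching its file-descriptor limit. This assumes Linux with shell access on the affected DN host, and the process-name pattern `HddsDatanodeService` is a guess; adjust it for how the datanode JVM is actually launched in your deployment.

```shell
# Soft fd limit for the current shell (the DN's own limit may differ;
# compare against /proc/<pid>/limits for the exact value).
ulimit -n

# Hypothetical PID lookup -- the pattern is an assumption.
DN_PID=$(pgrep -f HddsDatanodeService | head -n 1)
if [ -n "$DN_PID" ]; then
  # Count file descriptors currently held open by the DN process.
  echo "open fds: $(ls "/proc/$DN_PID/fd" | wc -l)"
else
  echo "datanode process not found"
fi
```

If the open-fd count is close to the limit, raising `nofile` for the datanode user (or finding the leak that holds block files open) would be the next step.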



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
