[
https://issues.apache.org/jira/browse/HDDS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arpit Agarwal updated HDDS-4408:
--------------------------------
Priority: Critical (was: Major)
> Datanode State Machine Thread should keep alive during the whole lifetime of
> Datanode
> -------------------------------------------------------------------------------------
>
> Key: HDDS-4408
> URL: https://issues.apache.org/jira/browse/HDDS-4408
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 1.1.0
> Reporter: Glen Geng
> Assignee: Glen Geng
> Priority: Critical
> Labels: pull-request-available
>
> Datanode State Machine Thread should stay alive during the whole lifetime of
> the Datanode, since it periodically generates heartbeat tasks which trigger
> the DN to actively talk with SCM. If this thread crashes, the DN becomes a
> zombie: although the process is alive, heartbeats between it and SCM stop.
>
> In Tencent's internal production environment, we found several dead DNs that
> never came back without a restart.
> The thread "Datanode State Machine Thread - 0" was missing from the jstack
> output, so no HeartbeatEndpointTask is created; such a DN soon becomes dead
> and cannot recover unless it is restarted.
>
> After checking the .out log, we saw that an OOM occurred in the thread
> "Datanode State Machine Thread", which appears to be responsible for this
> issue:
> {code:java}
> 114370.799: Total time for which application threads were stopped: 1.0883622
> seconds, Stopping threads took: 0.0002926 seconds
> Exception in thread "Datanode State Machine Thread - 0"
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> 114370.810: Application time: 0.0115941 seconds
> {Heap before GC invocations=2946 (full 2680):
> PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000,
> 0x00000007c0000000, 0x00000007c0000000)
> eden space 2846720K, 100% used
> [0x00000006eab00000,0x0000000798700000,0x0000000798700000)
> from space 323584K, 0% used
> [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
> to space 324096K, 0% used
> [0x0000000798700000,0x0000000798700000,0x00000007ac380000)
> ParOldGen total 6990848K, used 6990627K [0x0000000540000000,
> 0x00000006eab00000, 0x00000006eab00000)
> object space 6990848K, 99% used
> [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
> Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
> class space used 6583K, capacity 7031K, committed 7296K, reserved
> 1048576K
> {code}
>
> {code:java}
> 300010.579: Total time for which application threads were stopped: 3.0848769
> seconds, Stopping threads took: 0.0000943 seconds
> Exception in thread "Datanode State Machine Thread - 0"
> java.lang.OutOfMemoryError: Java heap space
> 300010.579: Application time: 0.0001554 seconds
> 300010.580: Total time for which application threads were stopped: 0.0015600
> seconds, Stopping threads took: 0.0002747 seconds
> 300010.581: Application time: 0.0004684 seconds
> {Heap before GC invocations=13766 (full 11664):
> PSYoungGen total 3441664K, used 3388416K [0x000000072ab00000,
> 0x0000000800000000, 0x0000000800000000)
> eden space 3388416K, 100% used
> [0x000000072ab00000,0x00000007f9800000,0x00000007f9800000)
> from space 53248K, 0% used
> [0x00000007fcc00000,0x00000007fcc00000,0x0000000800000000)
> to space 53248K, 0% used
> [0x00000007f9800000,0x00000007f9800000,0x00000007fcc00000)
> ParOldGen total 6990848K, used 6990848K [0x0000000580000000,
> 0x000000072ab00000, 0x000000072ab00000)
> object space 6990848K, 100% used
> [0x0000000580000000,0x000000072ab00000,0x000000072ab00000)
> Metaspace used 55150K, capacity 57816K, committed 59224K, reserved 1101824K
> class space used 5922K, capacity 6372K, committed 6744K, reserved
> 1048576K
> {code}
>
> BTW, after running the DN for more than a week, we see many
> "java.lang.OutOfMemoryError: GC overhead limit exceeded" errors in the DN's
> log. Since we configured a dead Recon, we guess this could be evidence for
> HDDS-4404.
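
The zombie state described in the issue arises when a Throwable escapes the periodic loop and the thread terminates. As a minimal sketch of one way a periodic loop can outlive a failed round, each round can be wrapped so that any Throwable is logged rather than propagated. This is not the actual Ozone code, nor the fix adopted for this issue; `KeepAliveLoopSketch`, `Task`, and `runLoop` are hypothetical names chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: names below are illustrative, not Ozone classes.
class KeepAliveLoopSketch {

    /** Stand-in for one round of periodic work, e.g. creating heartbeat tasks. */
    public interface Task {
        void run() throws Exception;
    }

    /**
     * Runs the task for the given number of rounds, pausing between rounds.
     * A Throwable thrown by one round (including Errors such as
     * OutOfMemoryError) is logged and counted, so later rounds still run.
     *
     * @return the number of rounds that failed
     */
    public static int runLoop(Task task, int rounds, long intervalMs)
            throws InterruptedException {
        int failures = 0;
        for (int i = 0; i < rounds; i++) {
            try {
                task.run();
            } catch (Throwable t) {
                // Without this catch, the Throwable would end the thread.
                failures++;
                System.err.println("round " + i + " failed: " + t);
            }
            Thread.sleep(intervalMs);
        }
        return failures;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        // Simulate a failure on the second round only.
        Task flaky = () -> {
            if (attempts.incrementAndGet() == 2) {
                throw new OutOfMemoryError("simulated");
            }
        };
        int failures = runLoop(flaky, 5, 10L);
        System.out.println("attempted rounds: " + attempts.get()
                + ", failed rounds: " + failures);
    }
}
```

In this sketch the loop completes all five rounds with one recorded failure. Catching OutOfMemoryError keeps the thread alive but does not relieve the memory pressure; an alternative design is to let the process exit on such errors so that an external supervisor restarts it.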