[
https://issues.apache.org/jira/browse/HDDS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Glen Geng updated HDDS-4408:
----------------------------
Description:
Datanode State Machine Thread should stay alive for the whole lifetime of the
Datanode, since it periodically generates heartbeat tasks which trigger the DN
to actively talk with SCM. If this thread crashes, the DN becomes a zombie:
although the process is alive, heartbeats between it and SCM stop.
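As a rough illustration of the failure mode described above (not necessarily the fix adopted for this issue), the sketch below shows a loop that catches Throwable per iteration, so that one failed iteration does not terminate the thread. The class KeepAliveLoop, the method runLoop, and the simulated error are hypothetical and are not Ozone code.

```java
import java.util.concurrent.atomic.AtomicInteger;

class KeepAliveLoop {
  // Runs the task `iterations` times. A Throwable thrown by one iteration is
  // counted and logged, and the loop continues instead of ending the thread.
  public static int runLoop(Runnable task, int iterations) {
    int failures = 0;
    for (int i = 0; i < iterations; i++) {
      try {
        task.run();
      } catch (Throwable t) {
        failures++;
        System.err.println("iteration " + i + " failed: " + t);
      }
    }
    return failures;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger heartbeats = new AtomicInteger();
    // Hypothetical stand-in for generating a heartbeat task; the second
    // call simulates the OOM seen in the .out log.
    Runnable task = () -> {
      if (heartbeats.incrementAndGet() == 2) {
        throw new OutOfMemoryError("simulated");
      }
    };
    Thread t = new Thread(() -> runLoop(task, 5),
        "Datanode State Machine Thread - 0");
    t.start();
    t.join();
    System.out.println(heartbeats.get()); // prints 5
  }
}
```

Whether it is safe to continue after an OutOfMemoryError is a judgment call. An alternative is to terminate the process from an uncaught exception handler so that an external supervisor restarts the DN.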
In Tencent's internal production environment, we found several dead DNs that
never come back without a restart.
We found that the thread "Datanode State Machine Thread - 0" is missing from
the jstack output. Without it no HeartbeatEndpointTask is created, so the DN
is soon marked dead and cannot recover unless it is restarted.
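The same check can be made from inside the JVM rather than with jstack. This is a minimal sketch, assuming only the thread name quoted above; the class ThreadLivenessCheck and the method isThreadAlive are hypothetical helpers, not part of Ozone.

```java
import java.util.Set;

class ThreadLivenessCheck {
  // Returns true if a live thread with exactly this name exists in this JVM.
  // Thread.getAllStackTraces() only reports live threads.
  public static boolean isThreadAlive(String name) {
    Set<Thread> threads = Thread.getAllStackTraces().keySet();
    for (Thread t : threads) {
      if (t.getName().equals(name) && t.isAlive()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) throws InterruptedException {
    String name = "Datanode State Machine Thread - 0";
    // Simulate the thread dying from an uncaught error.
    Thread worker = new Thread(() -> {
      throw new OutOfMemoryError("simulated");
    }, name);
    worker.setUncaughtExceptionHandler((t, e) -> { }); // suppress stack trace
    worker.start();
    worker.join();
    System.out.println(isThreadAlive(name)); // prints false
  }
}
```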
After checking the .out log, we saw that an OOM occurred in thread "Datanode
State Machine Thread", which appears to be responsible for this issue:
{code:java}
114370.799: Total time for which application threads were stopped: 1.0883622
seconds, Stopping threads took: 0.0002926 seconds
Exception in thread "Datanode State Machine Thread - 0"
java.lang.OutOfMemoryError: GC overhead limit exceeded
114370.810: Application time: 0.0115941 seconds
{Heap before GC invocations=2946 (full 2680):
PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000,
0x00000007c0000000, 0x00000007c0000000)
eden space 2846720K, 100% used
[0x00000006eab00000,0x0000000798700000,0x0000000798700000)
from space 323584K, 0% used
[0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
to space 324096K, 0% used
[0x0000000798700000,0x0000000798700000,0x00000007ac380000)
ParOldGen total 6990848K, used 6990627K [0x0000000540000000,
0x00000006eab00000, 0x00000006eab00000)
object space 6990848K, 99% used
[0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
class space used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
{code}
{code:java}
300010.579: Total time for which application threads were stopped: 3.0848769
seconds, Stopping threads took: 0.0000943 seconds
Exception in thread "Datanode State Machine Thread - 0"
java.lang.OutOfMemoryError: Java heap space
300010.579: Application time: 0.0001554 seconds
300010.580: Total time for which application threads were stopped: 0.0015600
seconds, Stopping threads took: 0.0002747 seconds
300010.581: Application time: 0.0004684 seconds
{Heap before GC invocations=13766 (full 11664):
PSYoungGen total 3441664K, used 3388416K [0x000000072ab00000,
0x0000000800000000, 0x0000000800000000)
eden space 3388416K, 100% used
[0x000000072ab00000,0x00000007f9800000,0x00000007f9800000)
from space 53248K, 0% used
[0x00000007fcc00000,0x00000007fcc00000,0x0000000800000000)
to space 53248K, 0% used
[0x00000007f9800000,0x00000007f9800000,0x00000007fcc00000)
ParOldGen total 6990848K, used 6990848K [0x0000000580000000,
0x000000072ab00000, 0x000000072ab00000)
object space 6990848K, 100% used
[0x0000000580000000,0x000000072ab00000,0x000000072ab00000)
Metaspace used 55150K, capacity 57816K, committed 59224K, reserved 1101824K
class space used 5922K, capacity 6372K, committed 6744K, reserved 1048576K
{code}
BTW, after running the DN for more than a week, we see a lot of
"java.lang.OutOfMemoryError: GC overhead limit exceeded" in the DN's log. Since
we configured a dead Recon, we guess this could be evidence for HDDS-4404.
was:
Datanode State Machine Thread should keep alive during the whole lifetime of
Datanode. Since it periodic generates heartbeat tasks which trigger DN to
actively talk with DN, it this thread crashes, DN will become a zombie:
although it is alive, heartbeats between DN and SCM are stopped.
In Tencent internal production environment, we got several dead DNs which can
never come back without a restart.
We found that the thread "Datanode State Machine Thread - 0" does not exist in
the output of jstack, thus no HeartbeatEndpointTask will be created, this DN
will soon become dead and can not recover unless being restarted.
After checked the .out log, we saw that OOM occurred in thread "Datanode State
Machine Thread", which should be responsible for this issue:
{code:java}
114370.799: Total time for which application threads were stopped: 1.0883622
seconds, Stopping threads took: 0.0002926 seconds
Exception in thread "Datanode State Machine Thread - 0"
java.lang.OutOfMemoryError: GC overhead limit exceeded
114370.810: Application time: 0.0115941 seconds
{Heap before GC invocations=2946 (full 2680):
PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000,
0x00000007c0000000, 0x00000007c0000000)
eden space 2846720K, 100% used
[0x00000006eab00000,0x0000000798700000,0x0000000798700000)
from space 323584K, 0% used
[0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
to space 324096K, 0% used
[0x0000000798700000,0x0000000798700000,0x00000007ac380000)
ParOldGen total 6990848K, used 6990627K [0x0000000540000000,
0x00000006eab00000, 0x00000006eab00000)
object space 6990848K, 99% used
[0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
class space used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
{code}
{code:java}
300010.579: Total time for which application threads were stopped: 3.0848769
seconds, Stopping threads took: 0.0000943 seconds
Exception in thread "Datanode State Machine Thread - 0"
java.lang.OutOfMemoryError: Java heap space
300010.579: Application time: 0.0001554 seconds
300010.580: Total time for which application threads were stopped: 0.0015600
seconds, Stopping threads took: 0.0002747 seconds
300010.581: Application time: 0.0004684 seconds
{Heap before GC invocations=13766 (full 11664):
PSYoungGen total 3441664K, used 3388416K [0x000000072ab00000,
0x0000000800000000, 0x0000000800000000)
eden space 3388416K, 100% used
[0x000000072ab00000,0x00000007f9800000,0x00000007f9800000)
from space 53248K, 0% used
[0x00000007fcc00000,0x00000007fcc00000,0x0000000800000000)
to space 53248K, 0% used
[0x00000007f9800000,0x00000007f9800000,0x00000007fcc00000)
ParOldGen total 6990848K, used 6990848K [0x0000000580000000,
0x000000072ab00000, 0x000000072ab00000)
object space 6990848K, 100% used
[0x0000000580000000,0x000000072ab00000,0x000000072ab00000)
Metaspace used 55150K, capacity 57816K, committed 59224K, reserved 1101824K
class space used 5922K, capacity 6372K, committed 6744K, reserved 1048576K
{code}
BTW, after running DN for more than a week, we see a lot of
"java.lang.OutOfMemoryError: GC overhead limit exceeded" in DN's log. Since we
configured a dead Recon, we guess this could an evidence for HDDS-4404.
> Datanode State Machine Thread should keep alive during the whole lifetime of
> Datanode
> -------------------------------------------------------------------------------------
>
> Key: HDDS-4408
> URL: https://issues.apache.org/jira/browse/HDDS-4408
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 1.1.0
> Reporter: Glen Geng
> Priority: Major
> Labels: pull-request-available
>
> Datanode State Machine Thread should stay alive for the whole lifetime of the
> Datanode, since it periodically generates heartbeat tasks which trigger the DN
> to actively talk with SCM. If this thread crashes, the DN becomes a zombie:
> although the process is alive, heartbeats between it and SCM stop.
>
> In Tencent's internal production environment, we found several dead DNs that
> never come back without a restart.
>
> We found that the thread "Datanode State Machine Thread - 0" is missing from
> the jstack output. Without it no HeartbeatEndpointTask is created, so the DN
> is soon marked dead and cannot recover unless it is restarted.
>
> After checking the .out log, we saw that an OOM occurred in thread "Datanode
> State Machine Thread", which appears to be responsible for this issue:
> {code:java}
> 114370.799: Total time for which application threads were stopped: 1.0883622
> seconds, Stopping threads took: 0.0002926 seconds
> Exception in thread "Datanode State Machine Thread - 0"
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> 114370.810: Application time: 0.0115941 seconds
> {Heap before GC invocations=2946 (full 2680):
> PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000,
> 0x00000007c0000000, 0x00000007c0000000)
> eden space 2846720K, 100% used
> [0x00000006eab00000,0x0000000798700000,0x0000000798700000)
> from space 323584K, 0% used
> [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
> to space 324096K, 0% used
> [0x0000000798700000,0x0000000798700000,0x00000007ac380000)
> ParOldGen total 6990848K, used 6990627K [0x0000000540000000,
> 0x00000006eab00000, 0x00000006eab00000)
> object space 6990848K, 99% used
> [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
> Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
> class space used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
> {code}
>
> {code:java}
> 300010.579: Total time for which application threads were stopped: 3.0848769
> seconds, Stopping threads took: 0.0000943 seconds
> Exception in thread "Datanode State Machine Thread - 0"
> java.lang.OutOfMemoryError: Java heap space
> 300010.579: Application time: 0.0001554 seconds
> 300010.580: Total time for which application threads were stopped: 0.0015600
> seconds, Stopping threads took: 0.0002747 seconds
> 300010.581: Application time: 0.0004684 seconds
> {Heap before GC invocations=13766 (full 11664):
> PSYoungGen total 3441664K, used 3388416K [0x000000072ab00000,
> 0x0000000800000000, 0x0000000800000000)
> eden space 3388416K, 100% used
> [0x000000072ab00000,0x00000007f9800000,0x00000007f9800000)
> from space 53248K, 0% used
> [0x00000007fcc00000,0x00000007fcc00000,0x0000000800000000)
> to space 53248K, 0% used
> [0x00000007f9800000,0x00000007f9800000,0x00000007fcc00000)
> ParOldGen total 6990848K, used 6990848K [0x0000000580000000,
> 0x000000072ab00000, 0x000000072ab00000)
> object space 6990848K, 100% used
> [0x0000000580000000,0x000000072ab00000,0x000000072ab00000)
> Metaspace used 55150K, capacity 57816K, committed 59224K, reserved 1101824K
> class space used 5922K, capacity 6372K, committed 6744K, reserved 1048576K
> {code}
>
> BTW, after running the DN for more than a week, we see a lot of
> "java.lang.OutOfMemoryError: GC overhead limit exceeded" in the DN's log.
> Since we configured a dead Recon, we guess this could be evidence for
> HDDS-4404.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)