[jira] [Updated] (HDDS-4408) Datanode State Machine Thread should keep alive during the whole lifetime of Datanode

Glen Geng (Jira) Thu, 29 Oct 2020 19:28:31 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Glen Geng updated HDDS-4408:
----------------------------
    Description: 
The Datanode State Machine Thread should stay alive for the whole lifetime of 
the Datanode, since it periodically generates the heartbeat tasks that trigger 
the DN to actively talk with SCM. If this thread crashes, the DN becomes a 
zombie: although the process is alive, heartbeats between it and SCM stop.
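
One way to keep such a thread alive can be sketched as below. This is a minimal 
illustration, not the fix proposed for this issue: the class 
KeepAliveThreadSketch, the method startSupervised and the maxRestarts limit are 
hypothetical names and do not come from Ozone. The sketch only relies on the 
standard Thread.setUncaughtExceptionHandler hook to start a replacement thread 
when the original one dies from an uncaught Throwable.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; none of these names come from the Ozone code base. */
final class KeepAliveThreadSketch {

  /**
   * Starts a daemon thread that runs {@code loop}. If the thread dies from an
   * uncaught Throwable, a replacement with the same name is started, at most
   * {@code maxRestarts} times in total.
   */
  static Thread startSupervised(String name, Runnable loop,
                                AtomicInteger restarts, int maxRestarts) {
    Thread t = new Thread(loop, name);
    t.setDaemon(true);
    t.setUncaughtExceptionHandler((thread, error) -> {
      // Runs on the dying thread, just before it terminates.
      System.err.println(thread.getName() + " died: " + error);
      if (restarts.incrementAndGet() <= maxRestarts) {
        startSupervised(name, loop, restarts, maxRestarts);
      }
    });
    t.start();
    return t;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger restarts = new AtomicInteger();
    AtomicInteger runs = new AtomicInteger();
    CountDownLatch recovered = new CountDownLatch(1);

    // Stand-in for the periodic loop: crashes on the first run only.
    Runnable loop = () -> {
      if (runs.incrementAndGet() == 1) {
        throw new IllegalStateException("simulated crash");
      }
      recovered.countDown();
    };

    startSupervised("Demo State Machine Thread", loop, restarts, 3);
    recovered.await(5, TimeUnit.SECONDS);
    System.out.println("restarts = " + restarts.get());
  }
}
```

Restarting a thread after an OutOfMemoryError may not help if the heap is still 
exhausted. An alternative is to let the whole process terminate so that an 
external supervisor restarts it, for example with the HotSpot option 
-XX:+ExitOnOutOfMemoryError.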

 

In Tencent's internal production environment, we found several dead DNs that 
never come back without a restart.

 

We found that the thread "Datanode State Machine Thread - 0" does not exist in 
the output of jstack, so no HeartbeatEndpointTask is created. The DN soon 
becomes dead and cannot recover unless it is restarted.
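
The same check can be sketched from inside the JVM. This is an illustration 
only: the class ThreadPresenceSketch and the method hasLiveThread are 
hypothetical names, not part of Ozone. It uses the standard 
Thread.getAllStackTraces() call, which returns the live threads, and looks for 
one whose name starts with a given prefix.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical sketch; not part of the Ozone code base. */
final class ThreadPresenceSketch {

  /** Returns true if a live thread exists whose name starts with the prefix. */
  static boolean hasLiveThread(String namePrefix) {
    return Thread.getAllStackTraces().keySet().stream()
        .anyMatch(t -> t.isAlive() && t.getName().startsWith(namePrefix));
  }

  public static void main(String[] args) throws InterruptedException {
    String prefix = "Datanode State Machine Thread";
    CountDownLatch stop = new CountDownLatch(1);

    // Stand-in worker that carries the thread name seen in the jstack output.
    Thread worker = new Thread(() -> {
      try {
        stop.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, prefix + " - 0");
    worker.start();

    System.out.println("while running: " + hasLiveThread(prefix));
    stop.countDown();
    worker.join();
    System.out.println("after exit: " + hasLiveThread(prefix));
  }
}
```

A periodic watchdog could call such a check and raise an alert or a metric when 
the thread is missing, instead of relying on a manual jstack.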

 

After checking the .out log, we saw that an OOM occurred in the thread 
"Datanode State Machine Thread", which is likely responsible for this issue:
{code:java}
114370.799: Total time for which application threads were stopped: 1.0883622 seconds, Stopping threads took: 0.0002926 seconds
Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: GC overhead limit exceeded
114370.810: Application time: 0.0115941 seconds
{Heap before GC invocations=2946 (full 2680):
 PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000, 0x00000007c0000000, 0x00000007c0000000)
 eden space 2846720K, 100% used [0x00000006eab00000,0x0000000798700000,0x0000000798700000)
 from space 323584K, 0% used [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
 to space 324096K, 0% used [0x0000000798700000,0x0000000798700000,0x00000007ac380000)
 ParOldGen total 6990848K, used 6990627K [0x0000000540000000, 0x00000006eab00000, 0x00000006eab00000)
 object space 6990848K, 99% used [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
 Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
 class space used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
{code}
 
{code:java}
300010.579: Total time for which application threads were stopped: 3.0848769 seconds, Stopping threads took: 0.0000943 seconds
Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: Java heap space
300010.579: Application time: 0.0001554 seconds
300010.580: Total time for which application threads were stopped: 0.0015600 seconds, Stopping threads took: 0.0002747 seconds
300010.581: Application time: 0.0004684 seconds
{Heap before GC invocations=13766 (full 11664):
 PSYoungGen total 3441664K, used 3388416K [0x000000072ab00000, 0x0000000800000000, 0x0000000800000000)
 eden space 3388416K, 100% used [0x000000072ab00000,0x00000007f9800000,0x00000007f9800000)
 from space 53248K, 0% used [0x00000007fcc00000,0x00000007fcc00000,0x0000000800000000)
 to space 53248K, 0% used [0x00000007f9800000,0x00000007f9800000,0x00000007fcc00000)
 ParOldGen total 6990848K, used 6990848K [0x0000000580000000, 0x000000072ab00000, 0x000000072ab00000)
 object space 6990848K, 100% used [0x0000000580000000,0x000000072ab00000,0x000000072ab00000)
 Metaspace used 55150K, capacity 57816K, committed 59224K, reserved 1101824K
 class space used 5922K, capacity 6372K, committed 6744K, reserved 1048576K
{code}
 

BTW, after running a DN for more than a week, we see a lot of 
"java.lang.OutOfMemoryError: GC overhead limit exceeded" in the DN's log. Since 
we configured a dead Recon, we guess this could be evidence for HDDS-4404.

  was:
Datanode State Machine Thread should keep alive during the whole lifetime of 
Datanode. Since it periodic generates heartbeat tasks which trigger DN to 
actively talk with DN, it this thread crashes, DN will become a zombie: 
although it is alive, heartbeats between DN and SCM are stopped.

 

In Tencent internal production environment, we got several dead DNs which can 
never come back without a restart.

 

We found that the thread "Datanode State Machine Thread - 0" does not exist in 
the output of jstack, thus no HeartbeatEndpointTask will be created,  this DN 
will soon become dead and can not recover unless being restarted.

 

After checked the .out log, we saw that OOM occurred in thread "Datanode State 
Machine Thread", which should be responsible for this issue:
{code:java}
114370.799: Total time for which application threads were stopped: 1.0883622 seconds, Stopping threads took: 0.0002926 seconds
Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: GC overhead limit exceeded
114370.810: Application time: 0.0115941 seconds
{Heap before GC invocations=2946 (full 2680):
 PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000, 0x00000007c0000000, 0x00000007c0000000)
 eden space 2846720K, 100% used [0x00000006eab00000,0x0000000798700000,0x0000000798700000)
 from space 323584K, 0% used [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
 to space 324096K, 0% used [0x0000000798700000,0x0000000798700000,0x00000007ac380000)
 ParOldGen total 6990848K, used 6990627K [0x0000000540000000, 0x00000006eab00000, 0x00000006eab00000)
 object space 6990848K, 99% used [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
 Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
 class space used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
{code}
 
{code:java}
300010.579: Total time for which application threads were stopped: 3.0848769 seconds, Stopping threads took: 0.0000943 seconds
Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: Java heap space
300010.579: Application time: 0.0001554 seconds
300010.580: Total time for which application threads were stopped: 0.0015600 seconds, Stopping threads took: 0.0002747 seconds
300010.581: Application time: 0.0004684 seconds
{Heap before GC invocations=13766 (full 11664):
 PSYoungGen total 3441664K, used 3388416K [0x000000072ab00000, 0x0000000800000000, 0x0000000800000000)
 eden space 3388416K, 100% used [0x000000072ab00000,0x00000007f9800000,0x00000007f9800000)
 from space 53248K, 0% used [0x00000007fcc00000,0x00000007fcc00000,0x0000000800000000)
 to space 53248K, 0% used [0x00000007f9800000,0x00000007f9800000,0x00000007fcc00000)
 ParOldGen total 6990848K, used 6990848K [0x0000000580000000, 0x000000072ab00000, 0x000000072ab00000)
 object space 6990848K, 100% used [0x0000000580000000,0x000000072ab00000,0x000000072ab00000)
 Metaspace used 55150K, capacity 57816K, committed 59224K, reserved 1101824K
 class space used 5922K, capacity 6372K, committed 6744K, reserved 1048576K
{code}
 

BTW, after running DN for more than a week, we see a lot of 
"java.lang.OutOfMemoryError: GC overhead limit exceeded" in DN's log. Since we 
configured a dead Recon, we guess this could an evidence for HDDS-4404.


> Datanode State Machine Thread should keep alive during the whole lifetime of 
> Datanode
> -------------------------------------------------------------------------------------
>
>                 Key: HDDS-4408
>                 URL: https://issues.apache.org/jira/browse/HDDS-4408
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 1.1.0
>            Reporter: Glen Geng
>            Priority: Major
>              Labels: pull-request-available


