[ https://issues.apache.org/jira/browse/HDDS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated HDDS-4408:
----------------------------
    Description: 
In the Tencent internal production environment, we found several dead DNs that
never came back.

We found that the thread "Datanode State Machine Thread - 0" does not exist in
the jstack output, so no HeartbeatEndpointTask is created; the DNs soon become
dead and do not recover unless they are restarted.

After checking the .out log, we saw that an OOM occurred in the thread
"Datanode State Machine Thread", which killed the thread.
{code:java}
114370.799: Total time for which application threads were stopped: 1.0883622 seconds, Stopping threads took: 0.0002926 seconds
Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: GC overhead limit exceeded
114370.810: Application time: 0.0115941 seconds
{Heap before GC invocations=2946 (full 2680):
 PSYoungGen      total 3170304K, used 2846720K [0x00000006eab00000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 2846720K, 100% used [0x00000006eab00000,0x0000000798700000,0x0000000798700000)
  from space 323584K, 0% used [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
  to   space 324096K, 0% used [0x0000000798700000,0x0000000798700000,0x00000007ac380000)
 ParOldGen       total 6990848K, used 6990627K [0x0000000540000000, 0x00000006eab00000, 0x00000006eab00000)
  object space 6990848K, 99% used [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
 Metaspace       used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
  class space    used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
{code}

  was:
In the Tencent production environment, we found several dead DNs that never
came back.

We found that the thread "Datanode State Machine Thread - 0" does not exist in
the jstack output, so no HeartbeatEndpointTask is created; the DNs soon become
dead and do not recover unless they are restarted.

After checking the .out log, we saw that an OOM occurred in the thread
"Datanode State Machine Thread", which killed the thread.
{code:java}
114370.799: Total time for which application threads were stopped: 1.0883622 seconds, Stopping threads took: 0.0002926 seconds
Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: GC overhead limit exceeded
114370.810: Application time: 0.0115941 seconds
{Heap before GC invocations=2946 (full 2680):
 PSYoungGen      total 3170304K, used 2846720K [0x00000006eab00000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 2846720K, 100% used [0x00000006eab00000,0x0000000798700000,0x0000000798700000)
  from space 323584K, 0% used [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
  to   space 324096K, 0% used [0x0000000798700000,0x0000000798700000,0x00000007ac380000)
 ParOldGen       total 6990848K, used 6990627K [0x0000000540000000, 0x00000006eab00000, 0x00000006eab00000)
  object space 6990848K, 99% used [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
 Metaspace       used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
  class space    used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
{code}


> Datanode State Machine Thread needs to handle OutOfMemoryError
> --------------------------------------------------------------
>
>                 Key: HDDS-4408
>                 URL: https://issues.apache.org/jira/browse/HDDS-4408
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 1.1.0
>            Reporter: Glen Geng
>            Priority: Major
>
> In the Tencent internal production environment, we found several dead DNs
> that never came back.
> We found that the thread "Datanode State Machine Thread - 0" does not exist
> in the jstack output, so no HeartbeatEndpointTask is created; the DNs soon
> become dead and do not recover unless they are restarted.
>
> After checking the .out log, we saw that an OOM occurred in the thread
> "Datanode State Machine Thread", which killed the thread.
> {code:java}
> 114370.799: Total time for which application threads were stopped: 1.0883622 seconds, Stopping threads took: 0.0002926 seconds
> Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: GC overhead limit exceeded
> 114370.810: Application time: 0.0115941 seconds
> {Heap before GC invocations=2946 (full 2680):
>  PSYoungGen      total 3170304K, used 2846720K [0x00000006eab00000, 0x00000007c0000000, 0x00000007c0000000)
>   eden space 2846720K, 100% used [0x00000006eab00000,0x0000000798700000,0x0000000798700000)
>   from space 323584K, 0% used [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
>   to   space 324096K, 0% used [0x0000000798700000,0x0000000798700000,0x00000007ac380000)
>  ParOldGen       total 6990848K, used 6990627K [0x0000000540000000, 0x00000006eab00000, 0x00000006eab00000)
>   object space 6990848K, 99% used [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
>  Metaspace       used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
>   class space    used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
> {code}
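
The failure mode described above (an Error escapes the thread's run loop, the thread dies, and no further heartbeat tasks are created) can be illustrated with a minimal sketch. This is not the Ozone code and not necessarily the fix chosen for this issue; ResilientLoopSketch, Step and start are hypothetical names introduced only for illustration. The sketch assumes one possible approach: catch Throwable inside the loop so that a single OutOfMemoryError does not terminate the thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ResilientLoopSketch {

    /** Hypothetical stand-in for one iteration of a state machine loop. */
    public interface Step {
        void run() throws Exception;
    }

    /**
     * Starts a named daemon thread that runs the step repeatedly until it is
     * interrupted. Any Throwable other than InterruptedException, including
     * OutOfMemoryError, is counted in 'failures' and the loop continues.
     */
    public static Thread start(String name, Step step,
                               AtomicInteger failures, long pauseMillis) {
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    step.run();
                } catch (InterruptedException e) {
                    // Restore the flag so the loop condition ends the thread.
                    Thread.currentThread().interrupt();
                    break;
                } catch (Throwable e) {
                    // Without this catch, an Error thrown by the step would
                    // propagate out of run() and the thread would terminate.
                    failures.incrementAndGet();
                }
                try {
                    // Pause between iterations, also after a failure, so a
                    // persistent error does not turn into a busy loop.
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, name);
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

Whether continuing after an OutOfMemoryError is safe depends on the application. An alternative design is to let the whole process exit, for example with the HotSpot option -XX:+ExitOnOutOfMemoryError, so that an external supervisor restarts the datanode instead of leaving it running without its state machine thread.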


