[ 
https://issues.apache.org/jira/browse/RATIS-878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

runzhiwang updated RATIS-878:
-----------------------------
    Description: 
*What's the problem ?*

 When run hadoop-ozone for 4 days, datanode memory leak.  When dump heap, I 
found there are 460710 instances of GrpcLogAppender. But there are only 6 
instances of SenderList, and each SenderList contains 1-2 instance of 
GrpcLogAppender. And there are a lot of logs related to 
[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
 {code:java}INFO impl.RaftServerImpl: 
1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting 
GrpcLogAppender for 
1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
 

 So there are a lot of GrpcLogAppender did not stop the Daemon Thread when 
removed from senders. 
 !screenshot-2.png! 

 !screenshot-3.png! 
 
*Why 
[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
 so many times ?*
1. As the image shows, when remove group, SegmentedRaftLog will close, then 
GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then 
GrpcLogAppender will be 
[restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
 and the new GrpcLogAppender throw exception again when find the 
SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . 
It results in an infinite restart of GrpcLogAppender.
2. Actually, when remove group, GrpcLogAppender will be stoped: 
RaftServerImpl::shutdown -> 
[RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
 -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will 
be closed:  RaftServerImpl::shutdown -> 
[ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
 ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, 
but the GrpcLogAppender was stopped asynchronously. So infinite restart of 
GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close.
 !screenshot-4.png! 

More details please refer it here 
[RATIS-840|https://issues.apache.org/jira/browse/RATIS-840].

  was:
*What's the problem ?*

 When run hadoop-ozone for 4 days, datanode memory leak.  When dump heap, I 
found there are 460710 instances of GrpcLogAppender. But there are only 6 
instances of SenderList, and each SenderList contains 1-2 instance of 
GrpcLogAppender. And there are a lot of logs related to 
[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
 {code:java}INFO impl.RaftServerImpl: 
1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting 
GrpcLogAppender for 
1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
 

 So there are a lot of GrpcLogAppender did not stop the Daemon Thread when 
removed from senders. 

 !image-2020-04-06-14-27-28-485.png! 

 !image-2020-04-06-14-27-39-582.png! 
 
*Why 
[LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
 so many times ?*
1. As the image shows, when remove group, SegmentedRaftLog will close, then 
GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. Then 
GrpcLogAppender will be 
[restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
 and the new GrpcLogAppender throw exception again when find the 
SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... . 
It results in an infinite restart of GrpcLogAppender.
2. Actually, when remove group, GrpcLogAppender will be stoped: 
RaftServerImpl::shutdown -> 
[RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
 -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog will 
be closed:  RaftServerImpl::shutdown -> 
[ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
 ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, 
but the GrpcLogAppender was stopped asynchronously. So infinite restart of 
GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog close.
 !screenshot-1.png! 

The details please refer it here 
[RATIS-840|https://issues.apache.org/jira/browse/RATIS-840].


> Infinite restart of  LogAppender
> --------------------------------
>
>                 Key: RATIS-878
>                 URL: https://issues.apache.org/jira/browse/RATIS-878
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Blocker
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>
> *What's the problem ?*
>  When run hadoop-ozone for 4 days, datanode memory leak.  When dump heap, I 
> found there are 460710 instances of GrpcLogAppender. But there are only 6 
> instances of SenderList, and each SenderList contains 1-2 instance of 
> GrpcLogAppender. And there are a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender did not stop the Daemon Thread when 
> removed from senders. 
>  !screenshot-2.png! 
>  !screenshot-3.png! 
>  
> *Why 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  so many times ?*
> 1. As the image shows, when remove group, SegmentedRaftLog will close, then 
> GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. 
> Then GrpcLogAppender will be 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throw exception again when find the 
> SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... 
> . It results in an infinite restart of GrpcLogAppender.
> 2. Actually, when remove group, GrpcLogAppender will be stoped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog 
> will be closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, 
> but the GrpcLogAppender was stopped asynchronously. So infinite restart of 
> GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog 
> close.
>  !screenshot-4.png! 
> More details please refer it here 
> [RATIS-840|https://issues.apache.org/jira/browse/RATIS-840].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to