[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089143#comment-17089143
 ] 

runzhiwang edited comment on RATIS-840 at 4/22/20, 12:00 AM:
-------------------------------------------------------------

[~szetszwo] Thanks for your explanation.
{quote}There may be a bug but the correct fix is not to pollute SenderList 
since the bug is not in SenderList.  I am fine to add the stop call 
outside.{quote}
I agree with you. I will change the patch.
{quote}As long as GrpcLogAppender could stop, it can be infinitely restarted.  
We don't have to wait until it becomes dead.{quote}
I agree with you. 

For this problem, if we wait until the old GrpcLogAppender becomes dead, the 
GrpcLogAppender will not restart infinitely: the old GrpcLogAppender is stopped 
before log.close, so if it is already dead before log.close, it has no chance 
to throw an exception when the log is closed, and therefore it will not be 
restarted infinitely.
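A minimal sketch of the ordering I mean (the joinDaemon helper here is illustrative, not existing Ratis API):
{code:java}
// Sketch only: stop the old appender and wait for its daemon thread to die
// before closing the log, so it can never observe a closed log and trigger
// another restart.
void stopLeaderThenCloseLog(LogAppender oldAppender, RaftLog log) throws IOException, InterruptedException {
  oldAppender.stopAppender();   // signal the appender daemon to stop
  oldAppender.joinDaemon();     // hypothetical: block until the daemon thread is dead
  log.close();                  // only now is it safe to close the log
}
{code}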

For other problems, such as an exception that was not caused by log.close, I 
added the server.isAlive() condition to avoid infinite restarts. But that is 
also not sufficient: for example, the server may be alive while the restarted 
GrpcLogAppender cannot work as expected and throws an exception, in which case 
it will also restart infinitely.

Or maybe we should limit the frequency of GrpcLogAppender restarts: if it 
exceeds a threshold, we should stop restarting it for good.
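For example, a simple counter could gate the restart (just a sketch, not part of the current patch; the class name and threshold are made up):
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: allow at most a fixed number of restarts per appender. */
class RestartLimiter {
  private final int maxRestarts;
  private final AtomicInteger restarts = new AtomicInteger();

  RestartLimiter(int maxRestarts) {
    this.maxRestarts = maxRestarts;
  }

  /** @return true if one more restart is still allowed. */
  boolean tryAcquireRestart() {
    return restarts.incrementAndGet() <= maxRestarts;
  }
}

// Usage in LeaderState::restartSender (illustrative):
//   if (limiter.tryAcquireRestart()) { restartSender(sender); }
//   else { sender.stopAppender(); }   // give up instead of restarting forever
{code}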

If a better solution exists, please share it with me, and I will change the patch.



> Memory leak of LogAppender
> --------------------------
>
>                 Key: RATIS-840
>                 URL: https://issues.apache.org/jira/browse/RATIS-840
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Critical
>         Attachments: RATIS-840.001.patch, RATIS-840.002.patch, 
> image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png, 
> screenshot-1.png, screenshot-2.png
>
>
> *What's the problem?*
>  After running hadoop-ozone for 4 days, the datanode leaked memory. When I 
> dumped the heap, I found 460710 instances of GrpcLogAppender, but only 6 
> instances of SenderList, and each SenderList contains only 1-2 instances of 
> GrpcLogAppender. There are also a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}
> INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5->229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a
> {code}
>  
>  So there are a lot of GrpcLogAppender instances that did not stop their 
> daemon thread when they were removed from senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why is 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  called so many times?*
> 1. As the image shows, when the group is removed, SegmentedRaftLog is closed, 
> and GrpcLogAppender throws an exception when it finds that the 
> SegmentedRaftLog was closed. GrpcLogAppender is then 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throws the same exception when it finds the 
> SegmentedRaftLog was closed, so it is restarted again, and so on. This results 
> in an infinite restart loop of GrpcLogAppender.
> 2. Actually, when the group is removed, GrpcLogAppender will be stopped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender; then SegmentedRaftLog 
> will be closed: RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Although RoleInfo::shutdownLeaderState is called before 
> ServerState:close, the GrpcLogAppender is stopped asynchronously. So the 
> infinite restart of GrpcLogAppender happens whenever GrpcLogAppender stops 
> after SegmentedRaftLog is closed (see the guard sketch below).
>  !screenshot-1.png! 
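> A minimal sketch of the kind of guard that would break this loop (sketch only; raftLog.isOpened() and handleAppenderFailure are illustrative names, not necessarily the real Ratis API): only restart an appender while the server is still running and its log is still open, otherwise stop it for good.
> {code:java}
> // Illustrative only: refuse to restart once the server is shutting down or
> // its RaftLog has been closed.
> void handleAppenderFailure(LogAppender sender) {
>   if (server.isAlive() && raftLog.isOpened()) {  // hypothetical checks
>     restartSender(sender);                       // transient failure: restart
>   } else {
>     sender.stopAppender();                       // shutting down: stop for good
>   }
> }
> {code}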
> *Why did GrpcLogAppender not stop the daemon thread when removed from 
> senders?*
>  I found a lot of GrpcLogAppender threads blocked inside log4j. I think 
> GrpcLogAppender restarts too fast and then gets blocked in log4j.
>  !screenshot-2.png! 
> *Can the new GrpcLogAppender work normally?*
> 1. Even without the above problem, the newly created GrpcLogAppender still 
> cannot work normally. 
> 2. When a new GrpcLogAppender is created, a new FollowerInfo is also created: 
> LeaderState::addAndStartSenders -> 
> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new 
> FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 3. When the newly created GrpcLogAppender appends an entry to the follower, 
> the follower responds with SUCCESS.
> 4. Then LeaderState::updateCommit -> 
> [LeaderState::getMajorityMin|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599]
>  -> 
> [voterLists.get(0)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607].
>  {color:#DE350B}The error happens because voterLists.get(0) returns the 
> FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new 
> GrpcLogAppender.{color}
> 5. Because the majority commit computed from the FollowerInfo of the old 
> GrpcLogAppender never changes, the leader cannot update the commit index even 
> though the follower has appended the entry successfully. So the newly created 
> GrpcLogAppender can never work normally (see the sketch after point 6).
> 6. The reason the unit test runTestRestartLogAppender passes is that it does 
> not stop the old GrpcLogAppender, and it is the old GrpcLogAppender, not the 
> new one, that appends entries to the follower. If the old GrpcLogAppender 
> were stopped, runTestRestartLogAppender would fail.
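> A sketch of one way to avoid the stale FollowerInfo in points 4 and 5 (illustrative only; getFollower() and a FollowerInfo-taking newLogAppender are assumptions, not the actual patch): reuse the FollowerInfo that is already registered in voterLists when restarting the sender, so LeaderState::updateCommit keeps seeing the follower's real progress.
> {code:java}
> // Illustrative only: hand the existing FollowerInfo to the replacement
> // appender instead of creating a fresh one that voterLists never sees.
> LogAppender restartSender(LogAppender old) {
>   final FollowerInfo info = old.getFollower();          // reuse the registered info
>   final LogAppender replacement = newLogAppender(info); // hypothetical factory taking FollowerInfo
>   replacement.startAppender();
>   return replacement;
> }
> {code}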


