[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740161#comment-16740161
 ] 

Michael Han commented on ZOOKEEPER-3240:
----------------------------------------

[~nixon] Good catch, the fix looks reasonable. 

I've seen a similar issue in my production environment. The fix I made was on the
Leader side, where I tracked the LearnerHandler threads associated with server
ids and made sure each server id has only a single LearnerHandler thread. That
also works in cases where the learners don't get a chance to close their
sockets, or they do but for some reason the TCP reset never makes it to the
leader. In any case, it's good to fix the resource leak on the learner side.
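
For reference, a rough sketch of that leader-side workaround. This is not the
actual patch; the registry class and method names are made up for illustration,
and it only assumes the existing LearnerHandler.shutdown() behavior:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.zookeeper.server.quorum.LearnerHandler;

// Hypothetical sketch: keep at most one LearnerHandler per server id on the leader.
class LearnerHandlerRegistry {
    private final Map<Long, LearnerHandler> handlersBySid = new ConcurrentHashMap<>();

    // Called when the leader starts a new LearnerHandler for a follower/observer.
    void register(long sid, LearnerHandler handler) {
        LearnerHandler stale = handlersBySid.put(sid, handler);
        if (stale != null && stale != handler) {
            // The learner reconnected but its old socket never got closed
            // (or the TCP reset never reached the leader); shut the stale
            // handler down so only one thread serves this sid.
            stale.shutdown();
        }
    }

    // Called when a LearnerHandler exits, so it never tears down a newer handler.
    void unregister(long sid, LearnerHandler handler) {
        handlersBySid.remove(sid, handler);
    }
}
{code}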

I also wonder how we get into such a state on the Leader side in the first
place. On the leader, we do set a socket read timeout via setSoTimeout for
learner handler threads (after the socket is created via serverSocket.accept),
and each learner handler keeps polling / trying to read from that socket
afterwards. If a learner dies but leaves a valid socket open, I would expect
the LearnerHandler thread on the leader that is reading from the dead
learner's socket to eventually time out, throw SocketTimeoutException, and
kill itself. That does not seem to be what I observed, though. Do you have any
insights on this?
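
To illustrate the behavior I was expecting (simplified, not the real
LearnerHandler code; the timeout value and the read loop are assumptions):

{code:java}
import java.io.DataInputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Illustrative only: a read timeout on the accepted learner socket should
// surface as SocketTimeoutException and end the handler loop.
class HandlerTimeoutSketch {
    static void serve(ServerSocket serverSocket, int tickTime, int syncLimit) throws IOException {
        Socket learnerSocket = serverSocket.accept();
        // Analogous to the read timeout the leader sets for learner handlers.
        learnerSocket.setSoTimeout(tickTime * syncLimit);
        try (DataInputStream in = new DataInputStream(learnerSocket.getInputStream())) {
            while (true) {
                in.readInt(); // blocks; times out if the learner goes silent
            }
        } catch (SocketTimeoutException e) {
            // Expected path when the learner dies but leaves the socket open:
            // the handler tears itself down here instead of lingering forever.
            learnerSocket.close();
        }
    }
}
{code}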

> Close socket on Learner shutdown to avoid dangling socket
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-3240
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3240
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 3.6.0
>            Reporter: Brian Nixon
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> There was a Learner that had two connections to the Leader after that Learner
> hit an unexpected exception while flushing a txn to disk, which shuts down
> the previous follower instance and restarts a new one.
>  
> {quote}2018-10-26 02:31:35,568 ERROR 
> [SyncThread:3:ZooKeeperCriticalThread@48] - Severe unrecoverable error, from 
> thread : SyncThread:3
> java.io.IOException: Input/output error
>         at java.base/sun.nio.ch.FileDispatcherImpl.force0(Native Method)
>         at 
> java.base/sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:72)
>         at 
> java.base/sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:395)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:457)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:548)
>         at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:769)
>         at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:246)
>         at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:172)
> 2018-10-26 02:31:35,568 INFO  [SyncThread:3:ZooKeeperServerListenerImpl@42] - 
> Thread SyncThread:3 exits, error code 1
> 2018-10-26 02:31:35,568 INFO [SyncThread:3:SyncRequestProcessor@234] - 
> SyncRequestProcessor exited!{quote}
>  
> The previous socket is supposed to be closed, but that doesn't seem to be done
> anywhere in the code. This leaves the socket open with no one reading from
> it, which fills the send queue and blocks the sender.
>  
> Since the LearnerHandler doesn't shut down gracefully, the learner queue size
> keeps growing, the JVM heap size on the leader keeps growing, adding pressure
> on the GC and causing high GC times and latency in the quorum.
>  
> The simple fix is to gracefully shut down the socket.
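>  
> A minimal sketch of the idea (names are assumed and this is not the exact
> patch; it only assumes the learner keeps its leader connection in a socket):
>  
> {code:java}
> import java.io.IOException;
> import java.net.Socket;
>
> // Sketch only: close the learner's connection to the leader during shutdown
> // so the LearnerHandler on the leader sees EOF instead of a dangling socket
> // that nobody reads from.
> class LearnerShutdownSketch {
>     static void closeLeaderSocket(Socket sock) {
>         if (sock == null || sock.isClosed()) {
>             return;
>         }
>         try {
>             sock.close();
>         } catch (IOException e) {
>             // Best effort; the learner is shutting down anyway.
>         }
>     }
> }
> {code}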



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
