[
https://issues.apache.org/jira/browse/ZOOKEEPER-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740161#comment-16740161
]
Michael Han commented on ZOOKEEPER-3240:
----------------------------------------
[~nixon] Good catch, the fix looks reasonable.
I've seen a similar issue in my production environment. The fix I made was on
the Leader side: I tracked the LearnerHandler threads associated with server
ids and made sure each server id has only a single LearnerHandler thread. This
also works in cases where the learners don't get a chance to close their
sockets, or where they did but for some reason the TCP reset never made it to
the leader. In any case, it's good to fix the resource leak on the learner side too.
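The leader-side bookkeeping described above could be sketched roughly as follows; this is an illustrative standalone class, not ZooKeeper's actual API (HandlerRegistry, register, and activeCount are hypothetical names):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: the leader keeps at most one handler thread per
// server id, interrupting any stale handler left over from a previous
// connection when the same server reconnects.
public class HandlerRegistry {
    private final ConcurrentMap<Long, Thread> handlersBySid = new ConcurrentHashMap<>();

    /**
     * Registers a new handler thread for the given server id. If a stale
     * handler from an earlier connection is still registered, interrupt it
     * so it can shut down and close its socket.
     */
    public void register(long sid, Thread newHandler) {
        Thread old = handlersBySid.put(sid, newHandler);
        if (old != null && old != newHandler) {
            old.interrupt(); // ask the stale handler to clean up
        }
    }

    /** Number of server ids with a registered handler. */
    public int activeCount() {
        return handlersBySid.size();
    }
}
```

The key property is that ConcurrentHashMap.put atomically returns the previous mapping, so a reconnect from the same server id always displaces exactly one stale handler.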
I also wonder how we could get into such a state on the Leader side in the
first place. On the leader, we do set a socket read timeout via setSoTimeout
for learner handler threads (after the socket is created via
serverSocket.accept), and each learner handler constantly polls / tries to
read from the socket afterwards. If a learner dies but leaves a valid socket
open, I would expect that on the leader side the LearnerHandler thread trying
to read from the dead learner's socket will eventually time out, throw a
SocketTimeoutException, and cause the LearnerHandler thread on the leader to
kill itself. That does not seem to be what I observed, though. Do you have any
insights on this?
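For illustration, here is a small standalone demo (not ZooKeeper code) of the setSoTimeout behavior described above: a blocking read on a socket with a read timeout set fails with SocketTimeoutException when the peer stays silent.

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Standalone demo: a blocking read on a socket with SO_TIMEOUT set throws
// SocketTimeoutException if the peer sends nothing, which is what the
// leader relies on to detect a silent learner on the read path.
public class SoTimeoutDemo {
    static boolean readTimesOut() throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            accepted.setSoTimeout(100); // read timeout in milliseconds
            InputStream in = accepted.getInputStream();
            try {
                in.read(); // blocks: the client never writes anything
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("read timed out: " + readTimesOut());
    }
}
```

Note that SO_TIMEOUT only covers blocking reads; if the handler is instead blocked writing to a connection whose peer has stopped reading, the read timeout never fires, which may be relevant to the behavior observed here.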
> Close socket on Learner shutdown to avoid dangling socket
> ---------------------------------------------------------
>
> Key: ZOOKEEPER-3240
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3240
> Project: ZooKeeper
> Issue Type: Improvement
> Components: server
> Affects Versions: 3.6.0
> Reporter: Brian Nixon
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There was a Learner that had two connections to the Leader after that Learner
> hit an unexpected exception while flushing a txn to disk, which shut down the
> previous follower instance and started a new one.
> {quote}2018-10-26 02:31:35,568 ERROR [SyncThread:3:ZooKeeperCriticalThread@48] - Severe unrecoverable error, from thread : SyncThread:3
> java.io.IOException: Input/output error
>     at java.base/sun.nio.ch.FileDispatcherImpl.force0(Native Method)
>     at java.base/sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:72)
>     at java.base/sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:395)
>     at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:457)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:548)
>     at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:769)
>     at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:246)
>     at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:172)
> 2018-10-26 02:31:35,568 INFO [SyncThread:3:ZooKeeperServerListenerImpl@42] - Thread SyncThread:3 exits, error code 1
> 2018-10-26 02:31:35,568 INFO [SyncThread:3:SyncRequestProcessor@234] - SyncRequestProcessor exited!{quote}
>
> It is supposed to close the previous socket, but that doesn't seem to be done
> anywhere in the code. This leaves the socket open with no one reading from
> it, which fills up the send queue and blocks the sender.
>
> Since the LearnerHandler didn't shut down gracefully, the learner queue size
> keeps growing, the JVM heap on the leader keeps growing and adds pressure to
> the GC, causing high GC times and latency in the quorum.
>
> The simple fix is to gracefully shut down the socket.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)