[
https://issues.apache.org/jira/browse/OOZIE-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517147#comment-15517147
]
Venkat Ranganathan commented on OOZIE-2654:
-------------------------------------------
Attaching a patch directly, as it is a simple fix. Sorry it took so long. I
have validated it in scenarios where Oozie fails to stop because of the issues
mentioned here.
> Zookeeper dependent services should not depend on Connectionstate to be valid
> before cleaning up
> ------------------------------------------------------------------------------------------------
>
> Key: OOZIE-2654
> URL: https://issues.apache.org/jira/browse/OOZIE-2654
> Project: Oozie
> Issue Type: Bug
> Components: HA
> Affects Versions: 4.2.0
> Reporter: Venkat Ranganathan
> Assignee: Venkat Ranganathan
> Attachments: OOZIE-2654.diff
>
>
> Currently, the ZKUtils, ZKLocks and ZKJobsConcurrency services do not
> properly tear down their ZooKeeper connections when no callback was
> received from ZooKeeper to update the connection state.
> This can happen if, for example, ZooKeeper closed the ZK session before
> any callback arrived to update the connection state. As a result, an Oozie
> server in HA mode can fail to terminate, with one or more sockets left in
> the CLOSE_WAIT state.
> Here is an instance of this issue.
> The network connections show one connection still in CLOSE_WAIT, waiting
> indefinitely:
> {quote} tcp6 143 0 x.x.x.1:46710 x.x.x.2:2181 CLOSE_WAIT 4688/java off
> (0.00/0/0)
> {quote}
> From the ZooKeeper logs:
> {quote}
> 2016-08-18 20:45:29,921 - INFO
> NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868 - Client
> attempting to establish new session at /x.x.x.1:46710
> 2016-08-18 20:45:29,926 - INFO CommitProcessor:1:ZooKeeperServer@617 -
> Established session 0x1569f576843000e with negotiated timeout 40000 for
> client /x.x.x.1:46710
> {quote}
> and later
> {quote}
> 2016-08-18 20:46:34,008 - INFO CommitProcessor:1:NIOServerCnxn@1007 - Closed
> socket connection for client /x.x.x.1:46710 which had sessionid
> 0x1569f576843000e
> {quote}
> The fix is to not check the connection state during service destroy and to
> tear down the ZK connections unconditionally.
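
To make the failure mode concrete, here is a minimal sketch of the pattern
described above. It is not the code from OOZIE-2654.diff: FakeZkClient,
ZkBackedService, the State enum and both destroy methods are hypothetical
stand-ins, and the exact condition used in the real services is assumed, not
quoted. The sketch only shows why cleanup that trusts a callback-updated
state can leave a connection open, and how unconditional cleanup avoids that.

```java
// Sketch only: every class and method here is a hypothetical stand-in,
// not an Oozie, Curator or ZooKeeper API.
class DestroySketch {

    /** Stand-in for a ZooKeeper client handle that owns a socket. */
    static class FakeZkClient {
        private boolean closed = false;

        void close() { closed = true; }

        boolean isClosed() { return closed; }
    }

    /** Connection states as a callback might report them (assumed names). */
    enum State { UNKNOWN, CONNECTED, LOST }

    /** Service that caches the last state reported by a client callback. */
    static class ZkBackedService {
        private final FakeZkClient client;
        // Stays UNKNOWN if the session ends before any callback is delivered.
        private volatile State lastKnownState = State.UNKNOWN;

        ZkBackedService(FakeZkClient client) { this.client = client; }

        /** Would be invoked by the client library on a state change. */
        void onStateChanged(State newState) { lastKnownState = newState; }

        /** Problematic pattern: cleanup depends on the cached state. */
        void destroyGuarded() {
            if (lastKnownState == State.CONNECTED) {
                client.close();
            }
        }

        /** Pattern the issue asks for: always release the connection. */
        void destroyAlways() {
            client.close();
        }
    }

    public static void main(String[] args) {
        // No callback was ever delivered, so the cached state is stale.
        FakeZkClient guarded = new FakeZkClient();
        new ZkBackedService(guarded).destroyGuarded();
        System.out.println("guarded destroy closed client: " + guarded.isClosed());

        FakeZkClient always = new FakeZkClient();
        new ZkBackedService(always).destroyAlways();
        System.out.println("unconditional destroy closed client: " + always.isClosed());
    }
}
```

In the guarded case the client is never closed, which corresponds to the
socket lingering in CLOSE_WAIT; in the unconditional case it is always
released, whatever the cached state says.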