[ 
https://issues.apache.org/jira/browse/HDFS-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257477#comment-16257477
 ] 

Bharat Viswanadham commented on HDFS-12834:
-------------------------------------------

[~brahmareddy]
Just want to know: is there any advantage to this approach over the one in the patch?

> DFSZKFailoverController on error exits with 0 error code
> --------------------------------------------------------
>
>                 Key: HDFS-12834
>                 URL: https://issues.apache.org/jira/browse/HDFS-12834
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.7.3, 3.0.0-alpha4
>            Reporter: Zbigniew Kostrzewa
>            Assignee: Bharat Viswanadham
>         Attachments: HDFS-12834.00.patch
>
>
> On error, {{DFSZKFailoverController}} exits with return code 0, which causes 
> problems when integrating it with scripts and monitoring tools. For example, 
> systemd, when configured to restart the service only on failure, does not 
> restart ZKFC because it exited with 0.
> In my case, systemd reported that zkfc exited successfully, but in the logs 
> I found this:
> {noformat}
> 2017-11-14 05:33:55,075 INFO org.apache.zookeeper.ClientCnxn: Client session 
> timed out, have not heard from server in 3334ms for sessionid 
> 0x15fb794bd240001, closing socket connection and attempting reconnect
> 2017-11-14 05:33:55,178 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Session disconnected. Entering neutral mode...
> 2017-11-14 05:33:55,564 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server 10.9.4.73/10.9.4.73:2182. Will not attempt to 
> authenticate using SASL (unknown error)
> 2017-11-14 05:33:55,566 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to 10.9.4.73/10.9.4.73:2182, initiating session
> 2017-11-14 05:33:55,569 INFO org.apache.zookeeper.ClientCnxn: Session 
> establishment complete on server 10.9.4.73/10.9.4.73:2182, sessionid = 
> 0x15fb794bd240001, negotiated timeout = 5000
> 2017-11-14 05:33:55,570 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Session connected.
> 2017-11-14 05:33:58,230 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
> additional data from server sessionid 0x15fb794bd240001, likely server has 
> closed socket, closing socket connection and attempting reconnect
> 2017-11-14 05:33:58,335 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Session disconnected. Entering neutral mode...
> 2017-11-14 05:33:58,402 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server 10.9.4.138/10.9.4.138:2181. Will not attempt to 
> authenticate using SASL (unknown error)
> 2017-11-14 05:33:58,403 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to 10.9.4.138/10.9.4.138:2181, initiating session
> 2017-11-14 05:33:58,406 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
> additional data from server sessionid 0x15fb794bd240001, likely server has 
> closed socket, closing socket connection and attempting reconnect
> 2017-11-14 05:33:59,218 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server 10.9.4.228/10.9.4.228:2183. Will not attempt to 
> authenticate using SASL (unknown error)
> 2017-11-14 05:33:59,219 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to 10.9.4.228/10.9.4.228:2183, initiating session
> 2017-11-14 05:33:59,221 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
> additional data from server sessionid 0x15fb794bd240001, likely server has 
> closed socket, closing socket connection and attempting reconnect
> 2017-11-14 05:34:01,094 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server 10.9.4.73/10.9.4.73:2182. Will not attempt to 
> authenticate using SASL (unknown error)
> 2017-11-14 05:34:01,094 INFO org.apache.zookeeper.ClientCnxn: Client session 
> timed out, have not heard from server in 1773ms for sessionid 
> 0x15fb794bd240001, closing socket connection and attempting reconnect
> 2017-11-14 05:34:01,196 FATAL org.apache.hadoop.ha.ActiveStandbyElector: 
> Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further 
> znode monitoring connection errors.
> 2017-11-14 05:34:02,153 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x15fb794bd240001 closed
> 2017-11-14 05:34:02,154 FATAL org.apache.hadoop.ha.ZKFailoverController: 
> Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. 
> Not retrying further znode monitoring connection errors.
> 2017-11-14 05:34:02,154 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> 2017-11-14 05:34:05,208 INFO org.apache.hadoop.ipc.Server: Stopping server on 
> 8019
> 2017-11-14 05:34:05,487 INFO org.apache.hadoop.ipc.Server: Stopping IPC 
> Server listener on 8019
> 2017-11-14 05:34:05,488 INFO org.apache.hadoop.ipc.Server: Stopping IPC 
> Server Responder
> 2017-11-14 05:34:05,487 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Yielding from election
> 2017-11-14 05:34:05,488 INFO org.apache.hadoop.ha.HealthMonitor: Stopping 
> HealthMonitor thread
> 2017-11-14 05:34:05,490 FATAL 
> org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Got a fatal error, 
> exiting now
> java.lang.RuntimeException: ZK Failover Controller failed: Received stat 
> error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode 
> monitoring connection errors.
>         at 
> org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:369)
>         at 
> org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:238)
>         at 
> org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
>         at 
> org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
>         at 
> org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
>         at 
> org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
>         at 
> org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:181)
> {noformat}
> The code that seems responsible is in {{DFSZKFailoverController.java}}:
> {code}
>   public static void main(String args[])
>       throws Exception {
> ...
>     int retCode = 0;
>     try {
>       retCode = zkfc.run(parser.getRemainingArgs());
>     } catch (Throwable t) {
>       LOG.fatal("Got a fatal error, exiting now", t); 
>     }   
>     System.exit(retCode);
>   }
> {code}
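
To make the problem in the quoted {{main}} concrete: {{retCode}} is only assigned when {{zkfc.run(...)}} returns normally, so any {{Throwable}} leaves it at 0. Below is a minimal, standalone sketch of one way to map a fatal error to a non-zero code. It is not the content of HDFS-12834.00.patch; the class name {{ExitCodeSketch}}, the helper {{runAndGetExitCode}} and the value of {{FATAL_EXIT_CODE}} are assumptions made up for illustration, and the {{Callable}} stands in for the call to {{zkfc.run(...)}}.

```java
import java.util.concurrent.Callable;

// Illustrative sketch only; none of these names exist in Hadoop.
class ExitCodeSketch {
    // Hypothetical non-zero code for an unexpected failure.
    // The issue itself does not prescribe a particular value.
    static final int FATAL_EXIT_CODE = 1;

    /**
     * Runs the task and returns the code the process should exit with.
     * A normal return passes the task's own code through unchanged.
     * Unlike the quoted main(), a Throwable yields a non-zero code
     * instead of leaving the default 0 in place.
     */
    static int runAndGetExitCode(Callable<Integer> task) {
        try {
            return task.call();
        } catch (Throwable t) {
            System.err.println("Got a fatal error, exiting now: " + t);
            return FATAL_EXIT_CODE;
        }
    }
}
```

A {{main}} method built on this would end with {{System.exit(ExitCodeSketch.runAndGetExitCode(...))}}, so a supervisor such as systemd configured to restart only on failure would see the non-zero status.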



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
