[ https://issues.apache.org/jira/browse/FALCON-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505041#comment-14505041 ]
Peeyush Bishnoi commented on FALCON-1165:
-----------------------------------------

On analysis, I found that this issue occurs when Falcon reloads cluster entities on restart and tries to ensure that the jar files in the HDFS working lib directory are up to date. If the HDFS service on a remote cluster is unavailable (for example, down for maintenance), Falcon fails to restart on the source cluster and logs the exception.

The proposed fix: when Falcon restarts, it should check whether the remote HDFS service (on cluster Z) is available while reloading cluster entities. If the service is not available, Falcon should not try to update the jar files in the HDFS working lib directory, and the Falcon service should still start on the source cluster (cluster X). That way, replication/processing from the source cluster can at least continue with the other available remote cluster (cluster Y).

Please share your thoughts on this approach.

> Falcon restart failed, if defined service in cluster entity is unreachable
> --------------------------------------------------------------------------
>
>                 Key: FALCON-1165
>                 URL: https://issues.apache.org/jira/browse/FALCON-1165
>             Project: Falcon
>          Issue Type: Bug
>            Reporter: Peeyush Bishnoi
>            Assignee: Peeyush Bishnoi
>             Fix For: 0.7
>
>
> Falcon fails to restart if any service in the cluster entity is unreachable
> or down.
> For example, suppose there are clusters X, Y and Z. On cluster X, submit cluster
> entities that point to the services of clusters Y and Z, then run some replication
> jobs from cluster X to Y and also to cluster Z. If the HDFS service on cluster Z
> later goes down for maintenance and, at the same time, the Falcon service on
> cluster X has to be restarted for some reason, Falcon fails to restart on
> cluster X.
> This issue has been reported internally at Hortonworks.
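
A rough sketch of the check proposed in the comment, to make the intended control flow concrete. This is not Falcon code: the class `ClusterReloadSketch`, the method `reloadClusters`, the `hdfsReachable` probe and the HDFS URIs are all hypothetical names chosen for illustration, and the actual jar update is reduced to a placeholder.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from the Falcon code base.
class ClusterReloadSketch {

    /** Outcome of reloading cluster entities at startup. */
    static final class ReloadResult {
        final List<String> updated = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();
    }

    /**
     * Walks over every cluster entity. The working lib directory is refreshed
     * only for clusters whose HDFS endpoint answers the probe; unreachable
     * clusters are recorded and skipped, so startup does not fail on them.
     */
    static ReloadResult reloadClusters(Map<String, String> clusterToHdfsUri,
                                       Predicate<String> hdfsReachable) {
        ReloadResult result = new ReloadResult();
        for (Map.Entry<String, String> entry : clusterToHdfsUri.entrySet()) {
            boolean reachable;
            try {
                reachable = hdfsReachable.test(entry.getValue());
            } catch (RuntimeException ex) {
                // A failing probe is treated the same as an unreachable service.
                reachable = false;
            }
            if (reachable) {
                // Placeholder for the real jar sync into the working lib directory.
                result.updated.add(entry.getKey());
            } else {
                result.skipped.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> clusters = new LinkedHashMap<>();
        clusters.put("Y", "hdfs://cluster-y:8020"); // made-up URI
        clusters.put("Z", "hdfs://cluster-z:8020"); // made-up URI

        // Simulate cluster Z being down for maintenance.
        ReloadResult r = reloadClusters(clusters, uri -> !uri.contains("cluster-z"));
        System.out.println("updated=" + r.updated + " skipped=" + r.skipped);
    }
}
```

In this sketch the probe is passed in as a parameter so that the availability check can be swapped or stubbed. How the skipped clusters get their jars updated once they come back is left open here, as it is in the comment.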