[ 
https://issues.apache.org/jira/browse/FALCON-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505041#comment-14505041
 ] 

Peeyush Bishnoi commented on FALCON-1165:
-----------------------------------------

On analysis, I have found that this issue happens when Falcon reloads cluster 
entities on restart and tries to ensure that the jar files in the HDFS working 
lib directory are up to date. If the HDFS service on a remote cluster is not 
available (for example, down for maintenance), Falcon fails to restart on the 
source cluster and logs the exception.

The proposed approach is that, when Falcon restarts and reloads the cluster 
entities, it should check whether the remote HDFS service (on cluster Z) is 
available. If it is not available, Falcon should not try to update the jar 
files in the HDFS working lib directory, but should still start on the source 
cluster (cluster X). With this, replication/processing can at least continue 
between the source cluster and the other available remote cluster (cluster Y). 
Please share your thoughts on this approach. 
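
To make the proposal concrete, here is a rough sketch of the reload loop I 
have in mind. It is only an illustration: ClusterReloadSketch, RemoteFs, 
ping() and updateWorkingLibJars() are hypothetical names, not existing Falcon 
or Hadoop APIs, and the real check would go through whatever HDFS client 
Falcon already uses.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the Falcon code base. */
class ClusterReloadSketch {

    /** Stand-in for the client used to reach one cluster's HDFS. */
    interface RemoteFs {
        /** Cheap availability probe; throws if the service is unreachable. */
        void ping() throws IOException;

        /** Brings the jars in the working lib directory up to date. */
        void updateWorkingLibJars() throws IOException;
    }

    /**
     * Reloads every cluster entity on startup. An I/O failure for one
     * cluster (probe or jar update) is logged and that cluster is skipped,
     * so the loop never aborts the restart.
     *
     * @return names of the clusters whose jar update was skipped
     */
    static List<String> reloadClusters(Map<String, RemoteFs> clusters) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, RemoteFs> entry : clusters.entrySet()) {
            String name = entry.getKey();
            RemoteFs fs = entry.getValue();
            try {
                fs.ping();                  // is the remote HDFS reachable?
                fs.updateWorkingLibJars();  // only attempted when it is
            } catch (IOException e) {
                System.err.println("Skipping jar update for cluster " + name
                        + ": " + e.getMessage());
                skipped.add(name);
            }
        }
        return skipped;
    }
}
```

With clusters Y (up) and Z (down), this would update the jars on Y, record Z 
as skipped, and let startup on cluster X continue. The skipped clusters would 
presumably need a retry once their HDFS service is back; that part is not 
covered here.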

> Falcon restart failed, if defined service in cluster entity is unreachable
> --------------------------------------------------------------------------
>
>                 Key: FALCON-1165
>                 URL: https://issues.apache.org/jira/browse/FALCON-1165
>             Project: Falcon
>          Issue Type: Bug
>            Reporter: Peeyush Bishnoi
>            Assignee: Peeyush Bishnoi
>             Fix For: 0.7
>
>
> Falcon fails to restart if any service defined in the cluster entity is 
> unreachable or down.
> For example, suppose there are clusters X, Y and Z. On cluster X, submit 
> cluster entities that point to the services of clusters Y and Z, and run 
> some replication jobs from cluster X to Y and to Z. If the HDFS service on 
> cluster Z later goes down for maintenance and the Falcon service on cluster 
> X needs to be restarted at the same time, Falcon will fail to restart on 
> cluster X. 
> This issue has been reported internally at Hortonworks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
