[
https://issues.apache.org/jira/browse/AMBARI-19435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yusaku Sako updated AMBARI-19435:
---------------------------------
Reporter: Vivek Sharma (was: Jonathan Hurley)
> NodeManager restart fails during HOU if it is on same host as RM
> ----------------------------------------------------------------
>
> Key: AMBARI-19435
> URL: https://issues.apache.org/jira/browse/AMBARI-19435
> Project: Ambari
> Issue Type: Bug
> Components: ambari-server
> Affects Versions: 2.5.0
> Reporter: Vivek Sharma
> Assignee: Jonathan Hurley
> Priority: Critical
> Fix For: 2.5.0
>
> Attachments: AMBARI-19435.patch
>
>
> *Steps*
> # Deploy an HDP-2.5.0.0 cluster with Ambari-2.5.0.0: a 4-node cluster with
> NodeManager installed on all hosts, NN HA enabled, and RM HA not enabled
> # Register version 2.5.3.0 and install the bits
> # Start HOU using the API and accept the manual prompts to sys-prep the hosts.
> Observe the wizard at the restart task for the host that runs RM and NM together
> *Result:*
> The Restart NodeManager task on the RM host fails with the following output:
> {code}
> 2016-12-20 18:32:39,446 -
> File['/var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'action':
> ['delete'], 'not_if': 'ambari-sudo.sh -H -E test -f
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'}
> 2016-12-20 18:32:39,459 - Execute['ulimit -c unlimited; export
> HADOOP_LIBEXEC_DIR=/usr/hdp/2.5.3.0-37/hadoop/libexec &&
> /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config
> /usr/hdp/2.5.3.0-37/hadoop/conf start nodemanager'] {'not_if':
> 'ambari-sudo.sh -H -E test -f
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid', 'user': 'yarn'}
> 2016-12-20 18:32:40,558 - Execute['ambari-sudo.sh -H -E test -f
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'not_if':
> 'ambari-sudo.sh -H -E test -f
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid', 'tries': 5,
> 'try_sleep': 1}
> 2016-12-20 18:32:40,576 - Skipping Execute['ambari-sudo.sh -H -E test -f
> /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E
> pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] due to not_if
> 2016-12-20 18:32:40,576 - Executing NodeManager Stack Upgrade post-restart
> 2016-12-20 18:32:40,578 - NodeManager executing "yarn node -list
> -states=RUNNING" to verify the node has rejoined the cluster...
> 2016-12-20 18:32:40,578 - checked_call['yarn node -list -states=RUNNING']
> {'user': 'yarn'}
> Command failed after 1 tries
> {code}
> Retrying the failed task succeeds.
> The failure appears to occur because RM is still down when NM is started on
> the host. After starting NM, the following command is run to verify that NM
> has come up:
> {code}
> yarn node -list -states=RUNNING
> {code}
> The command fails because it tries to connect to RM, which is still down,
> and the connection times out.
> As a possible fix, we may need to adjust the ordering in the HOU upgrade pack
> so that RM is started before NM in such cases.
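
The log above shows the post-restart check giving up after a single try, while a manual retry of the task succeeds. As an illustration only, the sketch below polls the same `yarn node -list -states=RUNNING` command several times before failing. This is not the fix proposed in the issue (reordering the upgrade pack) and not the content of the attached patch. The names `run_yarn_node_list` and `wait_for_nodemanager`, and the `tries`/`try_sleep` defaults, are hypothetical. It also assumes the NodeManager's hostname appears in the command output once the node has registered with RM.

```python
import subprocess
import time


def run_yarn_node_list():
    """Run the check command from the log; return (exit_code, output)."""
    proc = subprocess.run(
        ["yarn", "node", "-list", "-states=RUNNING"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def wait_for_nodemanager(hostname, run_cmd=run_yarn_node_list,
                         tries=5, try_sleep=1.0, sleep=time.sleep):
    """Hypothetical helper: poll until `hostname` is listed as RUNNING.

    Returns the 1-based attempt number that succeeded and raises
    RuntimeError if every attempt fails (for example, RM is still down).
    """
    for attempt in range(1, tries + 1):
        code, output = run_cmd()
        # Success only if the command ran cleanly and lists this host.
        if code == 0 and hostname in output:
            return attempt
        # Do not sleep after the final attempt.
        if attempt < tries:
            sleep(try_sleep)
    raise RuntimeError(
        "NodeManager on %s not RUNNING after %d tries" % (hostname, tries)
    )
```

Retrying only hides the ordering problem if RM comes up within the retry window. Starting RM before NM on a shared host, as suggested in the issue, addresses the cause directly.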