[
https://issues.apache.org/jira/browse/BIGTOP-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander van der Meij updated BIGTOP-1573:
-------------------------------------------
Description:
I have used Bigtop to generate a set of RPMs for the purpose of deploying
multi-node Hadoop clusters on CentOS 7. All the components work well, save for
one network issue.
It seems that the Hadoop daemons, when started at boot through their init
scripts, do not wait for network initialisation to complete before they
themselves are processed. As a result, when I reboot, for example, one of my
datanodes, the hadoop-hdfs-datanode process is started using
"localhost.localdomain" as its hostname - and it also advertises itself as such
to the ResourceManager, leading to all sorts of connectivity problems in a
multi-node environment.
I first noticed this problem when, after a reboot, I saw log files being
created of the form /var/log/hadoop-hdfs-datanode-localhost.localdomain.log.
When I restart the hdfs-datanode process using the same init scripts, the
correct /var/log/hadoop-hdfs-datanode-(FQDN).log is created.
I believe the problem is caused by the introduction of systemd in CentOS 7;
init scripts are run in parallel and there are no constraints present in the
Hadoop init scripts that instruct them to wait until network initialisation is
complete.
Now for the good news: adding $network to the Required-Start/Stop list for all
Hadoop daemons solves the issue for me:
/etc/init.d/hadoop-hdfs-datanode:
# Required-Start: $syslog $remote_fs $network
# Required-Stop: $syslog $remote_fs $network
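For anyone wanting to try the same workaround on an installed node, the header
change above could be applied with something like the following. This is only a
rough sketch: the function name add_network_dep is made up for illustration, it
assumes GNU sed (as shipped on CentOS 7), and it assumes the headers look
exactly like the two lines shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Bigtop): append $network to the LSB
# Required-Start and Required-Stop lines of one init script.
# Lines that already list $network are left untouched, so it is safe to rerun.
add_network_dep() {
    script="$1"
    # -E: extended regex, -i: edit in place (GNU sed).
    # Only header lines without $network get " $network" appended.
    sed -i -E '/^# Required-(Start|Stop):/{/\$network/!s/[[:space:]]*$/ $network/;}' "$script"
}

# Example use (assumption: the Hadoop init scripts match this glob):
#   for f in /etc/init.d/hadoop-*; do add_network_dep "$f"; done
```

On a systemd host, running "systemctl daemon-reload" afterwards is probably
needed so that the units generated from the init scripts pick up the new
headers.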
was:
I have used Bigtop to generate a set of RPMs for the purpose of deploying
multi-node Hadoop clusters. All the components work well, save for one network
issue.
It seems that the Hadoop daemons, when started at boot through their init
scripts, do not wait for network initialisation to complete before they
themselves are processed. As a result, when I reboot, for example, one of my
datanodes, the hadoop-hdfs-datanode process is started using
"localhost.localdomain" as its hostname - and it also advertises itself as such
to the ResourceManager, leading to all sorts of connectivity problems in a
multi-node environment.
I first noticed this problem when, after a reboot, I saw log files being
created of the form /var/log/hadoop-hdfs-datanode-localhost.localdomain.log.
When I restart the hdfs-datanode process using the same init scripts, the
correct /var/log/hadoop-hdfs-datanode-(FQDN).log is created.
I believe the problem is caused by the introduction of systemd in CentOS 7;
init scripts are run in parallel and there are no constraints present in the
Hadoop init scripts that instruct them to wait until network initialisation is
complete.
Now for the good news: adding $network to the Required-Start/Stop list for all
Hadoop daemons solves the issue for me:
/etc/init.d/hadoop-hdfs-datanode:
# Required-Start: $syslog $remote_fs $network
# Required-Stop: $syslog $remote_fs $network
> rpm init scripts do not wait for network
> ----------------------------------------
>
> Key: BIGTOP-1573
> URL: https://issues.apache.org/jira/browse/BIGTOP-1573
> Project: Bigtop
> Issue Type: Bug
> Components: rpm
> Affects Versions: 0.8.0
> Environment: CentOS 7
> Reporter: Alexander van der Meij
> Labels: build