Shubham Sharma created HAWQ-1504:
------------------------------------
Summary: Namenode hangs during restart of docker environment
configured using incubator-hawq/contrib/hawq-docker/
Key: HAWQ-1504
URL: https://issues.apache.org/jira/browse/HAWQ-1504
Project: Apache HAWQ
Issue Type: Bug
Components: Command Line Tools
Reporter: Shubham Sharma
Assignee: Radar Lei
After setting up an environment using the instructions provided under
incubator-hawq/contrib/hawq-docker/, restarting the docker containers causes the
namenode to hang, because it runs namenode -format on every start.
Steps to reproduce this issue -
- Navigate to incubator-hawq/contrib/hawq-docker
- make stop
- make start
- docker exec -it centos7-namenode bash
- ps -ef | grep java
You can see namenode -format running.
{code}
[gpadmin@centos7-namenode data]$ ps -ef | grep java
hdfs 11 10 1 00:56 ? 00:00:06
/etc/alternatives/java_sdk/bin/java -Dproc_namenode -Xmx1000m
-Dhdfs.namenode=centos7-namenode -Dhadoop.log.dir=/var/log/hadoop/hdfs
-Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.5.0.0-1245/hadoop
-Dhadoop.id.str= -Dhadoop.root.logger=INFO,console
-Djava.library.path=:/usr/hdp/2.5.0.0-1245/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.5.0.0-1245/hadoop/lib/native
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
-Dhadoop.security.logger=INFO,NullAppender
org.apache.hadoop.hdfs.server.namenode.NameNode -format
{code}
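To check for this state from the host without opening a shell in the container, a minimal sketch follows. namenode_format_running is a hypothetical helper, not part of hawq-docker. It reads ps -ef output on stdin and succeeds only if a NameNode process was started with -format.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of hawq-docker).
# Reads `ps -ef` output on stdin; exits 0 if a NameNode process was started
# with -format, non-zero otherwise. The escaped dots keep this grep from
# matching its own command line when used with a local `ps -ef`.
namenode_format_running() {
  grep -q 'org\.apache\.hadoop\.hdfs\.server\.namenode\.NameNode -format'
}
```

Usage, assuming the container name centos7-namenode from the steps above:
docker exec centos7-namenode ps -ef | namenode_format_running && echo "namenode is stuck in -format"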
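Because namenode -format runs interactively, and the storage directory already
contains data from the previous run, at this stage it is waiting for a (Yes/No)
confirmation to re-format. Nothing answers the prompt, so the namenode stays
stuck indefinitely and HDFS is unavailable.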
Root cause of the problem -
In the Dockerfiles under
incubator-hawq/contrib/hawq-docker/centos6-docker/hawq-test and
incubator-hawq/contrib/hawq-docker/centos7-docker/hawq-test, the docker
ENTRYPOINT directive runs entrypoint.sh during container startup.
entrypoint.sh in turn executes start-hdfs.sh, which performs the following
check -
{code}
if [ ! -d /tmp/hdfs/name/current ]; then
su -l hdfs -c "hdfs namenode -format"
fi
{code}
My assumption is that this check looks for the fsimage and edit logs. If they
are not present, the script assumes that this is a first-time initialization and
formats the namenode. However, the path /tmp/hdfs/name/current does not exist on
the namenode, so the check triggers a format on every start.
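Instead of hard-coding a path, the check could ask Hadoop for the configured metadata directory. A minimal sketch, assuming hdfs getconf -confKey is available on the hdfs user's login PATH (as the original script already assumes for hdfs namenode) and that only the first entry of dfs.namenode.name.dir matters. namenode_name_dir is a hypothetical helper. It runs as the hdfs user because the default value depends on ${user.name}.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first NameNode metadata directory as a plain path.
# Runs getconf as the hdfs user, since the default value of
# dfs.namenode.name.dir is derived from ${user.name}.
namenode_name_dir() {
  local dirs first
  dirs=$(su -l hdfs -c "hdfs getconf -confKey dfs.namenode.name.dir") || return 1
  first=${dirs%%,*}                  # keep only the first comma-separated entry
  printf '%s\n' "${first#file://}"   # strip the file:// URI scheme, if present
}
```

A guard could then test for "$(namenode_name_dir)/current" rather than a fixed /tmp/hdfs/name/current.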
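From the namenode logs it is clear that the fsimage and edit logs are written
under /tmp/hadoop-hdfs/dfs/name/current. This is Hadoop's default location:
dfs.namenode.name.dir defaults to file://${hadoop.tmp.dir}/dfs/name, and
hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}, i.e. /tmp/hadoop-hdfs for
the hdfs user.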
{code}
2017-07-18 00:55:20,892 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: No
edit log streams selected.
2017-07-18 00:55:20,893 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
Planning to load image:
FSImageFile(file=/tmp/hadoop-hdfs/dfs/name/current/fsimage_0000000000000000000,
cpktTxId=0000000000000000000)
2017-07-18 00:55:20,995 INFO
org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode: Loading 1 INodes.
2017-07-18 00:55:21,064 INFO
org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Loaded FSImage in
0 seconds.
2017-07-18 00:55:21,065 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
Loaded image for txid 0 from
/tmp/hadoop-hdfs/dfs/name/current/fsimage_0000000000000000000
2017-07-18 00:55:21,084 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Need to save fs image?
false (staleImage=false, haEnabled=false, isRollingUpgrade=false)
2017-07-18 00:55:21,084 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog:
Starting log segment at 1
{code}
Thus, the wrong path in
incubator-hawq/contrib/hawq-docker/centos*-docker/hawq-test/start-hdfs.sh
causes the namenode to hang on each restart of the containers, making HDFS
unavailable.
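A minimal sketch of a corrected check for start-hdfs.sh, assuming the namenode keeps the default metadata location seen in the logs. NAME_DIR and format_namenode_if_needed are illustrative names, not taken from the repository. The sketch also passes -nonInteractive (a standard option of hdfs namenode -format in Hadoop 2.x), so that if the directory check is ever wrong again the format aborts instead of waiting for an answer.

```shell
#!/usr/bin/env bash
# Sketch of a corrected first-start check (illustrative names).
# Assumption: the namenode uses Hadoop's default metadata directory,
# which the logs show as /tmp/hadoop-hdfs/dfs/name.
NAME_DIR=${NAME_DIR:-/tmp/hadoop-hdfs/dfs/name}

format_namenode_if_needed() {
  if [ ! -d "${NAME_DIR}/current" ]; then
    # -nonInteractive makes the format abort rather than prompt when the
    # storage directory already holds data, so startup cannot block on it.
    su -l hdfs -c "hdfs namenode -format -nonInteractive"
  fi
}
```

Calling format_namenode_if_needed in place of the original if-block keeps the first-start behavior while skipping the format once /tmp/hadoop-hdfs/dfs/name/current exists.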