The issue is that the namenode-relation-joined hook assumes that the namenode 
service is stopped, which is not necessarily the case, especially when several 
compute-nodes are connected to the YARN master.
A solution would be to make sure the hook stops and then starts the namenode 
service. 
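
As an illustration, a minimal sketch of what the hook could run instead of a 
bare start (this reuses the hadoop-daemon.sh invocation from the transcript 
below; the real hook code may look different):

# Sketch only: restart instead of start; the stop is harmless if the
# namenode is not already running.
su hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf stop namenode' || true
su hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode'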

See below an excerpt of an email sent to a user: 
So this is the story of a YARN master node (based on the hdp-hadoop-7 charm) 
and 4 compute nodes (same charm). 
If you deploy it with multiple compute nodes at once, you get a failed 
namenode relation on the yarn-master side: 

unit-yarn-master-0[28041]: 2015-02-25 09:09:40 INFO 
unit.yarn-master/0.namenode-relation-joined logger.go:40 
subprocess.CalledProcessError: Command '['su', 'hdfs', '-c', 
'/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start 
namenode']' returned non-zero exit status 1
unit-yarn-master-0[28041]: 2015-02-25 09:09:40 ERROR juju.worker.uniter 
uniter.go:608 hook "namenode-relation-joined" failed: exit status 1

So I connected to yarn-master/0 and tried:

ubuntu@ip-172-31-42-86:~$ sudo su hdfs
hdfs@ip-172-31-42-86:/home/ubuntu$ /usr/lib/hadoop/sbin/hadoop-daemon.sh 
--config /etc/hadoop/conf start namenode
namenode running as process 9270. Stop it first.

So I stopped it: 
hdfs@ip-172-31-42-86:/home/ubuntu$ /usr/lib/hadoop/sbin/hadoop-daemon.sh 
--config /etc/hadoop/conf stop namenode
stopping namenode

But then when running:

juju resolved -r yarn-master/0

I would still run into the same issue. The trick is to remove the -r flag. 
What happens is that: 
* the hook is run as many times as there are compute nodes;  
* the error comes from the hook not testing whether the namenode service is 
already running, and trying to start it anyway instead of restarting it (see 
the sketch below). 
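
For illustration only, the missing guard could look something like the shell 
sketch below; the pid file path is an assumption based on the usual 
hadoop-daemon.sh convention and may well differ on this charm:

# Hypothetical check before starting the namenode (PIDFILE path is assumed).
PIDFILE=/var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid
if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    echo "namenode already running, nothing to do"
else
    su hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode'
fi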

So the workaround is to alternately stop the namenode service on the YARN side 
and resolve the unit on the juju client side:

On YARN side: 
hdfs@ip-172-31-42-86:/home/ubuntu$ /usr/lib/hadoop/sbin/hadoop-daemon.sh 
--config /etc/hadoop/conf stop namenode
stopping namenode

Then (on the client side):
juju resolved yarn-master/0 

Then on YARN side: 
hdfs@ip-172-31-42-86:/home/ubuntu$ /usr/lib/hadoop/sbin/hadoop-daemon.sh 
--config /etc/hadoop/conf stop namenode
stopping namenode

Then (on the client side; note that there is no -r, i.e. no retry):
juju resolved yarn-master/0 

Do that as many times as you have compute nodes (minus one, since the last 
time around the namenode will actually start) and you'll be OK.
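
If you would rather not do this by hand, a rough client-side loop along these 
lines should work (untested sketch; the node count and the pause are 
assumptions to adjust for your deployment):

# Assumed: 4 compute nodes, hence 3 stop/resolve rounds ("minus one" as above).
N=4
for i in $(seq 1 $((N - 1))); do
    # stop the namenode on the unit, then mark the hook resolved (no -r) on the client
    juju ssh yarn-master/0 "sudo su hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf stop namenode'"
    juju resolved yarn-master/0
    sleep 30   # rough pause so the next queued relation hook gets a chance to run
done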


** Package changed: hdp-pig (Ubuntu) => hdp-pig (Juju Charms Collection)

** Changed in: hdp-pig (Juju Charms Collection)
     Assignee: (unassigned) => Juju Big Data Development (bigdata-dev)

** Changed in: hdp-hadoop (Juju Charms Collection)
     Assignee: (unassigned) => Juju Big Data Development (bigdata-dev)

https://bugs.launchpad.net/bugs/1414080

Title:
  Race condition on relation when bundling hdp-hadoop

Status in hdp-hadoop package in Juju Charms Collection:
  New
Status in hdp-pig package in Juju Charms Collection:
  New

Bug description:
  I built a demo to run a Machine Learning workshop based on a blog post
  made by Hortonworks.

  There are 2 sides in the demo: 
  * Bundle: https://github.com/SaMnCo/bundle-flight-delay-demo
  * Charm: https://github.com/SaMnCo/charm-flight-delay-demo

  The bundle comprises:

  * YARN Master
  * 4x compute nodes
  * PIG colocated on YARN
  * ipython-notebook colocated on YARN

  When I deploy manually using the 00-deploy file provided in the
  bundle, everything goes well. However, when trying to deploy the
  bundle, it fails at relation creation.

  In the attached juju log collected at deployment, at line 11347 we see
  Hadoop crashing. Then at line 11402 the crash expands, and at line 11418 we
  discover the resource manager is not ready.

  I can reproduce the same behavior with a simpler bundle comprising:

  * YARN Master
  * 4x compute nodes
  * PIG colocated on YARN

  So it really seems to be related to the PIG/YARN/compute-node relations.

  I can also reproduce the same behavior from juju-deployer, which fails
  with:

  2015-01-23 16:51:21 [INFO] deployer.import: Adding relations...
  2015-01-23 16:51:23 [INFO] deployer.import:  Adding relation 
c00-yarn-master:namenode <-> c02-compute-node:datanode
  2015-01-23 16:51:24 [INFO] deployer.import:  Adding relation 
c00-yarn-master:resourcemanager <-> c02-compute-node:nodemanager
  2015-01-23 16:51:25 [INFO] deployer.import:  Adding relation 
c04-hdp-pig:namenode <-> c00-yarn-master:namenode
  2015-01-23 16:51:25 [INFO] deployer.import:  Adding relation 
c04-hdp-pig:resourcemanager <-> c00-yarn-master:resourcemanager
  2015-01-23 16:51:26 [INFO] deployer.import:  Adding relation 
c03-flight-delay-demo:notebook <-> c01-ipython-notebook:notebook
  2015-01-23 16:51:26 [DEBUG] deployer.import: Waiting for relation convergence 
60s
  2015-01-23 16:52:29 [ERROR] deployer.env: The following units had errors:
     unit: c00-yarn-master/0: machine: 1 agent-state: error details: hook 
failed: "namenode-relation-joined"
  2015-01-23 16:52:29 [INFO] deployer.cli: Deployment stopped. run time: 583.23

  Let me know if you need anything else! 
  Thanks!

