Hi dev,
I just hit another problem with the ansible script for mesos-deployment. This
issue is related to creating instances in EC2 using the ansible playbook. The
fix is described below.
In particular, when you run the command (which would spin up 4 machines in EC2):
ansible-playbook -i hosts site.yml -t "ec2"
you might see the below authentication error:
TASK [ec2 : create a aws instace/s] ********************************************
failed: [localhost] (item=gs-mesos-master-1) => {"failed": true, "item":
"gs-mesos-master-1", "msg": "No handler was ready to authenticate. 1 handlers
were checked. ['HmacAuthV4Handler'] Check your credentials"}
failed: [localhost] (item=gs-mesos-master-2) => {"failed": true, "item":
"gs-mesos-master-2", "msg": "No handler was ready to authenticate. 1 handlers
were checked. ['HmacAuthV4Handler'] Check your credentials"}
failed: [localhost] (item=gs-mesos-master-3) => {"failed": true, "item":
"gs-mesos-master-3", "msg": "No handler was ready to authenticate. 1 handlers
were checked. ['HmacAuthV4Handler'] Check your credentials"}
failed: [localhost] (item=gs-mesos-slave-1) => {"failed": true, "item":
"gs-mesos-slave-1", "msg": "No handler was ready to authenticate. 1 handlers
were checked. ['HmacAuthV4Handler'] Check your credentials"}
This is because the ansible playbook is not able to authenticate the user, even
if you have updated the “roles/ec2/vars/aws-credential.yml” file with your AWS
access & secret keys.
I was able to resolve this issue by adding the aws_access_key and
aws_secret_key lines shown below to the “roles/ec2/tasks/main.yml” file, which
runs the task that creates the EC2 instances.
- name: create a aws instace/s
  ec2:
    aws_access_key: "{{ aws_access_key }}"
    aws_secret_key: "{{ aws_secret_key }}"
    key_name: "{{ key_name }}"
    region: us-east-1
Basically, this ansible task had no way of knowing the user credentials when it
tried to create the instance(s), hence the error. Hope this helps!
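If it helps anyone, a quick check along the following lines could confirm that the task file actually passes both credential parameters before you run the playbook. This is only a sketch: task_passes_credentials is a hypothetical helper name, not something from the mesos-deployment repo, and it assumes the parameters appear as "aws_access_key:" and "aws_secret_key:" lines in the tasks file, as in the snippet above.

```shell
# Hypothetical helper (not part of the mesos-deployment repo): succeed only
# if the given tasks file sets both credential parameters for the ec2 module.
# Assumes each parameter appears on its own line as "<param>: ...".
task_passes_credentials() {
    tasks_file="$1"
    for param in aws_access_key aws_secret_key; do
        # Look for the parameter name at the start of a (possibly indented) line.
        if ! grep -Eq "^[[:space:]]*${param}:" "$tasks_file"; then
            echo "not set in task: ${param}"
            return 1
        fi
    done
    echo "credentials are passed to the task"
}
```

For example, "task_passes_credentials roles/ec2/tasks/main.yml" could be run from the playbook directory before "ansible-playbook -i hosts site.yml -t "ec2"". It only checks that the parameters are present in the task, not that the keys themselves are valid.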
@Shameera,
Is this a valid fix? If yes, could you update the ansible script? Thanks in
advance.
Thanks and Regards,
Gourav Shenoy
From: Suresh Marru <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, September 16, 2016 at 11:02 PM
To: Airavata Dev <[email protected]>
Subject: Re: Spinup Mesos-Marathon Cluster for Hybrid Scheduling
Hi Gourav,
Thank you for this excellent communication. I hope others will follow suit with
such mailing list updates. When you post such nontrivial diagnoses to the
mailing list, others having trouble will be able to search for this thread and
follow it to fix their problems.
Hoping to see a lot more dev list threads similar to this one.
Suresh
On Sep 16, 2016, at 10:16 PM, Shenoy, Gourav Ganesh
<[email protected]> wrote:
Hi dev,
I finally managed to get the mesos-marathon cluster up & running using the
Ansible script. There were a couple of issues because of which things were
failing. I have listed the problems faced during installation & the solutions
that fixed things for me.
1. Marathon was not getting installed: Marathon released
a new build (version: 1.3.0-1.0.506.el7) 2 days ago and apparently the RPM for
this version is corrupt, so a plain “yum install marathon” fails. To get
around this, I listed all versions of marathon present in the repository.
# yum --showduplicates list marathon | expand
marathon.x86_64 1.1.3-1.0.503.el7 mesosphere
marathon.x86_64 1.3.0-1.0.506.el7 mesosphere
The next latest version was 1.1.3-1.0.503.el7 which seemed stable, and hence I
updated the ansible task to use this version for marathon.
In “roles/mesos-master/tasks/main.yml” I updated the following:
- name: install mesos and marathon
  yum:
    name: "{{ item }}"
  with_items:
    - mesos
    - marathon-1.1.3-1.0.503.el7
The mesos-marathon cluster installation worked perfectly fine after this change.
2. Even after this, the command “mesos-resolve `cat /etc/mesos/zk`” was
failing with the error:
Failed to obtain the IP address for 'ip-172-30-1-197'; the DNS service may not be able to resolve it: Name or service not known
Apparently it couldn’t resolve the hostname of the local master machine. I
resolved this issue by adding a host entry on each master node.
E.g., on the master node which threw the above error, I added this host entry (/etc/hosts):
172.30.1.197 ip-172-30-1-197
After this I was able to get the master-ip and visit the mesos dashboard
(master-ip:5050).
3. I noticed that although I was able to view the mesos dashboard, I
couldn’t access the marathon dashboard. The connection to <master-ip>:8080 was
getting refused. I then restarted the marathon service on the master node –
sudo service marathon restart. After this I was able to access the marathon
dashboard as well.
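For the host entry in item 2, a small helper like the one below could add the line without duplicating it when run more than once. It is a rough sketch rather than anything from the playbooks: add_host_entry is a hypothetical name, and the optional third argument (the file to edit) is only there so it can be tried on a scratch file first.

```shell
# Hypothetical helper (not part of the playbooks): append "IP HOSTNAME" to a
# hosts file unless that exact line is already present. The file defaults to
# /etc/hosts, which needs root privileges to modify.
add_host_entry() {
    ip="$1"
    name="$2"
    hosts_file="${3:-/etc/hosts}"
    # -x matches whole lines only; -F treats the pattern as a fixed string.
    if ! grep -qxF "${ip} ${name}" "$hosts_file"; then
        printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
    fi
}
```

Run as root on the affected master, the entry from item 2 would be "add_host_entry 172.30.1.197 ip-172-30-1-197". Note that it only matches the exact "IP HOSTNAME" line, so an existing entry written with different spacing or extra aliases would not be detected.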
Hope this helps!
Thanks and Regards,
Gourav Shenoy
From: "Shenoy, Gourav Ganesh" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, September 16, 2016 at 3:52 PM
To: "[email protected]" <[email protected]>
Subject: Re: Spinup Mesos-Marathon Cluster for Hybrid Scheduling
Hi Shameera,
As discussed, after commenting out the “marathon” section the ansible playbooks
execute without errors. But when I try to get the master-ip using
“mesos-resolve”, I get an error:
I SSH’ed into one of the master machines and checked the status of the
mesos-master service; it seems the service is in a failed state. See the trace
below:
[centos@ip-172-30-1-39 ~]$ sudo service mesos-master status
Redirecting to /bin/systemctl status mesos-master.service
● mesos-master.service - Mesos Master
   Loaded: loaded (/usr/lib/systemd/system/mesos-master.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2016-09-16 19:46:37 UTC; 18s ago
  Process: 12608 ExecStart=/usr/bin/mesos-init-wrapper master (code=exited, status=1/FAILURE)
 Main PID: 12608 (code=exited, status=1/FAILURE)
Sep 16 19:46:37 ip-172-30-1-39 systemd[1]: Unit mesos-master.service entered failed state.
Sep 16 19:46:37 ip-172-30-1-39 systemd[1]: mesos-master.service failed.
Hope this helps debugging the problem.
Thanks and Regards,
Gourav Shenoy
From: Suresh Marru <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, September 16, 2016 at 9:30 AM
To: Airavata Dev <[email protected]>
Subject: Re: Spinup Mesos-Marathon Cluster for Hybrid Scheduling
Hi Shameera,
All of these are great directions for Airavata; thank you for pushing the
Ansible and Mesos deployments on the clouds. I think it will be better if we
get your scripts into the Airavata repo and all of us collectively work on them.
It looks like at least Pankaj and Gourav will also be able to contribute in
addition to you.
Suresh
On Sep 15, 2016, at 8:59 PM, Shenoy, Gourav Ganesh
<[email protected]> wrote:
Sure, thanks Shameera. I will try this.
Thanks and Regards,
Gourav Shenoy
From: Shameera Rathnayaka <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, September 15, 2016 at 8:55 PM
To: "[email protected]" <[email protected]>
Subject: Re: Spinup Mesos-Marathon Cluster for Hybrid Scheduling
Interesting, I am also getting the same issue. The same script worked perfectly
yesterday. I suspect an issue with the marathon RPM. With the marathon
installation removed, Mesos gets installed without any issue.
To remove the marathon installation, make the following changes to the
/roles/mesos-master/tasks/main.yml file:
1. comment out marathon in the "install mesos and marathon" task
2. comment out the last task, which starts marathon
Meanwhile, I will try to find the exact reason.
~ Shameera.
On Thu, Sep 15, 2016 at 8:32 PM Shenoy, Gourav Ganesh
<[email protected]> wrote:
Hi Shameera,
I am using the same image which you used (centos_ami_7_2: ami-6d1c2007).
Thanks and Regards,
Gourav Shenoy
From: Shameera Rathnayaka <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, September 15, 2016 at 8:26 PM
To: "[email protected]" <[email protected]>
Subject: Re: Spinup Mesos-Marathon Cluster for Hybrid Scheduling
Hi Gourav,
According to the error, something went wrong while unpacking the marathon
bundle. See:
  Installing : marathon-1.3.0-1.0.506.el7.x86_64 1/1
error: unpacking of archive failed on file /usr/bin/marathon;57daffff: cpio: read
  Verifying : marathon-1.3.0-1.0.506.el7.x86_64 1/1
Failed:
  marathon.x86_64 0:1.3.0-1.0.506.el7
Which OS image and version did you use to create the instances? I tested with
CentOS 7.2 and it works fine.
~ Shameera.
On Thu, Sep 15, 2016 at 8:14 PM Shenoy, Gourav Ganesh
<[email protected]> wrote:
Hi Shameera,
I am trying to build a mesos cluster on EC2 using your playbooks, but I am
facing some issues. Please find the details below:
Details:
- I created 4 instances on EC2 (us-east-1 region) using the
cloud-provisioning module (CloudBridge python). Of the 4, 3 were meant to
be mesos masters & 1 a slave.
Note: The instances' inbound & outbound traffic is wide open.
- I skipped step-1 & step-2 in your README, since I manually
provisioned the instances. Next, I updated the “hosts” file with the public IPs
of all 4 instances, and also updated the “roles/zookeeper/vars/main.yml” file
with the private IPs of the 3 master instances.
- I executed the “ansible-playbook -i hosts site.yml -t
"mesos-master"” command, and I got the following error:
TASK [mesos-master : install firewalld] ****************************************
ok: [52.91.152.1]
ok: [52.87.235.79]
ok: [54.167.94.186]
TASK [mesos-master : start firewalld] ******************************************
ok: [52.91.152.1]
ok: [52.87.235.79]
ok: [54.167.94.186]
TASK [mesos-master : open ports] ***********************************************
ok: [52.91.152.1] => (item=5050/tcp)
ok: [52.87.235.79] => (item=5050/tcp)
ok: [54.167.94.186] => (item=5050/tcp)
ok: [52.87.235.79] => (item=8080/tcp)
ok: [54.167.94.186] => (item=8080/tcp)
ok: [52.91.152.1] => (item=8080/tcp)
TASK [mesos-master : install utility - TODO delete this] ***********************
ok: [52.91.152.1] => (item=[u'vim'])
ok: [52.87.235.79] => (item=[u'vim'])
ok: [54.167.94.186] => (item=[u'vim'])
TASK [mesos-master : add mesosphere rpm] ***************************************
ok: [52.91.152.1]
ok: [52.87.235.79]
ok: [54.167.94.186]
TASK [mesos-master : install mesos and marathon] *******************************
failed: [52.91.152.1] (item=[u'mesos', u'marathon']) => {"changed": true,
"failed": true, "item": ["mesos", "marathon"], "msg": "Error unpacking rpm
package marathon-1.3.0-1.0.506.el7.x86_64\n", "rc": 1, "results": ["All
packages providing mesos are up to date", "Loaded plugins:
fastestmirror\nLoading mirror speeds from cached hostfile\n * base:
mirrors.tripadvisor.com\n * extras:
centos.hostingxtreme.com\n * updates:
mirrors.greenmountainaccess.net\nResolving
Dependencies\n--> Running transaction check\n---> Package marathon.x86_64
0:1.3.0-1.0.506.el7 will be installed\n--> Finished Dependency
Resolution\n\nDependencies
Resolved\n\n================================================================================\n
Package Arch Version Repository
Size\n================================================================================\nInstalling:\n
marathon x86_64 1.3.0-1.0.506.el7 mesosphere 17
M\n\nTransaction
Summary\n================================================================================\nInstall
1 Package\n\nTotal download size: 17 M\nInstalled size: 89 M\nDownloading
packages:\nRunning transaction check\nRunning transaction test\nTransaction
test succeeded\nRunning transaction\n Installing :
marathon-1.3.0-1.0.506.el7.x86_64 1/1 \nerror:
unpacking of archive failed on file /usr/bin/marathon;57daffff: cpio: read\n
Verifying : marathon-1.3.0-1.0.506.el7.x86_64 1/1
\n\nFailed:\n marathon.x86_64 0:1.3.0-1.0.506.el7
\n\nComplete!\n"]}
failed: [52.87.235.79] (item=[u'mesos', u'marathon']) => {"changed": true,
"failed": true, "item": ["mesos", "marathon"], "msg": "Error unpacking rpm
package marathon-1.3.0-1.0.506.el7.x86_64\n", "rc": 1, "results": ["All
packages providing mesos are up to date", "Loaded plugins:
fastestmirror\nLoading mirror speeds from cached hostfile\n * base:
mirrors.tripadvisor.com\n * extras:
mirrors.evowise.com\n * updates:
mirrors.greenmountainaccess.net\nResolving
Dependencies\n--> Running transaction check\n---> Package marathon.x86_64
0:1.3.0-1.0.506.el7 will be installed\n--> Finished Dependency
Resolution\n\nDependencies
Resolved\n\n================================================================================\n
Package Arch Version Repository
Size\n================================================================================\nInstalling:\n
marathon x86_64 1.3.0-1.0.506.el7 mesosphere 17
M\n\nTransaction
Summary\n================================================================================\nInstall
1 Package\n\nTotal download size: 17 M\nInstalled size: 89 M\nDownloading
packages:\nRunning transaction check\nRunning transaction test\nTransaction
test succeeded\nRunning transaction\n Installing :
marathon-1.3.0-1.0.506.el7.x86_64 1/1 \nerror:
unpacking of archive failed on file /usr/bin/marathon;57daffff: cpio: read\n
Verifying : marathon-1.3.0-1.0.506.el7.x86_64 1/1
\n\nFailed:\n marathon.x86_64 0:1.3.0-1.0.506.el7
\n\nComplete!\n"]}
failed: [54.167.94.186] (item=[u'mesos', u'marathon']) => {"changed": true,
"failed": true, "item": ["mesos", "marathon"], "msg": "Error unpacking rpm
package marathon-1.3.0-1.0.506.el7.x86_64\n", "rc": 1, "results": ["All
packages providing mesos are up to date", "Loaded plugins:
fastestmirror\nLoading mirror speeds from cached hostfile\n * base:
mirrors.tripadvisor.com\n * extras:
mirrors.evowise.com\n * updates:
mirrors.greenmountainaccess.net\nResolving
Dependencies\n--> Running transaction check\n---> Package marathon.x86_64
0:1.3.0-1.0.506.el7 will be installed\n--> Finished Dependency
Resolution\n\nDependencies
Resolved\n\n================================================================================\n
Package Arch Version Repository
Size\n================================================================================\nInstalling:\n
marathon x86_64 1.3.0-1.0.506.el7 mesosphere 17
M\n\nTransaction
Summary\n================================================================================\nInstall
1 Package\n\nTotal download size: 17 M\nInstalled size: 89 M\nDownloading
packages:\nRunning transaction check\nRunning transaction test\nTransaction
test succeeded\nRunning transaction\n Installing :
marathon-1.3.0-1.0.506.el7.x86_64 1/1 \nerror:
unpacking of archive failed on file /usr/bin/marathon;57daffff: cpio: read\n
Verifying : marathon-1.3.0-1.0.506.el7.x86_64 1/1
\n\nFailed:\n marathon.x86_64 0:1.3.0-1.0.506.el7
\n\nComplete!\n"]}
NO MORE HOSTS LEFT *************************************************************
RUNNING HANDLER [zookeeper : restart zookeeper] ********************************
[WARNING]: Could not create retry file 'site.retry'. [Errno 2] No such
file or directory: ''
PLAY RECAP *********************************************************************
52.87.235.79 : ok=17 changed=2 unreachable=0 failed=1
52.91.152.1 : ok=17 changed=2 unreachable=0 failed=1
54.167.94.186 : ok=17 changed=2 unreachable=0 failed=1
localhost : ok=1 changed=0 unreachable=0 failed=0
Is there some step that I am missing? It looks like the instances are not able
to communicate because of the firewall? This is just a wild guess. Any help
here is appreciated.
Thanks and Regards,
Gourav Shenoy
From: Shameera Rathnayaka <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, September 12, 2016 at 11:19 AM
To: dev <[email protected]>
Subject: Spinup Mesos-Marathon Cluster for Hybrid Scheduling
Hi Dev,
As part of the effort to use cloud infrastructure to run MPI and big data jobs
with Airavata, we use Apache Mesos as the resource allocation framework to
manage different types of clusters (i.e., an HPC node cluster to run MPI jobs,
and Spark and Hadoop clusters to run big data applications). I came up with an
Ansible script to spin up a Mesos cluster on a target set of nodes. You can
find the script here: https://github.com/shamrath/mesos-deployment
I am thinking of moving this code to Airavata if all agree. I would be happy to
answer any questions related to this.
Thanks,
Shameera.
--
Shameera Rathnayaka