[ https://issues.apache.org/jira/browse/MESOS-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164094#comment-14164094 ]
Killian Murphy commented on MESOS-1847:
---------------------------------------
I had the same issue.
Adding --wait 600 worked for me; adding --wait 180 did not. When I tried to ssh
into the created VM after the failure, it took about 7-8 minutes before sshd
was ready for login.
The only way I could recover was to destroy the cluster and recreate it with
the additional --wait option.
Here's the failure:
killian@nore ~/development/mesos/mesos-0.20.1/ec2: ./mesos_ec2.py -k kdefault -i ~/AWS/id_rsa-kdefault -s 1 launch k_mesos
Setting up security groups...
Checking for running cluster...
Launching instances...
Launched slaves, regid = r-87bd89ac
Launched master, regid = r-65bf8b4e
Waiting for instances to start up...
Waiting 60 more seconds...
Deploying files to master...
ssh: connect to host ec2-54-237-156-217.compute-1.amazonaws.com port 22: Connection refused
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at /SourceCache/rsync/rsync-42/rsync/io.c(452) [sender=2.6.9]
Traceback (most recent call last):
  File "./mesos_ec2.py", line 571, in <module>
    main()
  File "./mesos_ec2.py", line 480, in main
    setup_cluster(conn, master_nodes, slave_nodes, zoo_nodes, opts, True)
  File "./mesos_ec2.py", line 334, in setup_cluster
    deploy_files(conn, "deploy." + opts.os, opts, master_nodes, slave_nodes, zoo_nodes)
  File "./mesos_ec2.py", line 445, in deploy_files
    subprocess.check_call(command, shell=True)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'rsync -rv -e 'ssh -o
StrictHostKeyChecking=no -i /Users/killian/AWS/id_rsa-kdefault'
'/var/folders/8t/hp2txtm56h3byl8q5cdd33bm0000gp/T/tmp5VZqO3/'
'[email protected]:/'' returned non-zero exit
status 255
> mesos-ec2 launch: tries to rsync before ssh is available
> --------------------------------------------------------
>
> Key: MESOS-1847
> URL: https://issues.apache.org/jira/browse/MESOS-1847
> Project: Mesos
> Issue Type: Bug
> Components: ec2
> Reporter: Kevin Matzen
>
> If you don't specify a wait time that is long enough, then wait_for_cluster
> will return once the instances have launched, but ssh will not necessarily be
> available. deploy_files will execute rsync and then possibly fail. ssh
> should be tested before continuing onto the file deployment stage. It's not
> really clear to me why opts.wait is even a thing when you can simply test for
> the availability.
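
For what it's worth, the check the description asks for could look roughly
like the sketch below. This is not code from mesos_ec2.py: wait_for_ssh and
its parameters are hypothetical names, it is written for Python 3 rather than
the 2.7 shown in the traceback, and it only confirms that port 22 accepts a
TCP connection, which addresses the "Connection refused" failure above but
does not prove that key-based login already works.

```python
import socket
import time


def wait_for_ssh(host, port=22, timeout=600.0, interval=5.0):
    """Poll host:port until a TCP connection succeeds or timeout elapses.

    Hypothetical helper, not part of mesos_ec2.py. Returns True as soon as
    the port accepts a connection, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Something like this could presumably be called for the master (and each
slave) before deploy_files runs, so that a fixed --wait value would only
serve as an upper bound instead of a guess.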