Stephen M. Hopper created SPARK-2396:
----------------------------------------
Summary: Spark EC2 scripts fail when trying to log in to EC2 instances
Key: SPARK-2396
URL: https://issues.apache.org/jira/browse/SPARK-2396
Project: Spark
Issue Type: Bug
Components: EC2
Affects Versions: 1.0.0
Environment: Windows 8, Cygwin and command prompt, Python 2.7
Reporter: Stephen M. Hopper
I cannot seem to successfully start up a Spark EC2 cluster using the spark-ec2
script.
I'm using variations on the following command:
./spark-ec2 --instance-type=m1.small --region=us-west-1 --spot-price=0.05 --spark-version=1.0.0 -k my-key-name -i my-key-name.pem -s 1 launch spark-test-cluster
The script always allocates the EC2 instances without much trouble, but it never completes the SSH step that installs Spark on the cluster; it always complains about my SSH key. If I try to log in with my SSH key myself, like this:
ssh -i my-key-name.pem root@<insert ip of my instance here>
it fails. However, if I log in to the AWS console, click on my instance, and select "connect", it displays the instructions for SSHing into the instance (which are no different from the ssh command above). If I then rerun the same SSH command, I'm able to log in.
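That makes me suspect a timing issue: the instances aren't accepting SSH connections yet when the script first tries, and by the time I've clicked around the console they are. A minimal sketch of the retry loop I'd expect to need (the host name and key file are placeholders from my setup, and I'm assuming a plain OpenSSH client is on the PATH):

import subprocess
import time

def wait_for_ssh(host, key_file, retries=10, delay=30):
    """Retry a trivial SSH command until the instance accepts the key."""
    for attempt in range(retries):
        ret = subprocess.call([
            "ssh", "-i", key_file,
            "-o", "StrictHostKeyChecking=no",
            "-o", "ConnectTimeout=10",
            "root@" + host, "true",
        ])
        if ret == 0:
            return True   # sshd is up and the key was accepted
        time.sleep(delay)  # instance may still be booting; wait and retry
    return False

# e.g. wait_for_ssh("ec2-<my-clusters-ip>.us-west-1.compute.amazonaws.com",
#                   "my-key-name.pem")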
Next, if I try to rerun the spark-ec2 command from above (replacing "launch"
with "start"), the script logs in and starts installing Spark. However, it
eventually errors out with the following output:
Cloning into 'spark-ec2'...
remote: Counting objects: 1465, done.
remote: Compressing objects: 100% (697/697), done.
remote: Total 1465 (delta 485), reused 1465 (delta 485)
Receiving objects: 100% (1465/1465), 228.51 KiB | 287 KiB/s, done.
Resolving deltas: 100% (485/485), done.
Connection to ec2-<my-clusters-ip>.us-west-1.compute.amazonaws.com closed.
Searching for existing cluster spark-test-cluster...
Found 1 master(s), 1 slaves
Starting slaves...
Starting master...
Waiting for instances to start up...
Waiting 120 more seconds...
Deploying files to master...
Traceback (most recent call last):
  File "./spark_ec2.py", line 823, in <module>
    main()
  File "./spark_ec2.py", line 815, in main
    real_main()
  File "./spark_ec2.py", line 806, in real_main
    setup_cluster(conn, master_nodes, slave_nodes, opts, False)
  File "./spark_ec2.py", line 450, in setup_cluster
    deploy_files(conn, "deploy.generic", opts, master_nodes, slave_nodes, modules)
  File "./spark_ec2.py", line 593, in deploy_files
    subprocess.check_call(command)
  File "E:\windows_programs\Python27\lib\subprocess.py", line 535, in check_call
    retcode = call(*popenargs, **kwargs)
  File "E:\windows_programs\Python27\lib\subprocess.py", line 522, in call
    return Popen(*popenargs, **kwargs).wait()
  File "E:\windows_programs\Python27\lib\subprocess.py", line 710, in __init__
    errread, errwrite)
  File "E:\windows_programs\Python27\lib\subprocess.py", line 958, in _execute_child
    startupinfo)
WindowsError: [Error 2] The system cannot find the file specified
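For what it's worth, WindowsError [Error 2] from subprocess is Windows' generic "executable not found" failure: Popen couldn't locate the first element of the command it was asked to run. Since the failing call is in deploy_files, which (as far as I can tell from spark_ec2.py) shells out to rsync, my guess is that rsync isn't visible to the non-Cygwin Python. A quick sketch that reproduces the error and checks for the binary (Python 2.7, hence distutils.spawn rather than shutil.which):

import subprocess
from distutils.spawn import find_executable

# check_call raises WindowsError [Error 2] (an OSError with errno ENOENT)
# when the executable itself cannot be found on the PATH.
try:
    subprocess.check_call(["rsync", "--version"])
except OSError as e:
    print("rsync not launchable: %s" % e)

# A pre-flight check the script could make before shelling out:
if find_executable("rsync") is None:
    print("rsync is not on the PATH; install it or run under Cygwin")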
So, in short, am I missing something or is this a bug? Any help would be
appreciated.
Other notes:
- I've tried both the us-west-1 and us-east-1 regions.
- I've tried several different instance types.
- I've tried playing with the permissions on the SSH key (600, 400, etc.), to no avail.