[
https://issues.apache.org/jira/browse/SPARK-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ruofan Kong updated SPARK-10191:
--------------------------------
Description:
Using the spark-ec2 command, I created a cluster named
"ruofan-large-cluster" inside a Virtual Private Cloud (VPC) on AWS EC2. The
cluster contains one master and two slave nodes, and it works well. I would
now like to stop the cluster for a while and restart it later. However, when
I run the following bash command:
{code}
$ ./spark-ec2 --region=us-east-1 stop ruofan-large-cluster
{code}
It produced the following output:
{code}
Are you sure you want to stop the cluster ruofan-large-cluster?
DATA ON EPHEMERAL DISKS WILL BE LOST, BUT THE CLUSTER WILL KEEP USING SPACE ON
AMAZON EBS IF IT IS EBS-BACKED!!
All data on spot-instance slaves will be lost.
Stop cluster ruofan-large-cluster (y/N): y
Searching for existing cluster ruofan-large-cluster in region us-east-1...
Stopping master...
Stopping slaves...
{code}
Despite this output, it did not stop the cluster at all. I'm sure both the
cluster name and the region are correct, and I also tried the following
command to stop the cluster:
{code}
$ ./spark-ec2 -k <key-file-name> -i <key-file> -r us-east-1
--vpc-id=<my-vpc-id> --subnet-id=<my-subnet-id> stop ruofan-large-cluster
{code}
It still showed the same output and did not stop the cluster. After spending
several hours on this problem, I believe the official Spark script spark_ec2.py
may have a bug in how it identifies the cluster, which prevents it from being
stopped. I am using spark-1.4.0. In most cases spark_ec2.py works well when I
launch clusters on AWS without a VPC subnet. However, if I launch my cluster in
a subnet of a VPC, spark_ec2.py is unable to find the cluster, so I cannot stop
it. Specifically, spark_ec2.py contains this small segment of code:
{code}
conn = ec2.connect_to_region(opts.region)
{code}
Whenever we perform an action such as launch, login, stop, or destroy,
spark-ec2 first connects to the specified region using the code above, and
then fetches the matching instances with
`reservations = conn.get_all_reservations(filters={...})`. This works well
when I launch my cluster without a VPC subnet, but if the cluster is in a VPC
subnet, `conn.get_all_reservations()` returns nothing (see the sketch below).
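To make the failure mode concrete, here is a minimal, self-contained sketch of
that lookup (boto 2.x; the "-master" security-group suffix is an assumption
based on how spark-ec2 names its groups, and the region and cluster name are
placeholders):
{code}
from boto import ec2

# Connect to the region; boto resolves credentials through its default
# chain (environment variables, ~/.boto config, IAM instance profile).
conn = ec2.connect_to_region("us-east-1")

# spark-ec2 locates cluster members by security-group name. For a cluster
# named "ruofan-large-cluster", the master group would be
# "ruofan-large-cluster-master" (assumed naming convention).
reservations = conn.get_all_reservations(
    filters={"instance.group-name": "ruofan-large-cluster-master"})
instances = [inst for r in reservations for inst in r.instances]
print("Found %d master instance(s)" % len(instances))
{code}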
Just now I modified the original code to
`conn = ec2.connect_to_region(opts.region, aws_access_key_id="my_aws_access_key_id", aws_secret_access_key="my_aws_secret_access_key")`,
and everything (stop, login, destroy, etc.) works perfectly. I am wondering if
we can make a corresponding change to the Spark code.
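For reference, a sketch of the workaround described above (boto 2.x; the
credential strings are placeholders, and hard-coding keys is only reasonable
for local debugging):
{code}
from boto import ec2

# Workaround: pass credentials explicitly instead of relying on boto's
# default resolution chain. Placeholder values; never commit real keys.
conn = ec2.connect_to_region(
    "us-east-1",
    aws_access_key_id="my_aws_access_key_id",
    aws_secret_access_key="my_aws_secret_access_key")
{code}
Note that spark-ec2 normally reads credentials from the AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY environment variables, so the difference in behavior
suggests that the default credential resolution, rather than the VPC itself,
may be where the lookup goes wrong.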
> spark-ec2 cannot stop running cluster
> -------------------------------------
>
> Key: SPARK-10191
> URL: https://issues.apache.org/jira/browse/SPARK-10191
> Project: Spark
> Issue Type: Bug
> Components: EC2
> Environment: AWS EC2
> Reporter: Ruofan Kong
>