[ https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517301#comment-14517301 ]

Nicholas Chammas commented on SPARK-5189:
-----------------------------------------

Yeah, as Sean said, you can just start working on this whenever you want. Just 
let us know in a comment here so that others know someone is already working 
on it.

This issue is still relevant, but unfortunately solving it requires 
redesigning the whole of spark-ec2 so that nodes can be provisioned in 
parallel. That means changing the Bash scripts in the mesos/spark-ec2 repo to 
act on one node at a time, and changing the main spark-ec2 script itself to be 
multi-threaded (or otherwise asynchronous) so it can manage several nodes in 
parallel.
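
To make that concrete, here is a minimal sketch of what the driver side could 
look like. Everything here is illustrative: {{provision_node}}, 
{{provision_cluster}}, and {{provision-spark-node.sh}} don't exist today, and 
{{concurrent.futures}} would need the {{futures}} backport since 
{{spark_ec2.py}} currently targets Python 2.

{code:python}
# Sketch only -- all names below are hypothetical, not current spark-ec2 code.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

def provision_node(host, identity_file):
    # Run the per-node setup script on the node itself, instead of
    # rsync-ing files out from the master.
    subprocess.check_call([
        "ssh", "-i", identity_file, "-o", "StrictHostKeyChecking=no",
        "root@" + host, "bash", "provision-spark-node.sh",
    ])

def provision_cluster(hosts, identity_file, max_workers=16):
    # Provision all nodes concurrently so that launch time stops growing
    # with cluster size.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(provision_node, h, identity_file) for h in hosts]
        for f in as_completed(futures):
            f.result()  # surface any per-node failure
{code}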

It's probably a major effort, but you can definitely take it on if you are 
interested.

> Reorganize EC2 scripts so that nodes can be provisioned independent of Spark 
> master
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-5189
>                 URL: https://issues.apache.org/jira/browse/SPARK-5189
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2
>            Reporter: Nicholas Chammas
>
> As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
> then setting up all the slaves together. This includes broadcasting files 
> from the lonely master to potentially hundreds of slaves.
> There are two main problems with this approach:
> # Broadcasting files from the master to all slaves using 
> [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
> (e.g. during [ephemeral-hdfs 
> init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
>  or during [Spark 
> setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
>  takes a long time. This time increases as the number of slaves increases.
>  I did some testing in {{us-east-1}}. This is, concretely, what the problem 
> looks like:
>  || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
> | 1 | 8m 44s |
> | 10 | 13m 45s |
> | 25 | 22m 50s |
> | 50 | 37m 30s |
> | 75 | 51m 30s |
> | 99 | 1h 5m 30s |
>  Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, 
> but I think the point is clear enough.
> # It's more complicated to add slaves to an existing cluster (a la 
> [SPARK-2008]), since slaves are only configured through the master during the 
> setup of the master itself.
> Logically, the operations we want to implement are:
> * Provision a Spark node
> * Join a node to a cluster (including an empty cluster) as either a master or 
> a slave
> * Remove a node from a cluster
> We need our scripts to be organized to roughly match the above operations. 
> The goals would be:
> # When launching a cluster, enable all cluster nodes to be provisioned in 
> parallel, removing the master-to-slave file broadcast bottleneck.
> # Facilitate cluster modifications like adding or removing nodes.
> # Enable exploration of infrastructure tools like 
> [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
> internals and perhaps even allow us to build [one tool that launches Spark 
> clusters on several different cloud 
> platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].
> More concretely, the modifications we need to make are:
> * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
> equivalent, slave-side operations.
> * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure 
> it fully creates a node that can be used as either a master or slave.
> * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, 
> configures it as a master or slave, and joins it to a cluster (see the sketch 
> after this list).
> * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete 
> that script.
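>
> As a rough sketch of the last two bullets from {{spark_ec2.py}}'s side: the 
> script name {{join-to-cluster.sh}} is the one proposed above, but its flags 
> and this helper are assumptions for illustration, not an existing interface.
> {code:python}
> # Illustrative only -- join-to-cluster.sh and its flags are proposals here.
> import subprocess
>
> def join_to_cluster(host, master_host, role, identity_file):
>     # role is "master" or "slave"; the node is assumed to have been
>     # provisioned already by provision-spark-node.sh.
>     subprocess.check_call([
>         "ssh", "-i", identity_file, "root@" + host,
>         "bash", "join-to-cluster.sh",
>         "--role", role, "--master", master_host,
>     ])
> {code}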


