[ https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516854#comment-14516854 ]
Sean Owen commented on SPARK-5189:
----------------------------------

[~jackli066519] You don't need to have this assigned to you, but I would work with [~nchammas] first to understand whether this is still relevant and what he's done.

> Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-5189
>                 URL: https://issues.apache.org/jira/browse/SPARK-5189
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2
>            Reporter: Nicholas Chammas
>
> As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, then setting up all the slaves together. This includes broadcasting files from the lonely master to potentially hundreds of slaves.
> There are 2 main problems with this approach:
> # Broadcasting files from the master to all slaves using [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] (e.g. during [ephemeral-hdfs init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36], or during [Spark setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3]) takes a long time, and that time grows as the number of slaves grows. I did some testing in {{us-east-1}}. Concretely, this is what the problem looks like:
> || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
> | 1 | 8m 44s |
> | 10 | 13m 45s |
> | 25 | 22m 50s |
> | 50 | 37m 30s |
> | 75 | 51m 30s |
> | 99 | 1h 5m 30s |
> Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but I think the point is clear enough.
> # It's more complicated to add slaves to an existing cluster (a la [SPARK-2008]), since slaves are only configured through the master, during the setup of the master itself.
> Logically, the operations we want to implement are:
> * Provision a Spark node
> * Join a node to a cluster (including an empty cluster) as either a master or a slave
> * Remove a node from a cluster
> We need our scripts to be organized to roughly match the above operations. The goals would be:
> # When launching a cluster, enable all cluster nodes to be provisioned in parallel, removing the master-to-slave file broadcast bottleneck.
> # Facilitate cluster modifications like adding or removing nodes.
> # Enable exploration of infrastructure tools like [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} internals and perhaps even allow us to build [one tool that launches Spark clusters on several different cloud platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].
> More concretely, the modifications we need to make are (a sketch of the resulting workflow follows this description):
> * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with equivalent, slave-side operations.
> * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it fully creates a node that can be used as either a master or a slave.
> * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, configures it as a master or slave, and joins it to a cluster.
> * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete that script.
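To make the parallel-provisioning goal concrete, here is a minimal Python sketch of how the driver side in {{spark_ec2.py}} could orchestrate it. It assumes the {{provision-spark-node.sh}} and {{join-to-cluster.sh}} scripts proposed above; the S3 URL, the {{--role}}/{{--master}} flags, and the helper function names are illustrative placeholders, not an existing {{spark-ec2}} API.

{code:python}
# Illustrative sketch only: the two script names come from this issue's
# proposal; the URL, flags, and function names are placeholders rather
# than real spark-ec2 APIs.
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_remote(host, command, identity_file, user="ec2-user"):
    """Run a shell command on one node over SSH, raising on failure."""
    subprocess.check_call([
        "ssh", "-i", identity_file,
        "-o", "StrictHostKeyChecking=no",
        "%s@%s" % (user, host),
        command,
    ])


def provision_node(host, identity_file):
    # Each node pulls its own setup script (placeholder URL) instead of
    # waiting for the master to rsync files to it via copy-dir.
    run_remote(
        host,
        "wget -qO- https://example-bucket.s3.amazonaws.com/provision-spark-node.sh | bash",
        identity_file,
    )


def launch_cluster(master, slaves, identity_file):
    hosts = [master] + slaves
    # Provision every node, master included, concurrently: total launch
    # time is bounded by the slowest single node, not by the number of
    # slaves, unlike the timings in the table above.
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        for future in [pool.submit(provision_node, h, identity_file)
                       for h in hosts]:
            future.result()  # re-raise any provisioning failure

    # Once everything is provisioned, assign roles and form the cluster.
    run_remote(master, "./join-to-cluster.sh --role master", identity_file)
    with ThreadPoolExecutor(max_workers=max(len(slaves), 1)) as pool:
        joins = [
            pool.submit(run_remote, s,
                        "./join-to-cluster.sh --role slave --master %s" % master,
                        identity_file)
            for s in slaves
        ]
        for future in joins:
            future.result()
{code}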
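Under the same assumptions, cluster modification becomes a pair of per-node operations rather than a master-driven re-setup, which is roughly what [SPARK-2008] needs. The {{leave-cluster.sh}} name and its behavior are hypothetical here:

{code:python}
def add_slave(new_host, master, identity_file):
    # Same two steps as the initial launch; no master-side rsync needed.
    provision_node(new_host, identity_file)
    run_remote(new_host,
               "./join-to-cluster.sh --role slave --master %s" % master,
               identity_file)


def remove_slave(host, identity_file):
    # Hypothetical script: a real one would also decommission the node's
    # HDFS data before removing it from the cluster.
    run_remote(host, "./leave-cluster.sh", identity_file)
{code}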