[ https://issues.apache.org/jira/browse/SPARK-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304412#comment-14304412 ]

Florian Verhein commented on SPARK-5552:
----------------------------------------

Thanks [~sowen]. 

So it wouldn't fit in the Spark repo itself (the only change there would be to 
add an option in spark_ec2.py to use an alternate spark-ec2 repo/branch; a 
sketch of this follows the list below). It would naturally live in spark-ec2, 
as it involves changes to spark-ec2 for both use cases:
- Image creation is based on the work soon to be added to spark-ec2 for this: 
https://issues.apache.org/jira/browse/SPARK-3821
- Cluster deployment and configuration are done using the spark-ec2 scripts 
themselves (but with many modifications/fixes).
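
For illustration, here is a minimal sketch of what the spark_ec2.py option 
could look like. The option names and defaults are my assumptions, not a 
settled interface; spark_ec2.py builds its CLI with optparse:

{code}
# Sketch only: option names/defaults are illustrative, not final.
from optparse import OptionParser

parser = OptionParser(usage="spark-ec2 [options] <action> <cluster_name>")
parser.add_option(
    "--spark-ec2-git-repo", default="https://github.com/mesos/spark-ec2",
    help="GitHub repo from which to check out the spark-ec2 scripts")
parser.add_option(
    "--spark-ec2-git-branch", default="master",
    help="Branch of --spark-ec2-git-repo to use")
(opts, args) = parser.parse_args()

# When bootstrapping the master, the clone would then honour these options:
clone_cmd = "git clone -b %s %s spark-ec2" % (
    opts.spark_ec2_git_branch, opts.spark_ec2_git_repo)
print(clone_cmd)
{code}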

Since there is a dependency between the image and the configuration scripts 
(init.sh and setup.sh), it's not possible to solve this with just an AMI.
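
To make that coupling concrete, one hypothetical way to enforce it (the marker 
file path and version set below are assumptions of mine, not part of the 
current scripts): the packer build bakes a version marker into the AMI, and 
the setup side checks it before configuring anything.

{code}
# Hypothetical sketch: all names/paths are assumptions. The packer build would
# write a marker file into the AMI; setup would refuse to configure an image
# that these scripts weren't built for.
SUPPORTED_IMAGE_VERSIONS = {"0.1", "0.2"}

def check_image_compatibility(path="/root/spark-ec2-image-version"):
    with open(path) as f:
        image_version = f.read().strip()
    if image_version not in SUPPORTED_IMAGE_VERSIONS:
        raise RuntimeError("AMI version %s is not supported by these "
                           "setup scripts" % image_version)
{code}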

The extra components (actually just Vowpal Wabbit and more Python libraries; 
the rest already exists in the spark-ec2 AMI) are added to the image purely 
for data science convenience.


> Automated data science AMI creation and data science cluster deployment on EC2
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-5552
>                 URL: https://issues.apache.org/jira/browse/SPARK-5552
>             Project: Spark
>          Issue Type: New Feature
>          Components: EC2
>            Reporter: Florian Verhein
>
> Issue created RE: 
> https://github.com/mesos/spark-ec2/pull/90#issuecomment-72597154 (please read 
> for background)
> Goal:
> Extend the spark-ec2 scripts to provide automated data science cluster 
> deployment on EC2, suitable for almost(?)-production use.
> Use cases: 
> - A user can build their own custom data science AMIs from a CentOS minimal 
> image by invoking a packer configuration (good defaults should be provided, 
> plus some options for flexibility)
> - A user can then easily deploy a new (correctly configured) cluster using 
> these AMIs, and do so as quickly as possible.
> Components/modules: Spark + Tachyon + HDFS (on instance storage) + Python + R 
> + Vowpal Wabbit + any RPMs + ... + Ganglia
> The focus is on reliability (rather than, e.g., supporting many versions or 
> dev testing) and speed of deployment.
> Hadoop 2 is used so there is the option to lift into YARN later.
> My current solution is here: 
> https://github.com/florianverhein/spark-ec2/tree/packer. It includes other 
> fixes/improvements as needed to get it working.
> Now that it seems to work (though it has deviated a lot more from the 
> existing code base than I was expecting), I'm wondering what to do with it...
> Keen to hear ideas if anyone is interested. 


