[ https://issues.apache.org/jira/browse/SPARK-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306530#comment-14306530 ]

Florian Verhein commented on SPARK-5552:
----------------------------------------

Just updated the two READMEs to match the code. 

Some difficulties I see in merging this work:
* I built this because I needed it quickly, so it isn't easy to split into 
bug-fix / improvement chunks for spark-ec2 PRs (off the top of my head: fixes 
for SPARK-3185 and the other version-incompatibility bugs I found, using the 
Oracle JDK instead of OpenJDK, making Ganglia work, allowing Spark to 
communicate with Tachyon (see the config sketch after this list), fixing local 
disk mounting, fixing some other config issues, allowing modules to be 
installed on the image, etc...)
* It uses CentOS minimal, whereas spark-ec2 currently uses Amazon Linux. Hence 
there are a few extra CentOS-specific steps.
* Some possible roadmap differences (???):
    1. Here, the intention is that the user builds and owns their own AMI 
using the supplied scripts (see the Packer sketch after this list), whereas I 
believe spark-ec2 currently aims to supply the AMIs.
    2. Here, the focus is on getting version dependencies and configuration 
right, which naturally limits what can be supported (better for 
use/deployment), whereas it looks like spark-ec2 aims to be more flexible 
(better for dev/testing).  
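
For reference, the Spark <-> Tachyon item in the first bullet mostly comes 
down to a couple of config entries plus getting versions to match. A minimal 
sketch of the kind of settings involved (key names are the Spark 1.x-era ones 
and the hostname is a placeholder, so treat this as illustrative rather than 
a copy from my branch):

    # spark/conf/spark-defaults.conf -- point Spark's OFF_HEAP storage at Tachyon
    spark.tachyonStore.url    tachyon://<master-hostname>:19998

    # hadoop/conf/core-site.xml -- lets tachyon:// paths resolve via the Hadoop FS API
    <property>
      <name>fs.tachyon.impl</name>
      <value>tachyon.hadoop.TFS</value>
    </property>

The fiddly part is less these entries themselves than making sure the Tachyon 
client jar on Spark's classpath is built against the same Hadoop version as 
the rest of the stack, which is exactly the sort of version dependency this 
work tries to pin down.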
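
Similarly, to make roadmap point 1 concrete: building your own AMI is 
essentially a single packer invocation against a template. A rough sketch 
(the file name, script names, and source AMI id below are illustrative, not 
the actual ones in my branch):

    $ packer build spark-ami.json

    # spark-ami.json (abridged; AWS credentials come from the environment)
    {
      "builders": [{
        "type": "amazon-ebs",
        "region": "us-east-1",
        "source_ami": "<CentOS 6 minimal AMI id>",
        "instance_type": "m3.large",
        "ssh_username": "root",
        "ami_name": "spark-datascience-{{timestamp}}"
      }],
      "provisioners": [{
        "type": "shell",
        "scripts": ["setup/java.sh", "setup/spark.sh", "setup/tachyon.sh",
                    "setup/ganglia.sh"]
      }]
    }

The user ends up owning (and being able to audit) the resulting AMI, which is 
the main difference from the supply-the-AMIs model.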

Options (???):
* Include it as a separate branch in spark-ec2 if people feel it's sufficiently 
aligned, and perhaps merge it more closely down the track
* Create a new repo
* ??

Thoughts?


> Automated data science AMI creation and data science cluster deployment on EC2
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-5552
>                 URL: https://issues.apache.org/jira/browse/SPARK-5552
>             Project: Spark
>          Issue Type: New Feature
>          Components: EC2
>            Reporter: Florian Verhein
>
> Issue created RE: 
> https://github.com/mesos/spark-ec2/pull/90#issuecomment-72597154 (please read 
> for background)
> Goal:
> Extend spark-ec2 scripts to create an automated data science cluster 
> deployment on EC2, suitable for almost(?)-production use.
> Use cases: 
> - A user can build their own custom data science AMIs from a CentOS minimal 
> image by invoking a Packer configuration (good defaults should be provided, 
> with some options for flexibility)
> - A user can then easily deploy a new (correctly configured) cluster using 
> these AMIs, and do so as quickly as possible.
> Components/modules: Spark + Tachyon + HDFS (on instance storage) + Python + R 
> + Vowpal Wabbit + any RPMs + ... + Ganglia
> The focus is on reliability (rather than, e.g., supporting many versions / 
> dev testing) and on speed of deployment.
> Use Hadoop 2 so there is the option to lift into YARN later.
> My current solution is here: 
> https://github.com/florianverhein/spark-ec2/tree/packer. It includes other 
> fixes/improvements as needed to get it working.
> Now that it seems to work (but has deviated a lot more from the existing code 
> base than I was expecting), I'm wondering what to do with it...
> Keen to hear ideas if anyone is interested. 


