[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276263#comment-14276263 ]

Florian Verhein commented on SPARK-3821:
----------------------------------------

This is great stuff! It'll also serve as documentation of the AMI 
requirements for using the spark-ec2 scripts.

Re the above, I think everything in create_image.sh can be refactored into 
Packer (plus removal of duplicated steps, e.g. enabling root login). I've 
attempted this in a fork of [~nchammas]'s work, but my use case is a bit 
different: I need to start from a fresh CentOS 6 minimal image (rather than 
an Amazon Linux AMI) and then add other things on top.
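
For reference, a minimal sketch of what that refactor might look like as a 
Packer template. The source_ami, region, instance_type, and ami_name values 
are placeholders, and create_image.sh here just stands in for whatever the 
provisioning script ends up being called:

    {
      "builders": [{
        "type": "amazon-ebs",
        "region": "us-east-1",
        "source_ami": "ami-xxxxxxxx",
        "instance_type": "m3.large",
        "ssh_username": "root",
        "ami_name": "spark-centos6-{{timestamp}}"
      }],
      "provisioners": [{
        "type": "shell",
        "script": "create_image.sh"
      }]
    }

The nice part is that switching the base image (Amazon Linux vs CentOS 6 
minimal) should only mean changing source_ami and ssh_username, while the 
provisioning script stays shared.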

Possibly related to AMI generation in general: I've noticed that the version 
dependencies in the spark-ec2 scripts are broken. I suspect this will need to 
be handled in both the image and the setup. For example:
- It looks like Spark needs to be built with the right Hadoop profile to 
work, but the scripts don't enforce this. It applies whether Spark is built 
from a git checkout or taken from an existing build, and is likely the case 
for Tachyon too. Probably the cause of 
https://issues.apache.org/jira/browse/SPARK-3185 (see the build sketch after 
this list).
- The Hadoop native libs are built on the image against 2.4.1, but are then 
copied into whatever Hadoop build the ephemeral-hdfs and persistent-hdfs 
scripts download. I suspect that could cause issues too. Since building 
Hadoop is very time consuming, it's something you'd want on the image, hence 
creating a version dependency (the checknative step in the sketch below is 
one way to catch a mismatch).
- The version dependencies for other things like Ganglia aren't documented 
(I believe Ganglia is installed on the image but then duplicated again in 
spark-ec2/ganglia). I've found that the Ganglia config doesn't work for me 
(though recall I'm using a different base AMI, so I likely get a different 
Ganglia version). I have a sneaking suspicion that the Hadoop configs in 
spark-ec2 won't work across Hadoop versions either (but, fingers crossed!).
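
To make the first two points concrete, here's a rough sketch of how the 
image build could pin everything to one Hadoop version (2.4.1, matching the 
native libs) and then verify the native libs actually load. This assumes the 
Spark 1.x make-distribution.sh script and Hadoop's checknative command, and 
the /root/ephemeral-hdfs path is per the spark-ec2 layout, if I'm reading it 
right:

    # Build Spark against the same Hadoop version the native libs were
    # built against.
    ./make-distribution.sh --name hadoop2.4 --tgz \
        -Phadoop-2.4 -Dhadoop.version=2.4.1

    # Verify the native libs load in the Hadoop build that HDFS will
    # actually run on; a mismatch typically shows up as "false" entries.
    /root/ephemeral-hdfs/bin/hadoop checknative -a

The point is just that the Hadoop version becomes a single parameter 
threaded through both the image build and the setup scripts, rather than two 
independently chosen versions.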

Re the above, I might try keeping the entire Hadoop build (from the image 
creation) around for the hdfs setup.
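
A rough sketch of what I mean, so the setup scripts can skip the download 
entirely (paths are hypothetical; /root/ephemeral-hdfs is where the current 
scripts unpack their download, as far as I can tell):

    # Keep the full Hadoop build from image creation, and let the hdfs
    # setup reuse it instead of downloading a possibly-different version
    # at cluster launch.
    mv hadoop-2.4.1 /root/hadoop-2.4.1
    ln -s /root/hadoop-2.4.1 /root/ephemeral-hdfs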

Sorry for the sidetrack, but I'm struggling through all this, so I'm hoping 
it rings a bell for someone.

p.s. With the image automation, it might also be worth offering the option 
of putting more on the image (especially for people happy to build their own 
AMIs). For example, I see no reason why the module init.sh scripts couldn't 
be run from Packer to speed up cluster start-up times :)
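
As a rough sketch, a Packer shell provisioner could run something like this 
at bake time (the module list is illustrative; the current setup.sh sources 
these per-module init scripts at cluster launch, if I'm reading it 
correctly):

    # Run the spark-ec2 module init scripts while baking the image instead
    # of at cluster start-up, so launches only need per-cluster config.
    cd /root/spark-ec2
    for module in spark ephemeral-hdfs persistent-hdfs ganglia; do
        [ -e "$module/init.sh" ] && source "$module/init.sh"
    done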


> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-3821
>                 URL: https://issues.apache.org/jira/browse/SPARK-3821
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build, EC2
>            Reporter: Nicholas Chammas
>            Assignee: Nicholas Chammas
>         Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.


