[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276263#comment-14276263 ]
Florian Verhein commented on SPARK-3821:
----------------------------------------

This is great stuff! It'll also help serve as documentation of the AMI requirements when using the spark-ec2 scripts.

Re the above, I think everything in create_image.sh can be refactored into Packer (plus duplicate removal, e.g. root login). I've attempted to do this in a fork of [~nchammas]'s work, but my use case is a bit different in that I need to start from a fresh CentOS 6 minimal image (rather than an Amazon Linux AMI) and then add other things.

Possibly related to AMI generation in general: I've noticed that the version dependencies in the spark-ec2 scripts are broken. I suspect this will need to be handled in both the image and the setup. For example:

- It looks like Spark needs to be built with the right Hadoop profile to work, but this isn't adhered to. This applies whether Spark is built from a git checkout or from an existing build, and is likely the case with Tachyon too. Probably the cause of https://issues.apache.org/jira/browse/SPARK-3185
- The Hadoop native libs are built on the image using 2.4.1, but then copied into whatever Hadoop build is downloaded by the ephemeral-hdfs and persistent-hdfs scripts. I suspect that could cause issues too. Since building Hadoop is very time consuming, it's something you'd want on the image - hence creating a dependency.
- The version dependencies for other things like Ganglia aren't documented (I believe it is installed on the image but duplicated again in spark-ec2/ganglia). I've found that the Ganglia config doesn't work for me (but recall I'm using a different base AMI, so I'll likely get a different Ganglia version).

I have a sneaking suspicion that the Hadoop configs in spark-ec2 won't work across Hadoop versions either (but, fingers crossed!). Re the above, I might try keeping the entire Hadoop build (from the image creation) for the HDFS setup.
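For illustration, the create_image.sh steps could be folded into a Packer template along these lines. This is only a minimal sketch: the source_ami, region, instance type, and provisioner script names are placeholders, not values taken from the actual spark-ec2 scripts.

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "m3.medium",
    "ssh_username": "root",
    "ami_name": "spark-ec2-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "scripts": [
      "setup_root_login.sh",
      "install_packages.sh",
      "build_hadoop_native.sh"
    ]
  }]
}
```

Running `packer build template.json` would then produce the AMI, and a docker builder could sit alongside amazon-ebs in the same template so that both image types are produced from one set of provisioning scripts - which is exactly the multi-image goal of this issue.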
Sorry for the sidetrack, but I'm struggling through all this, so I'm hoping it might ring a bell for someone.

p.s. With the image automation, it might also be worth considering putting more on the image as an option (especially for people happy to build their own AMIs). For example, I see no reason why the module init.sh scripts can't be run from Packer in order to speed up start-up times of the cluster :)

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-3821
>                 URL: https://issues.apache.org/jira/browse/SPARK-3821
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build, EC2
>            Reporter: Nicholas Chammas
>            Assignee: Nicholas Chammas
>         Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually.
> With tools like [Packer|http://www.packer.io/], we should be able to automate
> this work, and do so in such a way that multiple types of machine images can
> be created from a single template.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)