[ https://issues.apache.org/jira/browse/HADOOP-14898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748677#comment-16748677 ]
Elek, Marton commented on HADOOP-14898:
---------------------------------------
Created a follow-up issue, HADOOP-16063. Please take a look if you are interested.
> Create official Docker images for development and testing features
> -------------------------------------------------------------------
>
> Key: HADOOP-14898
> URL: https://issues.apache.org/jira/browse/HADOOP-14898
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Elek, Marton
> Assignee: Elek, Marton
> Priority: Major
> Fix For: 3.1.0
>
> Attachments: HADOOP-14898.001.tar.gz, HADOOP-14898.002.tar.gz,
> HADOOP-14898.003.tgz, docker_design.pdf
>
>
> This is the original mail from the mailing list:
> {code}
> TL;DR: I propose to create official Hadoop images and upload them to
> Docker Hub.
> GOAL/SCOPE: I would like to improve the existing documentation with easy-to-use,
> Docker-based recipes to start Hadoop clusters with various configurations.
> The images could also be used to test experimental features. For example,
> Ozone could be tested easily with this compose file and configuration:
> https://gist.github.com/elek/1676a97b98f4ba561c9f51fce2ab2ea6
> Or the configuration could even be included in the compose file itself:
> https://github.com/elek/hadoop/blob/docker-2.8.0/example/docker-compose.yaml
> I would like to create separate example compose files for federation, HA,
> metrics usage, etc. to make it easier to try out and understand these features.
> CONTEXT: There is an existing Jira
> https://issues.apache.org/jira/browse/HADOOP-13397
> But that issue is about a tool to generate production-quality Docker images
> (multiple types, in a flexible way). If there are no objections, I will create a
> separate issue to create simplified Docker images for rapid prototyping and
> investigating new features, and register the branch on Docker Hub to build the
> images automatically.
> MY BACKGROUND: I have been working with Docker-based Hadoop/Spark clusters for
> quite a while and have run them successfully in different environments (Kubernetes,
> Docker Swarm, Nomad-based scheduling, etc.). My work is available from here:
> https://github.com/flokkr but those images handle more complex use cases (e.g.
> instrumenting Java processes with btrace, or reading/reloading configuration from
> Consul).
> And IMHO it's better if the official Hadoop documentation suggests using official
> Apache Docker images rather than external ones (which could change).
> {code}
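To make the "configuration included in the compose file" idea above concrete, a sketch might look like the following; the image name and the environment-variable-to-XML configuration convention are assumptions for illustration (the flokkr-style images use a similar trick), not the actual published setup:

```yaml
# Illustrative sketch: configuration embedded directly in the compose file.
# Image name and the env-var naming convention (prefix selects the target
# config file, suffix the property key) are assumptions.
version: "3"
services:
  namenode:
    image: apache/hadoop:2.7.3
    command: ["hdfs", "namenode"]
    environment:
      - CORE-SITE.XML_fs.defaultFS=hdfs://namenode:9000
  datanode:
    image: apache/hadoop:2.7.3
    command: ["hdfs", "datanode"]
    environment:
      - CORE-SITE.XML_fs.defaultFS=hdfs://namenode:9000
```

With everything in one file, `docker-compose up` is the entire getting-started recipe; no separate config directory has to be downloaded.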
> The following list enumerates the key decision points regarding Docker
> image creation.
> A. Automated Docker Hub build / Jenkins build
> Docker images could be built on Docker Hub (a branch pattern and the location of
> the Dockerfiles would be defined for a GitHub repository), or they could be built
> on a CI server and pushed.
> The second option is more flexible (it's easier to create a matrix build, for
> example).
> The first one has the advantage that we get an additional flag on
> Docker Hub indicating that the build is automated (and built from source by
> Docker Hub).
> The decision is easy, as the ASF supports the first approach (see
> https://issues.apache.org/jira/browse/INFRA-12781?focusedCommentId=15824096&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15824096)
> B. Source: binary distribution or source build
> The second question is about how to create the Docker image. One option is to
> build the software on the fly during the creation of the Docker image; the
> other is to use the binary releases.
> I suggest using the second approach, as:
> 1. In that case hadoop:2.7.3 could contain exactly the same Hadoop
> distribution as the downloadable one.
> 2. We don't need to add development tools to the image, so the image can be
> smaller (which is important, as the goal of this image is getting
> started as fast as possible).
> 3. The Dockerfile will be simpler (and easier to maintain).
> This approach is usually used in other projects as well (I checked Apache Zeppelin
> and Apache Nutch).
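As a minimal sketch of the binary-based approach described above (the base image, version, and mirror URL are illustrative assumptions, not a settled layout):

```dockerfile
# Sketch only: base image, version, and download URL are illustrative.
FROM openjdk:8-jdk-slim

ENV HADOOP_VERSION=2.7.3

# Download the official binary release instead of building from source,
# so the image contains exactly the same distribution as the tarball.
RUN apt-get update && apt-get install -y --no-install-recommends curl \
 && curl -fSL "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" \
      | tar -xz -C /opt \
 && ln -s /opt/hadoop-${HADOOP_VERSION} /opt/hadoop \
 && apt-get purge -y curl && rm -rf /var/lib/apt/lists/*

ENV HADOOP_HOME=/opt/hadoop
ENV PATH=$PATH:/opt/hadoop/bin
```

Because no compiler or build toolchain is installed, the resulting image stays small and the Dockerfile stays easy to maintain, which is the point of items 2 and 3 above.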
> C. Branch usage
> Another question is the location of the Dockerfile. It could live on the
> official source-code branches (branch-2, trunk, etc.), or we could create
> separate branches for Docker Hub (e.g. docker/2.7, docker/2.8, docker/3.0).
> With the first approach it's easier to find the Docker images, but it's less
> flexible. For example, if we had a Dockerfile in the source code, it would have
> to be used for every release (e.g. the Dockerfile from the tag
> release-3.0.0 would be used for the 3.0 Hadoop Docker image). In that case
> the release process becomes much harder: in case of a Dockerfile error (which
> could be tested on Docker Hub only after tagging), a new release would have to
> be cut after fixing the Dockerfile.
> Another problem is that with tags it's not possible to improve the
> Dockerfiles over time. I can imagine that we would like to improve, for example,
> the hadoop:2.7 images (say, by adding smarter startup scripts) while using
> exactly the same Hadoop 2.7 distribution.
> Finally, with a tag-based approach we can't create images for the older releases
> (2.8.1, for example).
> So I suggest creating separate branches for the Dockerfiles.
> D. Versions
> We could create a separate branch for every version (2.7.1/2.7.2/2.7.3) or
> just for each main version (2.8/2.7). As these Docker images are not for
> production but for prototyping, I suggest using (at least as a first step)
> just 2.7/2.8 and updating the images with each bugfix release.
> E. Number of images
> There are two options here, too: create a separate image for every component
> (namenode, datanode, etc.), or just one image, with the command defined
> manually for each container. The second seems more complex (to use), but I
> think the maintenance is easier, and it's more visible what is being started.
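The single-image option could look roughly like the following compose sketch, with one shared image and the component command spelled out per service (the image name and port are illustrative assumptions):

```yaml
# Sketch of option two: one shared image, per-service commands.
# The image name (apache/hadoop) is an illustrative assumption.
version: "3"
services:
  namenode:
    image: apache/hadoop:2.7.3
    command: ["hdfs", "namenode"]
    ports:
      - "50070:50070"   # NameNode web UI (default port in Hadoop 2.x)
  datanode:
    image: apache/hadoop:2.7.3
    command: ["hdfs", "datanode"]
```

Only one image has to be rebuilt and published per release, and the compose file makes explicit which daemon each container runs.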
> F. Snapshots
> According to the spirit of the release policy:
> https://www.apache.org/dev/release-distribution.html#unreleased
> we should distribute only final releases to Docker Hub, not snapshots.
> But we could also create an empty hadoop-runner image, which contains the
> starter scripts but not Hadoop itself. It would be used for local development,
> where a freshly built distribution could be mapped into the image with Docker
> volumes.
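The hadoop-runner idea might be used from a local checkout roughly like this (image name, mount path, and the location of the built distribution are illustrative assumptions):

```yaml
# Sketch: hadoop-runner contains only starter scripts, no Hadoop release,
# so no snapshot artifacts are ever published. The locally built
# distribution is mounted in at runtime (all paths/names illustrative).
version: "3"
services:
  namenode:
    image: apache/hadoop-runner
    command: ["hdfs", "namenode"]
    volumes:
      - ../hadoop-dist/target/hadoop-3.1.0-SNAPSHOT:/opt/hadoop
```

This keeps snapshot binaries out of Docker Hub entirely while still letting developers exercise a freshly built tree in containers.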
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]