Elek, Marton updated HADOOP-14898:
    Status: Patch Available  (was: In Progress)

Moving it to patch-available state as all the sub-issues are in patch-available.

> Create official Docker images for development and testing features 
> -------------------------------------------------------------------
>                 Key: HADOOP-14898
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14898
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Major
>         Attachments: HADOOP-14898.001.tar.gz, HADOOP-14898.002.tar.gz, 
> HADOOP-14898.003.tgz, docker_design.pdf
> This is the original mail from the mailing list:
> {code}
> TL;DR: I propose to create official hadoop images and upload them to the 
> dockerhub.
> GOAL/SCOPE: I would like to improve the existing documentation with easy-to-use 
> docker based recipes to start hadoop clusters with various configurations.
> The images also could be used to test experimental features. For example 
> ozone could be tested easily with this compose file and configuration:
> https://gist.github.com/elek/1676a97b98f4ba561c9f51fce2ab2ea6
> Or even the configuration could be included in the compose file:
> https://github.com/elek/hadoop/blob/docker-2.8.0/example/docker-compose.yaml
> I would like to create separate example compose files for federation, ha, 
> metrics usage, etc. to make it easier to try out and understand the features.
> CONTEXT: There is an existing Jira 
> https://issues.apache.org/jira/browse/HADOOP-13397
> But it’s about a tool to generate production quality docker images (multiple 
> types, in a flexible way). If there are no objections, I will create a separate 
> issue to create simplified docker images for rapid prototyping and investigating 
> new features, and register the branch on the dockerhub to create the images 
> automatically.
> MY BACKGROUND: I have been working with docker based hadoop/spark clusters for 
> quite a while and run them successfully in different environments (kubernetes, 
> docker-swarm, nomad-based scheduling, etc.) My work is available from here: 
> https://github.com/flokkr but those images handle more complex use cases (e.g. 
> instrumenting java processes with btrace, or reading/reloading configuration 
> from consul).
>  And IMHO in the official hadoop documentation it’s better to suggest using 
> official apache docker images and not external ones (which could be changed).
> {code}
> The following list enumerates the key decision points regarding docker 
> image creation:
> A. automated dockerhub build  / jenkins build
> Docker images could be built on the dockerhub (a branch pattern should be 
> defined for a github repository and the location of the Docker files) or 
> could be built on a CI server and pushed.
> The second one is more flexible (it's easier to create a matrix build, for 
> example).
> The first one has the advantage that we can get an additional flag on the 
> dockerhub that the build is automated (and built from the source by the 
> dockerhub).
> The decision is easy as ASF supports the first approach: (see 
> https://issues.apache.org/jira/browse/INFRA-12781?focusedCommentId=15824096&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15824096)
> B. source: binary distribution or source build
> The second question is about creating the docker image. One option is to 
> build the software on the fly during the creation of the docker image; the 
> other one is to use the binary releases.
> I suggest using the second approach as:
> 1. In that case the hadoop:2.7.3 image would contain exactly the same hadoop 
> distribution as the downloadable one.
> 2. We don't need to add development tools to the image, so the image can be 
> smaller (which is important as the goal of this image is getting 
> started as fast as possible).
> 3. The docker definition will be simpler (and easier to maintain).
> This approach is commonly used in other projects (I checked Apache Zeppelin 
> and Apache Nutch).
> C. branch usage
> Another question is the location of the Dockerfile. It could be on the 
> official source-code branches (branch-2, trunk, etc.) or we can create 
> separate branches for the dockerhub (e.g. docker/2.7, docker/2.8, docker/3.0).
> With the first approach it's easier to find the docker images, but it's less 
> flexible. For example, if we had a Dockerfile in the source code, it would have 
> to be used for every release (for example the Dockerfile from the tag 
> release-3.0.0 would be used for the 3.0 hadoop docker image). In that case 
> the release process is much harder: in case of a Dockerfile error (which 
> could be tested on dockerhub only after the tagging), a new release would have 
> to be made after fixing the Dockerfile.
> Another problem is that when using tags it's not possible to improve the 
> Dockerfiles. I can imagine that we would like to improve, for example, the 
> hadoop:2.7 images (for example by adding smarter startup scripts) while using 
> exactly the same hadoop 2.7 distribution. 
> Finally, with the tag based approach we can't create images for the older 
> releases (2.8.1 for example).
> So I suggest creating separate branches for the Dockerfiles.
> D. Versions
> We can create a separate branch for every version (2.7.1/2.7.2/2.7.3) or 
> just for the main version (2.8/2.7). As these docker images are not for 
> production but for prototyping, I suggest using (at least as a first step) 
> just 2.7/2.8 and updating the images with each bugfix release.
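
To illustrate the branch-per-main-version idea from points C and D, here is a minimal sketch of how a full release number could map to a Dockerfile branch. The function name `docker_branch_for` and the `docker/` prefix handling are assumptions for illustration; only the branch names docker/2.7, docker/2.8 and docker/3.0 come from the proposal itself.

```python
def docker_branch_for(release: str, prefix: str = "docker") -> str:
    """Map a full release such as '2.7.3' to its main-version branch.

    Hypothetical helper: every bugfix release of one line (2.7.1, 2.7.2,
    2.7.3) shares a single branch, so the image is updated in place.
    """
    parts = release.split(".")
    # Require at least numeric major and minor components.
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"not a release version: {release!r}")
    return f"{prefix}/{parts[0]}.{parts[1]}"


for release in ("2.7.1", "2.7.3", "2.8.1", "3.0.0"):
    print(release, "->", docker_branch_for(release))
```

With this mapping, 2.7.1 and 2.7.3 both resolve to the same branch, which is the "update the image during the bugfix release" behaviour described above.
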
> E. Number of images
> There are two options here, too: create a separate image for every component 
> (namenode, datanode, etc.) or just one image, in which case the command has to 
> be defined manually everywhere it is used. The second seems to be more complex 
> (to use), but I think the maintenance is easier, and it's more visible what 
> should be started.
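
A rough sketch of what the single-image option in point E means for the user: the same image is started with a different command per component. The image name `hadoop:2.7`, the component table and the helper `docker_run_args` are assumptions made for this example, not part of the proposal.

```python
import shlex

# Hypothetical table: which command each component runs inside the one image.
COMPONENT_COMMANDS = {
    "namenode": ["hdfs", "namenode"],
    "datanode": ["hdfs", "datanode"],
    "resourcemanager": ["yarn", "resourcemanager"],
    "nodemanager": ["yarn", "nodemanager"],
}


def docker_run_args(component: str, image: str = "hadoop:2.7") -> list:
    """Build the `docker run` argument list for one component.

    The image name is a placeholder; the point is that the command
    must be spelled out explicitly because the image has no default role.
    """
    if component not in COMPONENT_COMMANDS:
        raise KeyError(f"unknown component: {component}")
    return ["docker", "run", "--rm", "--name", component, image,
            *COMPONENT_COMMANDS[component]]


for name in ("namenode", "datanode"):
    print(shlex.join(docker_run_args(name)))
```

The explicit command is the extra work for the user, but it also makes it visible in every compose file or script what is actually being started.
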
> F. Snapshots
> According to the spirit of the Release policy:
> https://www.apache.org/dev/release-distribution.html#unreleased
> We should distribute only final releases to the dockerhub and not snapshots. 
> But we can create an empty hadoop-runner image as well, which contains the 
> starter scripts but not hadoop. It would be used for local development, 
> where the newly built distribution could be mapped into the image with docker 
> volumes.
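
The hadoop-runner idea from point F could look roughly like the sketch below: a locally built distribution directory is mounted into the empty runner image as a volume. The mount point `/opt/hadoop`, the helper name `runner_args` and the example host path are assumptions; only the image name hadoop-runner comes from the description.

```python
from pathlib import Path


def runner_args(dist_dir, command, image="hadoop-runner",
                mount_point="/opt/hadoop"):
    """Build a `docker run` argument list that maps a local build into the image.

    `mount_point` is a hypothetical location; the real starter scripts
    would decide where they expect the distribution to be.
    """
    # Docker needs an absolute host path for a bind mount.
    host_path = Path(dist_dir).resolve()
    return ["docker", "run", "--rm",
            "-v", f"{host_path}:{mount_point}",
            image, *command]


print(runner_args("hadoop-dist/target/hadoop-3.1.0-SNAPSHOT", ["hdfs", "namenode"]))
```

Because hadoop itself is not in the image, no snapshot build is ever published to the dockerhub; only the local volume contains it.
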
