[ 
https://issues.apache.org/jira/browse/HADOOP-14898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382311#comment-16382311
 ] 

Elek, Marton commented on HADOOP-14898:
---------------------------------------

Just a few minor clarifications:

 * Images are built by the dockerhub. INFRA can register the branches on the 
dockerhub, and the dockerhub will automatically fetch the latest branch and 
rebuild the images on every change (from the docker-* branches, which contain 
only the Dockerfiles, not from the real source branches).

 * I proposed to use the binary releases, which are voted on by the PMC 
members, for the image creation. The image will contain exactly the same 
Hadoop. There *won't be an mvn build* during the docker image creation, just 
some packaging (download hadoop and extract it). 

 * The hadoop-runner images won't be released with every hadoop release. 
Ideally we need to create them only once. As they contain only a startup 
script, I think we can handle them outside of the release process (this is how 
it is handled by other Apache projects). 

 * HADOOP-15259 will provide a Dockerfile for developer images with snapshot 
versions of Hadoop, built *locally*. They won't be uploaded to the dockerhub.
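
As an illustration of the "no mvn build, just packaging" point above, such a 
Dockerfile could look roughly like this (the base image tag, version and 
download URL are assumptions for the sketch, not the final implementation):

```dockerfile
# Sketch only: package a PMC-voted binary release; no mvn build involved.
FROM apache/hadoop-runner:latest

# Illustrative version; the real Dockerfile would pin an official release URL.
ENV HADOOP_VERSION=3.0.0
RUN curl -fSL "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" \
      | tar -xz -C /opt \
 && ln -s "/opt/hadoop-${HADOOP_VERSION}" /opt/hadoop
```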
Summary:

 * apache/hadoop-runner:latest -- contains only the base OS + a very simple 
starter script. Only the latest version is available, no releases.
 * apache/hadoop:2 -- contains the latest stable hadoop 2 release (the official 
binary release, approved by the PMC) on top of hadoop-runner:latest.
 * apache/hadoop:3 -- the same, but with the latest hadoop 3 release. (It 
should be updated *after* the release vote, with the official download link.) 
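
Assuming the image names above, trying out a release image would be a 
one-liner (hypothetical usage sketch; the images are not published yet):

```shell
# Pull the proposed release image and check which Hadoop is bundled.
docker pull apache/hadoop:3
docker run --rm apache/hadoop:3 hadoop version
```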

I would be more than happy to join a call if you are interested in how it could 
be used, or to discuss the implementation. Please propose a time slot, or we 
can also continue the discussion here.

> Create official Docker images for development and testing features 
> -------------------------------------------------------------------
>
>                 Key: HADOOP-14898
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14898
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Major
>         Attachments: HADOOP-14898.001.tar.gz, HADOOP-14898.002.tar.gz, 
> HADOOP-14898.003.tgz, docker_design.pdf
>
>
> This is the original mail from the mailing list:
> {code}
> TL;DR: I propose to create official hadoop images and upload them to the 
> dockerhub.
> GOAL/SCOPE: I would like to improve the existing documentation with 
> easy-to-use docker based recipes to start hadoop clusters with various 
> configurations.
> The images also could be used to test experimental features. For example, 
> ozone could be tested easily with this compose file and configuration:
> https://gist.github.com/elek/1676a97b98f4ba561c9f51fce2ab2ea6
> Or even the configuration could be included in the compose file:
> https://github.com/elek/hadoop/blob/docker-2.8.0/example/docker-compose.yaml
> I would like to create separate example compose files for federation, ha, 
> metrics usage, etc. to make it easier to try out and understand the features.
> CONTEXT: There is an existing Jira 
> https://issues.apache.org/jira/browse/HADOOP-13397
> But it’s about a tool to generate production quality docker images (multiple 
> types, in a flexible way). If there are no objections, I will create a 
> separate issue to create simplified docker images for rapid prototyping and 
> investigating new features, and register the branch on the dockerhub to 
> create the images automatically.
> MY BACKGROUND: I have been working with docker based hadoop/spark clusters 
> for quite a while and have run them successfully in different environments 
> (kubernetes, docker-swarm, nomad-based scheduling, etc.). My work is 
> available from here: https://github.com/flokkr, but those images handle more 
> complex use cases (eg. instrumenting java processes with btrace, or 
> reading/reloading configuration from consul).
>  And IMHO in the official hadoop documentation it's better to suggest using 
> official apache docker images rather than external ones (which could change).
> {code}
> The next list enumerates the key decision points regarding docker image 
> creation:
> A. automated dockerhub build  / jenkins build
> Docker images could be built on the dockerhub (a branch pattern and the 
> location of the Dockerfiles should be defined for a github repository) or 
> could be built on a CI server and pushed.
> The second one is more flexible (it's easier to create a matrix build, for 
> example).
> The first one has the advantage that we can get an additional flag on the 
> dockerhub that the build is automated (and built from the source by the 
> dockerhub).
> The decision is easy as ASF supports the first approach: (see 
> https://issues.apache.org/jira/browse/INFRA-12781?focusedCommentId=15824096&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15824096)
> B. source: binary distribution or source build
> The second question is about creating the docker image. One option is to 
> build the software on the fly during the creation of the docker image; the 
> other one is to use the binary releases.
> I suggest the second approach because:
> 1. In that case the hadoop:2.7.3 image could contain exactly the same hadoop 
> distribution as the downloadable one.
> 2. We don't need to add development tools to the image, so the image can be 
> much smaller (which is important, as the goal for this image is getting 
> started as fast as possible).
> 3. The docker definition will be simpler (and easier to maintain).
> This approach is also used by other projects (I checked Apache Zeppelin 
> and Apache Nutch).
> C. branch usage
> The next question is the location of the Dockerfile. It could be on the 
> official source-code branches (branch-2, trunk, etc.) or we can create 
> separate branches for the dockerhub (eg. docker/2.7, docker/2.8, docker/3.0).
> With the first approach it's easier to find the docker images, but it's less 
> flexible. For example, if we had a Dockerfile on the source code branches, it 
> would have to be used for every release (for example, the Dockerfile from the 
> tag release-3.0.0 would be used for the 3.0 hadoop docker image). In that 
> case the release process is much harder: in case of a Dockerfile error (which 
> can be tested on the dockerhub only after the tagging), a new release would 
> have to be made after fixing the Dockerfile.
> Another problem is that when using tags it's not possible to improve the 
> Dockerfiles. I can imagine that we would like to improve for example the 
> hadoop:2.7 images (for example adding smarter startup scripts) while using 
> exactly the same hadoop 2.7 distribution. 
> Finally, with the tag-based approach we can't create images for the older 
> releases (2.8.1, for example).
> So I suggest creating separate branches for the Dockerfiles.
> D. Versions
> We can create a separate branch for every version (2.7.1/2.7.2/2.7.3) or 
> just for the main versions (2.8/2.7). As these docker images are not for 
> production but for prototyping, I suggest using (at least as a first step) 
> just the 2.7/2.8 branches and updating the images with each bugfix release.
> E. Number of images
> There are two options here, too: create a separate image for every component 
> (namenode, datanode, etc.) or just one image, with the command defined 
> manually everywhere. The second seems more complex (to use), but I think the 
> maintenance is easier, and it's more visible what should be started.
> F. Snapshots
> According to the spirit of the Release policy:
> https://www.apache.org/dev/release-distribution.html#unreleased
> We should distribute only final releases to the dockerhub, not snapshots. 
> But we can create an empty hadoop-runner image as well, which contains the 
> starter scripts but not hadoop. It would be used for local development, 
> where the newly built distribution could be mapped into the image with docker 
> volumes.
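
The volume-based development setup described in point F could be sketched in a 
compose file like this (the image name and the dist path are assumptions for 
illustration):

```yaml
# Sketch: hadoop-runner contains no hadoop; the locally built SNAPSHOT
# distribution is mapped into the container as a volume.
version: "3"
services:
  namenode:
    image: apache/hadoop-runner:latest
    volumes:
      - ../hadoop-dist/target/hadoop-3.1.0-SNAPSHOT:/opt/hadoop
    command: ["hdfs", "namenode"]
```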



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
