[ 
https://issues.apache.org/jira/browse/HADOOP-14898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381264#comment-16381264
 ] 

Anu Engineer commented on HADOOP-14898:
---------------------------------------

{quote}The infra process linked through the document (INFRA-12781) wasn't clear 
to me. We create branches- this proposes a base image, {{hadoop2}}, and 
{{hadoop3}}- and that will be pushed to the ASF repo on Docker Hub?
{quote}
[~chris.douglas] Thanks for the comments. ([~elek] Please correct me if I am 
saying something stupid; my understanding here is not very deep, but I will 
try to answer to the best of it.)
 # We will have 2 base images – one for version 2 of Hadoop and one for 
version 3 of Hadoop.
 # When we deploy, for example with {{docker-compose pull; docker-compose 
up}}, we pull down these base images and start two or more containers: one or 
more Namenodes and a set of Datanodes (see the compose sketch after this list).
 # The {{docker-compose}} command will work directly against Docker Hub. When 
we release a new version of Hadoop, we will update the corresponding base 
images – I asked [~elek] in a private conversation whether we need one image 
per release, and we concluded we don't for now, since the base images change 
very little between releases – hence one base image for Hadoop 2 and another 
for Hadoop 3.
 # In the release tarball, we will ship the base image or a pointer to it, so 
that the PMC can try it out and approve it. One major win, at least in my 
mind, is that more developers and PMC members will be willing to try out 
release candidates, since launching a Docker-based Hadoop cluster becomes a 
single command.
 # Once the release is signed off, we will push those base images to Docker 
Hub. INFRA owns the Docker Hub account credentials, so we will have to file a 
ticket to update these images, since we are going to use the Apache Docker 
Hub account. I am told many other Apache projects follow this process.
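
For illustration, a minimal compose file for the scenario in point 2 (one 
Namenode plus Datanodes started from a shared base image) could look like the 
sketch below. The {{apache/hadoop}} image name and tag are assumptions here, 
not the final naming:

{code}
version: "3"
services:
  namenode:
    image: apache/hadoop:3          # hypothetical Hadoop 3 base image
    command: ["hdfs", "namenode"]
    ports:
      - "9870:9870"                 # Namenode web UI (Hadoop 3 default port)
  datanode:
    image: apache/hadoop:3          # same base image, different command
    command: ["hdfs", "datanode"]
{code}

Scaling out is then a single flag: {{docker-compose up --scale datanode=3}}.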

If you like, *I can set up a call so we can discuss and hash out the details*. 
Please let me know if that would be useful.

From personal experience on the Ozone branch, I can tell you that most devs 
make a change and actually test it before pushing, ever since [~elek] added 
that support for Ozone. It has been a great QE tool for Ozone, and I feel it 
will be very useful to have this feature for HDFS.

 

> Create official Docker images for development and testing features 
> -------------------------------------------------------------------
>
>                 Key: HADOOP-14898
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14898
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Major
>         Attachments: HADOOP-14898.001.tar.gz, HADOOP-14898.002.tar.gz, 
> HADOOP-14898.003.tgz, docker_design.pdf
>
>
> This is the original mail from the mailing list:
> {code}
> TL;DR: I propose to create official Hadoop images and upload them to Docker 
> Hub.
> GOAL/SCOPE: I would like to improve the existing documentation with 
> easy-to-use, Docker-based recipes to start Hadoop clusters in various 
> configurations.
> The images could also be used to test experimental features. For example, 
> Ozone could be tested easily with this compose file and configuration:
> https://gist.github.com/elek/1676a97b98f4ba561c9f51fce2ab2ea6
> The configuration could even be included in the compose file (a sketch of 
> this shape follows after this mail):
> https://github.com/elek/hadoop/blob/docker-2.8.0/example/docker-compose.yaml
> I would like to create separate example compose files for federation, HA, 
> metrics usage, etc. to make it easier to try out and understand the features.
> CONTEXT: There is an existing Jira 
> https://issues.apache.org/jira/browse/HADOOP-13397
> but it is about a tool to generate production-quality Docker images 
> (multiple types, in a flexible way). If there are no objections, I will 
> create a separate issue to create simplified Docker images for rapid 
> prototyping and investigating new features, and register the branch on 
> Docker Hub to build the images automatically.
> MY BACKGROUND: I have been working with Docker-based Hadoop/Spark clusters 
> for quite a while and have run them successfully in different environments 
> (Kubernetes, Docker Swarm, Nomad-based scheduling, etc.). My work is 
> available at https://github.com/flokkr, but those images handle more complex 
> use cases (e.g. instrumenting Java processes with btrace, or 
> reading/reloading configuration from Consul).
> And IMHO the official Hadoop documentation should suggest using official 
> Apache Docker images rather than external ones (which could change at any 
> time).
> {code}
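> Regarding configuration embedded in the compose file, one possible shape is 
> below. This is a minimal sketch: the image name and the FILENAME_key=value 
> environment-variable convention are assumptions modeled on the linked 
> examples, where the startup scripts render such variables into the 
> corresponding config files.
> {code}
> version: "3"
> services:
>   namenode:
>     image: flokkr/hadoop:2.8.0          # hypothetical image name
>     command: ["hdfs", "namenode"]
>     environment:
>       # assumed convention: CORE-SITE.XML_<key> becomes a property
>       # in core-site.xml at container startup
>       CORE-SITE.XML_fs.defaultFS: "hdfs://namenode:9000"
>       HDFS-SITE.XML_dfs.replication: "1"
> {code}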
> The following list enumerates the key decision points regarding Docker 
> image creation.
> A. Automated Docker Hub build / Jenkins build
> Docker images could be built on Docker Hub (a branch pattern and the 
> location of the Dockerfiles must be defined for a GitHub repository), or 
> they could be built on a CI server and pushed.
> The second approach is more flexible (it is easier to create a matrix 
> build, for example).
> The first has the advantage that the images get an additional flag on 
> Docker Hub marking the build as automated (and built from source by Docker 
> Hub).
> The decision is easy, as the ASF supports the first approach (see 
> https://issues.apache.org/jira/browse/INFRA-12781?focusedCommentId=15824096&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15824096).
> B. Source: binary distribution or source build
> The second question is how to create the Docker image. One option is to 
> build the software on the fly during image creation; the other is to use 
> the binary releases.
> I suggest the second approach because:
> 1. In that case hadoop:2.7.3 can contain exactly the same Hadoop 
> distribution as the downloadable one.
> 2. We don't need to add development tools to the image, so the image can be 
> much smaller (which is important, as the goal of this image is getting 
> started as fast as possible).
> 3. The Docker definition will be simpler (and easier to maintain).
> Other projects usually take this approach as well (I checked Apache Zeppelin 
> and Apache Nutch). A sketch of such a Dockerfile follows below.
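> A minimal Dockerfile sketch in this spirit; the base image, version, and 
> mirror URL are illustrative assumptions, not a settled choice:
> {code}
> # Build the image from the released binary tarball, not from source.
> FROM openjdk:8-jdk-slim
> ENV HADOOP_VERSION=2.7.3
> RUN apt-get update && apt-get install -y wget ca-certificates \
>  && wget -q https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz \
>  && tar -xzf hadoop-${HADOOP_VERSION}.tar.gz -C /opt \
>  && rm hadoop-${HADOOP_VERSION}.tar.gz
> ENV HADOOP_HOME=/opt/hadoop-${HADOOP_VERSION}
> ENV PATH=$PATH:${HADOOP_HOME}/bin
> {code}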
> C. Branch usage
> The next question is the location of the Dockerfile. It could live on the 
> official source-code branches (branch-2, trunk, etc.), or we could create 
> separate branches for Docker Hub (e.g. docker/2.7, docker/2.8, docker/3.0).
> With the first approach it is easier to find the Dockerfiles, but it is 
> less flexible. For example, if the Dockerfile lived in the source code, it 
> would have to be used for every release (e.g. the Dockerfile from the tag 
> release-3.0.0 would be used for the 3.0 Hadoop Docker image). In that case 
> the release process becomes much harder: if the Dockerfile has an error 
> (which can be tested on Docker Hub only after tagging), a new release would 
> be needed just to fix the Dockerfile.
> Another problem is that with tags it is not possible to improve the 
> Dockerfiles. I can imagine that we would like to improve, for example, the 
> hadoop:2.7 images (say, by adding smarter startup scripts) while using 
> exactly the same Hadoop 2.7 distribution. 
> Finally, with the tag-based approach we cannot create images for older 
> releases (2.8.1, for example).
> So I suggest creating separate branches for the Dockerfiles.
> D. Versions
> We can create a separate branch for every version (2.7.1/2.7.2/2.7.3) or 
> just for each main version (2.7/2.8). As these Docker images are not for 
> production but for prototyping, I suggest (at least as a first step) using 
> just 2.7/2.8 and updating the images with each bugfix release.
> E. Number of images
> There are two options here, too: create a separate image for every 
> component (namenode, datanode, etc.), or just one image, with the command 
> defined manually everywhere. The second seems more complex to use, but I 
> think it is easier to maintain, and it makes more visible what gets started 
> (the compose sketches earlier in this message follow this one-image shape).
> F. Snapshots
> According to the spirit of the release policy
> https://www.apache.org/dev/release-distribution.html#unreleased
> we should distribute only final releases to Docker Hub, not snapshots. But 
> we can also create an empty hadoop-runner image, which contains the starter 
> scripts but not Hadoop itself. It would be used for local development, 
> where a freshly built distribution could be mapped into the image with 
> Docker volumes (see the sketch below).
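> A minimal sketch of that development setup; the runner image name and the 
> build-output path are assumptions for illustration:
> {code}
> version: "3"
> services:
>   namenode:
>     image: apache/hadoop-runner    # hypothetical: startup scripts only, no Hadoop
>     command: ["hdfs", "namenode"]
>     volumes:
>       # mount the locally built distribution over the expected install dir
>       - ../hadoop-dist/target/hadoop-3.1.0:/opt/hadoop
> {code}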



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
