Elek, Marton created HADOOP-14898:
-------------------------------------
Summary: Create official Docker images for development and testing
features
Key: HADOOP-14898
URL: https://issues.apache.org/jira/browse/HADOOP-14898
Project: Hadoop Common
Issue Type: Improvement
Reporter: Elek, Marton
Assignee: Elek, Marton
This is the original mail from the mailing list:
{code}
TL;DR: I propose to create official hadoop images and upload them to the
dockerhub.
GOAL/SCOPE: I would like to improve the existing documentation with easy-to-use
docker-based recipes to start hadoop clusters with various configurations.
The images could also be used to test experimental features. For example ozone
could be tested easily with this compose file and configuration:
https://gist.github.com/elek/1676a97b98f4ba561c9f51fce2ab2ea6
Or even the configuration could be included in the compose file:
https://github.com/elek/hadoop/blob/docker-2.8.0/example/docker-compose.yaml
I would like to create separate example compose files for federation, ha,
metrics usage, etc. to make it easier to try out and understand the features.
CONTEXT: There is an existing Jira
https://issues.apache.org/jira/browse/HADOOP-13397
But it’s about a tool to generate production quality docker images (multiple
types, in a flexible way). If there are no objections, I will create a separate
issue to create simplified docker images for rapid prototyping and
investigating new features, and register the branch on the dockerhub to create
the images automatically.
MY BACKGROUND: I have been working with docker-based hadoop/spark clusters for
quite a while and run them successfully in different environments (kubernetes,
docker-swarm, nomad-based scheduling, etc.). My work is available from here:
https://github.com/flokkr but those images could handle more complex use cases
(e.g. instrumenting java processes with btrace, or reading/reloading
configuration from consul).
And IMHO in the official hadoop documentation it’s better to suggest using
official apache docker images and not external ones (which could be changed).
{code}
The following list enumerates the key decision points regarding docker image
creation:
A. automated dockerhub build / jenkins build
Docker images could be built on the dockerhub (a branch pattern should be
defined for a github repository, together with the location of the Dockerfiles)
or could be built on a CI server and pushed.
The second one is more flexible (it's easier to create a matrix build, for
example).
The first one has the advantage that we get an additional flag on the
dockerhub showing that the build is automated (and built from the source by the
dockerhub).
The decision is easy as ASF supports the first approach: (see
https://issues.apache.org/jira/browse/INFRA-12781?focusedCommentId=15824096&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15824096)
B. source: binary distribution or source build
The second question is about creating the docker image. One option is to build
the software on the fly during the creation of the docker image; the other one
is to use the binary releases.
I suggest using the second approach as:
1. In that case the hadoop:2.7.3 image could contain exactly the same hadoop
distribution as the downloadable one
2. We don't need to add development tools to the image, so the image could be
smaller (which is important as the goal for this image is getting started as
fast as possible)
3. The docker definition will be simpler (and easier to maintain)
This approach is commonly used in other projects (I checked Apache Zeppelin and
Apache Nutch).
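As an illustration of point 1, here is a minimal Python sketch of how an image
tag could be tied to the binary release it packages. The helper name
release_artifact, the image name "hadoop", and the use of the
archive.apache.org download layout are assumptions made for this example, not
part of the proposal.

```python
from typing import Tuple


def release_artifact(version: str) -> Tuple[str, str]:
    """Return (image tag, binary tarball URL) for a released Hadoop version."""
    # Assumption: binary tarballs follow the archive.apache.org layout.
    base = "https://archive.apache.org/dist/hadoop/common"
    url = f"{base}/hadoop-{version}/hadoop-{version}.tar.gz"
    # Assumption: the image is simply named "hadoop" and tagged by version.
    return f"hadoop:{version}", url


if __name__ == "__main__":
    tag, url = release_artifact("2.7.3")
    print(tag)
    print(url)
```

With this kind of mapping the image only has to download and unpack one
tarball, so no compiler or build tool needs to be installed in it.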
C. branch usage
Another question is the location of the Dockerfile. It could be on the official
source-code branches (branch-2, trunk, etc.) or we can create separate
branches for the dockerhub (e.g. docker/2.7 docker/2.8 docker/3.0).
With the first approach it's easier to find the docker images, but it's less
flexible. For example, if we had a Dockerfile in the source code, it should
be used for every release (for example the Dockerfile from the tag
release-3.0.0 should be used for the 3.0 hadoop docker image). In that case the
release process is much harder: in case of a Dockerfile error (which could
be tested on dockerhub only after the tagging), a new release should be added
after fixing the Dockerfile.
Another problem is that with tags it's not possible to improve the
Dockerfiles. I can imagine that we would like to improve, for example, the
hadoop:2.7 images (for example by adding smarter startup scripts) while using
exactly the same hadoop 2.7 distribution.
Finally, with the tag-based approach we can't create images for the older
releases (2.8.1 for example).
So I suggest creating separate branches for the Dockerfiles.
D. Versions
We can create a separate branch for every version (2.7.1/2.7.2/2.7.3) or just
for the main version (2.8/2.7). As these docker images are not for
production but for prototyping, I suggest using (at least as a first step) just
the 2.7/2.8 branches and updating the images with each bugfix release.
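The suggested scheme can be sketched in a few lines of Python: every bugfix
release of a line maps to the same docker/<major>.<minor> branch and therefore
to the same image tag. The function names and the "hadoop" image name are
hypothetical; only the docker/2.7 style branch names come from the proposal.

```python
def docker_branch(version: str) -> str:
    """Map a release such as 2.7.3 to its docker/<major>.<minor> branch."""
    parts = version.split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"not a release version: {version!r}")
    return f"docker/{parts[0]}.{parts[1]}"


def image_tag(version: str) -> str:
    """Return the shared tag of a release line (assumed image name: hadoop)."""
    line = docker_branch(version).split("/", 1)[1]
    return f"hadoop:{line}"


if __name__ == "__main__":
    for release in ("2.7.1", "2.7.2", "2.7.3", "2.8.1"):
        print(release, docker_branch(release), image_tag(release))
```

A new bugfix release then only requires updating the existing branch, not
creating a new one.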
E. Number of images
There are two options here, too: create a separate image for every component
(namenode, datanode, etc.) or just one, in which case the command should be
defined everywhere manually. The second seems to be more complex (to use), but
I think the maintenance is easier, and it's more visible what should be started.
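To make the single-image option concrete, the following Python sketch builds a
mapping shaped like the services section of a compose file: every component
uses the same image and differs only in its command. The image name hadoop:2.7
and the helper name are assumptions; the commands are the standard hdfs and
yarn launchers.

```python
import json

IMAGE = "hadoop:2.7"  # hypothetical image name

# One explicit command per component; the image itself is identical.
COMMANDS = {
    "namenode": ["hdfs", "namenode"],
    "datanode": ["hdfs", "datanode"],
    "resourcemanager": ["yarn", "resourcemanager"],
    "nodemanager": ["yarn", "nodemanager"],
}


def compose_services(image: str = IMAGE) -> dict:
    """Build a services mapping where each component reuses one image."""
    return {
        name: {"image": image, "command": list(cmd)}
        for name, cmd in COMMANDS.items()
    }


if __name__ == "__main__":
    print(json.dumps({"services": compose_services()}, indent=2))
```

The cost is that each service has to spell out its command, but the reader can
see at a glance which process every container starts.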
F. Snapshots
According to the spirit of the release policy:
https://www.apache.org/dev/release-distribution.html#unreleased
we should distribute only final releases to the dockerhub and not snapshots.
But we can create an empty hadoop-runner image as well, which contains the
starter scripts but not hadoop. It would be used for local development, where
the newly built distribution could be mapped into the image with docker volumes.
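A rough Python sketch of how such a runner image might be invoked is shown
here. It only assembles the docker run command line. The mount point
/opt/hadoop, the helper name, and the example path are assumptions; the
proposal names only the hadoop-runner image and the use of docker volumes.

```python
import os
from typing import List


def runner_command(
    dist_dir: str,
    component_cmd: List[str],
    image: str = "hadoop-runner",
    mount_point: str = "/opt/hadoop",  # hypothetical location inside the image
) -> List[str]:
    """Assemble a docker run call that mounts a locally built distribution."""
    host_path = os.path.abspath(dist_dir)  # docker -v needs an absolute path
    volume = f"{host_path}:{mount_point}"
    return ["docker", "run", "--rm", "-v", volume, image, *component_cmd]


if __name__ == "__main__":
    # Example path only; point it at the directory of a local build.
    cmd = runner_command("path/to/hadoop-dist", ["hdfs", "namenode"])
    print(" ".join(cmd))
```

Because hadoop itself arrives through the volume, the published image stays
free of unreleased code.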