
For IO ITs that use data stores that need custom docker images in order to
run, we can't currently use them in a kubernetes cluster (which is where we
host our data stores). I have a couple of options for how to solve this and am
looking for feedback from folks involved in creating IO ITs/opinions on


We've discussed in the past that we'll want to allow developers to submit
just a dockerfile, and then we'll use that when creating the data store on
kubernetes. This is the case for ElasticsearchIO and I assume more data
stores in the future will want to do this. It's also looking like it'll be
necessary to use custom docker images for the HadoopInputFormatIO's
cassandra ITs - to run a cassandra cluster, there doesn't seem to be a good
image you can use out of the box.

In either case, in order to retrieve a docker image, kubernetes needs a
container registry - it will read the docker images from there. A simple
private container registry doesn't work because kubernetes config files are
static - the files would point at the private container registry, so local
devs who try to use them wouldn't be able to retrieve the images since they
don't have access. They'd have to manually edit the files, which in theory
is an option, but I don't consider that to be acceptable since it feels
pretty unfriendly (it is simple, so if we really don't like the below
options we can revisit it).

Quick summary of the options


We can:

* Start using something like k8 helm - this adds more dependencies, adds a
small amount of complexity (this is my recommendation, but only by a little)

* Start pushing images to docker hub - this means they'll be publicly
visible and raises the bar for maintenance of those images

* Host our own public container registry - this means running our own
public service with costs, etc.

Below are detailed discussions of these options. You can skip to the "My
thoughts on this" section if you're not interested in the details.

1. Templated kubernetes scripts


Kubernetes (k8) does not currently have built-in support for parameterizing
scripts - there's an issue open for this[1], but it doesn't seem to be
very active.

There are tools like Kubernetes helm that allow users to specify parameters
when running their kubernetes scripts. They also enable a lot more (they're
probably closer to a package manager like apt-get) - see this
description[3] for an overview.

I'm open to other options besides helm, but it seems to be the officially
supported one.
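
To make the parameterization idea concrete, here is a minimal sketch in
Python. It is not helm's template syntax, just plain string substitution,
and the manifest, the `cassandra-it` image name and the `localhost:5000`
registry are all made up for illustration. The only point is that the
registry becomes a value each user supplies instead of something baked
into the file.

```python
from string import Template

# Hypothetical, heavily simplified pod manifest. $imageRepo is the single
# parameter; everything else is static, like our k8 scripts are today.
MANIFEST_TEMPLATE = Template("""\
apiVersion: v1
kind: Pod
metadata:
  name: cassandra-it
spec:
  containers:
  - name: cassandra
    image: $imageRepo/cassandra-it:latest
""")


def render_manifest(image_repo: str) -> str:
    """Return the manifest pointing at a registry this user can pull from."""
    return MANIFEST_TEMPLATE.substitute(imageRepo=image_repo)


if __name__ == "__main__":
    # A local dev and the CI server would each pass their own registry.
    print(render_manifest("localhost:5000"))
```

A local dev would pass their own registry here, and the beam CI server
would pass whatever registry it has access to.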

How the world would look using helm:

* When developing an IO IT, someone (either the developer or one of us),
would need to create a chart (the name for the helm script) - it's
basically another set of config files but in theory is as simple as a
couple metadata files plus a templatized version of a regular k8 script.
This should be trivial compared to the task of creating a k8 script.

* When creating an instance of a data store, the developer (or the beam CI
server) would first build the docker image for the data store and push to
their container registry, then run a command like `helm install -f
mydb.yaml --set imageRepo=`

* When done running tests/developing/etc., the developer/beam CI server
would run `helm delete -f mydb.yaml`
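
Roughly, that workflow boils down to the command sequence sketched below.
This is only an illustration under assumptions: it uses Helm 3 style
syntax (`helm install RELEASE CHART`, `helm uninstall RELEASE`), which is
not exactly the shorthand above, and the registry address, image name,
release name and chart path are all hypothetical. The `docker run ...
registry:2` line is the commonly documented way to start a local registry
container. The script only builds and prints the commands; it does not
execute anything.

```python
import shlex
from typing import List


def local_registry_command() -> List[str]:
    """Start a throwaway registry on localhost:5000 (optional step)."""
    return ["docker", "run", "-d", "-p", "5000:5000",
            "--name", "registry", "registry:2"]


def setup_commands(image_repo: str, image_name: str, dockerfile_dir: str,
                   release: str, chart_dir: str) -> List[List[str]]:
    """Build and push the data store image, then install the chart."""
    image = f"{image_repo}/{image_name}"
    return [
        ["docker", "build", "-t", image, dockerfile_dir],
        ["docker", "push", image],
        # imageRepo is the (assumed) chart value that selects the registry.
        ["helm", "install", release, chart_dir,
         "--set", f"imageRepo={image_repo}"],
    ]


def teardown_commands(release: str) -> List[List[str]]:
    """Remove the data store once tests are done."""
    return [["helm", "uninstall", release]]


if __name__ == "__main__":
    steps = [local_registry_command()]
    steps += setup_commands("localhost:5000", "cassandra-it:latest",
                            "./cassandra", "hifio-cassandra",
                            "./charts/cassandra")
    steps += teardown_commands("hifio-cassandra")
    for cmd in steps:
        print(shlex.join(cmd))
```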


Upsides:

* Something like helm is pretty interesting - we talked about it as an
upside and something we wanted to do when we talked about using kubernetes

* We pick up a set of working kubernetes scripts this way. The full list is
at [2], but some ones that stood out: mongodb, memcached, mysql, postgres,
redis, elasticsearch (incubating), kafka (incubating), zookeeper
(incubating) - this could speed development


Downsides:

* Adds an additional dependency to run our ITs (helm or another k8
templating tool)

* Requires people to build their own images and run a container registry if
they don't already have one. It will not surprise you that there's a docker
image for running the registry [0], so it's not crazy. :) I *think* this
will probably just be a simple one/two line command once we have it.

* Helm in particular is kind of heavyweight for what we really need - it
requires running a service in the k8 cluster and adds additional complexity.

* Adds to the complexity of creating a new kubernetes script. Until I've
tried it, I can't really speak to the complexity, but taking a look at the
instructions [4], it doesn't seem too bad.

2. Push images to docker hub


This requires that users push images that we want to use to docker hub, and
then our IO ITs will rely on that. I think the developer of the dockerfile
should be responsible for the image - having the beam project responsible
for a publicly available artifact (like the docker images) outside of our
core deliverables doesn't seem like the right move.

We would still retain a copy of the source dockerfiles and could regenerate
the images at any time, so I'm not concerned about a scenario where docker
hub went away (it would be pretty simple to switch to another repo - just
change some config files.)
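
As a rough sketch of what "switch to another repo" means for the image
references in those config files, the helper below swaps the registry host
of an image name. It assumes the usual docker convention that the first
path component is a registry only if it contains a `.` or `:` or is
`localhost` (otherwise the image implicitly lives on docker hub). The
function names and example images are hypothetical.

```python
from typing import Tuple


def split_registry(image: str) -> Tuple[str, str]:
    """Split an image reference into (registry, remainder).

    The registry is "" when the reference implicitly points at docker hub.
    """
    first, sep, rest = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    return "", image


def retarget_image(image: str, new_registry: str) -> str:
    """Point an image reference at a different registry host."""
    _, remainder = split_registry(image)
    return f"{new_registry}/{remainder}"


if __name__ == "__main__":
    for ref in ("beam/cassandra-it:latest",
                "localhost:5000/cassandra-it:latest"):
        print(ref, "->", retarget_image(ref, "registry.example.com"))
```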

For someone running the k8 scripts (i.e., running the IO ITs), this is pretty
easy - they just run the k8 script like they do today.

For someone creating the k8 scripts (i.e., creating the IO ITs), this is more
complex - either they or we have to push the image to docker hub and make
sure it's up to date, etc.


Upsides:

* No additional complexity for IO IT runners.


Downsides:

* Higher bar for creating the image in the first place - someone has to
maintain the publicly available docker hub image.

* It seems weird to have a custom docker image up on docker hub - maybe
that's common, but if we need specific changes to images for our needs, I'd
prefer it be private.

3. Run our own *public* container registry


We would run a beam-specific container registry service - it would be used
by the apache beam CI servers, but it would also be available for use by
anyone running beam IO ITs on their local dev setup.

From an IO IT creator's perspective, this would look pretty similar to how
things are now - they just check in a dockerfile. For someone running the
k8 scripts, they similarly don't need to think about it.


Upsides:

* We're not adding any additional complexity for the end developer.


Downsides:

* Have to keep docker registry software up to date

* The service is a single point of failure for any beam devs running IO ITs

* It can incur costs, etc… As an open source project, it doesn't seem great
for us to be running a public service.

My thoughts on this


In spite of the additional complexity, I think using k8 helm is probably
the best option. The general goal behind the IO ITs has been to keep
ourselves self-contained: avoid having centralized infrastructure for those
running the ITs. Helm is a good match for those criteria. I will admit that
I find the additional dependencies/complexity to be worrisome. However, I
really like the idea of picking up additional data store configs for free -
if we were doing this in 5 years, we'd say "we should just use the
ecosystem of helm charts" and go from there.

I do think that pushing images to docker hub is a viable option, and if the
community is more excited to do that/wants to push the images there, I'd
support it. I can see how folks would be hesitant. I would like for the
developer of the docker file to do

Of the 3 options, I would strongly push back against running a public
container registry - I would not want to administer it, and I don't think
we as a project want to be paying for the costs associated with it.

Next steps


Let me know what you think! This is definitely a topic where understanding
what the community of IO devs wants is helpful. As we discuss, I'll
probably spend a little time exploring helm since I want to play around
with it and understand if there are other drawbacks. I ran into this
question while working on getting the HIFIO cassandra cluster running, so I
might prototype with that.

I'll create a JIRA for this in the next day or so.


[0] docker registry container -

[1] kubernetes issue open for supporting templates -

[2] set of available charts -

[3] kubernetes helm introduction -

[4] kubernetes charts instructions -
