I'd like to hear what direction folks want to go in, and from there look at
the options. For some of these options (like running our own public
registry), Apache infra may be able to help, and it's something we should
look into, but I don't assume they have time to work on this type of issue.

S

On Tue, Apr 4, 2017 at 10:00 AM Lukasz Cwik <lc...@google.com.invalid>
wrote:

> Is this something that Apache infra could help us with?
>
> On Mon, Apr 3, 2017 at 7:22 PM, Stephen Sisk <s...@google.com.invalid>
> wrote:
>
> > Summary:
> >
> > For IO ITs that use data stores that need custom docker images in order
> > to run, we can't currently use them in a kubernetes cluster (which is
> > where we host our data stores.) I have a couple of options for how to
> > solve this and am looking for feedback from folks involved in creating
> > IO ITs/opinions on kubernetes.
> >
> >
> > Details:
> >
> > We've discussed in the past that we'll want to allow developers to
> > submit just a dockerfile, and then we'll use that when creating the data
> > store on kubernetes. This is the case for ElasticsearchIO and I assume
> > more data stores in the future will want to do this. It's also looking
> > like it'll be necessary to use custom docker images for the
> > HadoopInputFormatIO's cassandra ITs - to run a cassandra cluster, there
> > doesn't seem to be a good image you can use out of the box.
> >
> > In either case, in order to retrieve a docker image, kubernetes needs a
> > container registry - it will read the docker images from there. A simple
> > private container registry doesn't work because kubernetes config files
> > are static - if local devs tried to use the kubernetes files, the files
> > would point at the private container registry, and the devs wouldn't be
> > able to retrieve the images since they don't have access. They'd have to
> > manually edit the files, which in theory is an option, but I don't
> > consider that to be acceptable since it feels pretty unfriendly (it is
> > simple, though, so if we really don't like the below options we can
> > revisit it.)
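> >
> > To make that concrete, here's a minimal sketch of the problem - the
> > image name and registry address below are hypothetical, not from our
> > actual configs:
> >
> >   # data-store.yaml (sketch): the image field is hard-coded to a private
> >   # registry, so anyone without access to 10.0.0.5:5000 can't pull it
> >   spec:
> >     containers:
> >       - name: my-datastore
> >         image: 10.0.0.5:5000/beam/my-datastore:latest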
> >
> > Quick summary of the options
> >
> > =======================
> >
> > We can:
> >
> > * Start using something like k8 helm - this adds more dependencies and
> > a small amount of complexity (this is my recommendation, but only by a
> > little)
> >
> > * Start pushing images to docker hub - this means they'll be publicly
> > visible and raises the bar for maintenance of those images
> >
> > * Host our own public container registry - this means running our own
> > public service with costs, etc..
> >
> > Below are detailed discussions of these options. You can skip to the "My
> > thoughts on this" section if you're not interested in the details.
> >
> >
> > 1. Templated kubernetes images
> >
> > =========================
> >
> > Kubernetes (k8) does not currently have built-in support for
> > parameterizing scripts - there's an issue open for this [1], but it
> > doesn't seem to be very active.
> >
> > There are tools like Kubernetes helm that allow users to specify
> > parameters when running their kubernetes scripts. They also enable a
> > lot more (they're probably closer to a package manager like apt-get) -
> > see this description [3] for an overview.
> >
> > I'm open to other options besides helm, but it seems to be the officially
> > supported one.
> >
> > How the world would look using helm:
> >
> > * When developing an IO IT, someone (either the developer or one of us)
> > would need to create a chart (the name for the helm script) - it's
> > basically another set of config files, but in theory is as simple as a
> > couple of metadata files plus a templatized version of a regular k8
> > script (there's a sketch of what that looks like after this list). This
> > should be trivial compared to the task of creating a k8 script.
> >
> > * When creating an instance of a data store, the developer (or the beam
> > CI server) would first build the docker image for the data store and
> > push it to their container registry, then run a command like `helm
> > install -f mydb.yaml --set imageRepo=1.2.3.4`
> >
> > * When done running tests/developing/etc… the developer/beam CI server
> > would run `helm delete -f mydb.yaml`
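> >
> > As a rough sketch of the templating (the file layout and value names
> > here are made up for illustration, not taken from an existing chart),
> > the chart template would replace the hard-coded image with a parameter
> > that `--set imageRepo=...` fills in at install time:
> >
> >   # templates/datastore.yaml (sketch): the registry becomes a value the
> >   # user supplies, instead of being baked into the config
> >   spec:
> >     containers:
> >       - name: my-datastore
> >         image: "{{ .Values.imageRepo }}/my-datastore:latest"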
> >
> > Upsides:
> >
> > * Something like helm is pretty interesting - we talked about it as an
> > upside and something we wanted to do when we talked about using
> > kubernetes
> >
> > * We pick up a set of working kubernetes scripts this way. The full
> > list is at [2], but some that stood out: mongodb, memcached, mysql,
> > postgres, redis, elasticsearch (incubating), kafka (incubating),
> > zookeeper (incubating) - this could speed development
> >
> > Downsides:
> >
> > * Adds an additional dependency to run our ITs (helm or another k8
> > templating tool)
> >
> > * Requires people to build their own images and run a container
> > registry if they don't already have one. It will not surprise you that
> > there's a docker image for running the registry [0], so it's not crazy.
> > :) I *think* this will probably just be a simple one/two line command
> > once we have it scripted - there's a sketch after this list.
> >
> > * Helm in particular is kind of heavyweight for what we really need - it
> > requires running a service in the k8 cluster and adds additional
> > complexity.
> >
> > * Adds to the complexity of creating a new kubernetes script. Until
> > I've tried it, I can't really speak to the complexity, but taking a
> > look at the instructions [4], it doesn't seem too bad.
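> >
> > For the registry point above, here's a minimal sketch of what that
> > scripted command could look like, assuming the standard registry image
> > [0] and a hypothetical image name:
> >
> >   # run a local registry on port 5000, then build the data store image
> >   # and push it there so the k8 cluster can pull it
> >   docker run -d -p 5000:5000 --name registry registry:2
> >   docker build -t localhost:5000/my-datastore:latest .
> >   docker push localhost:5000/my-datastore:latest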
> >
> >
> >
> >
> > 2. Push images to docker hub
> >
> > =======================
> >
> > This requires that users push images that we want to use to docker hub,
> > and then our IO ITs will rely on that. I think the developer of the
> > dockerfile should be responsible for the image - having the beam project
> > responsible for a publicly available artifact (like the docker images)
> > outside of our core deliverables doesn't seem like the right move.
> >
> > We would still retain a copy of the source dockerfiles and could
> > regenerate the images at any time, so I'm not concerned about a scenario
> > where docker hub went away (it would be pretty simple to switch to
> > another repo - just change some config files.)
> >
> > For someone running the k8 scripts (ie, running the IO ITs), this is
> > pretty easy - they just run the k8 script like they do today.
> >
> > For someone creating the k8 scripts (ie, creating the IO ITs), this is
> > more complex - either they or we have to push the image to docker hub
> > and make sure it's up to date, etc.
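> >
> > For reference, pushing an image to docker hub would be something along
> > these lines (the account and image names are made up, and this assumes
> > you've already run `docker login`):
> >
> >   # build from the checked-in dockerfile, then push to a docker hub repo
> >   docker build -t myaccount/beam-cassandra:latest .
> >   docker push myaccount/beam-cassandra:latest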
> >
> >
> > Upsides:
> >
> > * No additional complexity for IO IT runners.
> >
> > Downsides:
> >
> > * Higher bar for creating the image in the first place - someone has to
> > maintain the publicly available docker hub image.
> >
> > * It seems weird to have a custom docker image up on docker hub - maybe
> > that's common, but if we need specific changes to images for our needs,
> > I'd prefer it be private.
> >
> >
> > 3. Run our own *public* container registry
> >
> > ==============================================
> >
> > We would run a beam-specific container registry service - it would be
> > used by the apache beam CI servers, but it would also be available for
> > use by anyone running beam IO ITs on their local dev setup.
> >
> > From an IO IT creator's perspective, this would look pretty similar to
> > how things are now - they just check in a dockerfile. Someone running
> > the k8 scripts similarly doesn't need to think about it.
> >
> > Upsides:
> >
> > * We're not adding any additional complexity for the end developer
> >
> > Downsides:
> >
> > * Have to keep docker registry software up to date
> >
> > * The service is a single point of failure for any beam devs running IO
> > ITs
> >
> > * It can incur costs, etc… As an open source project, it doesn't seem
> > great for us to be running a public service.
> >
> >
> >
> > My thoughts on this
> >
> > ===============
> >
> > In spite of the additional complexity, I think using k8 helm is probably
> > the best option. The general goal behind the IO ITs has been to keep
> > ourselves self-contained and avoid having centralized infrastructure for
> > those running the ITs, and helm is a good match for that goal. I will
> > admit that I find the additional dependencies/complexity to be
> > worrisome. However, I really like the idea of picking up additional data
> > store configs for free - if we were doing this in 5 years, we'd say "we
> > should just use the ecosystem of helm charts" and go from there.
> >
> > I do think that pushing images to docker hub is a viable option, and if
> > the community is more excited to do that/wants to push the images there,
> > I'd support it. I can see how folks would be hesitant. As mentioned
> > above, I would like for the developer of the docker file to do the
> > ongoing maintenance of the image in that case.
> >
> > Of the 3 options, I would strongly push back against running a public
> > container registry - I would not want to administer it, and I don't think
> > we as a project want to be paying for the costs associated with it.
> >
> > Next steps
> >
> > =========
> >
> > Let me know what you think! This is definitely a topic where
> > understanding what the community of IO devs wants is helpful. As we
> > discuss, I'll probably spend a little time exploring helm since I want
> > to play around with it and understand if there are other drawbacks. I
> > ran into this question while working on getting the HIFIO cassandra
> > cluster running, so I might prototype with that.
> >
> > I'll create a JIRA for this in the next day or so.
> >
> > Stephen
> >
> >
> >
> > [0] docker registry container - https://hub.docker.com/_/registry/
> >
> > [1] kubernetes issue open for supporting templates -
> > https://github.com/kubernetes/kubernetes/issues/23896
> >
> > [2] set of available charts - https://github.com/kubernetes/charts
> >
> > [3] kubernetes helm introduction -
> > https://deis.com/blog/2015/introducing-helm-for-kubernetes/
> >
> > [4] kubernetes charts instructions -
> > https://github.com/kubernetes/helm/blob/master/docs/charts.md
> >
>
