Is this something that Apache infra could help us with?

On Mon, Apr 3, 2017 at 7:22 PM, Stephen Sisk <[email protected]> wrote:
> Summary:
>
> For IO ITs that use data stores that need custom docker images in order to
> run, we can't currently use them in a kubernetes cluster (which is where we
> host our data stores.) I have a couple options for how to solve this and am
> looking for feedback from folks involved in creating IO ITs/opinions on
> kubernetes.
>
>
> Details:
>
> We've discussed in the past that we'll want to allow developers to submit
> just a dockerfile, and then we'll use that when creating the data store on
> kubernetes. This is the case for ElasticsearchIO and I assume more data
> stores in the future will want to do this. It's also looking like it'll be
> necessary to use custom docker images for the HadoopInputFormatIO's
> cassandra ITs - to run a cassandra cluster, there doesn't seem to be a good
> image you can use out of the box.
>
> In either case, in order to retrieve a docker image, kubernetes needs a
> container registry - it will read the docker images from there. A simple
> private container registry doesn't work because kubernetes config files are
> static - this means that if local devs try to use the kubernetes files,
> they point at the private container registry and they wouldn't be able to
> retrieve the images since they don't have access. They'd have to manually
> edit the files, which in theory is an option, but I don't consider that to
> be acceptable since it feels pretty unfriendly (it is simple, so if we
> really don't like the below options we can revisit it.)
>
> Quick summary of the options
> =======================
>
> We can:
>
> * Start using something like k8 helm - this adds more dependencies, adds a
> small amount of complexity (this is my recommendation, but only by a
> little)
>
> * Start pushing images to docker hub - this means they'll be publicly
> visible and raises the bar for maintenance of those images
>
> * Host our own public container registry - this means running our own
> public service with costs, etc..
>
> Below are detailed discussions of these options. You can skip to the "My
> thoughts on this" section if you're not interested in the details.
>
>
> 1. Templated kubernetes images
> =========================
>
> Kubernetes (k8) does not currently have built in support for parameterizing
> scripts - there's an issue open for this[1], but it doesn't seem to be
> very active.
>
> There are tools like Kubernetes helm that allow users to specify parameters
> when running their kubernetes scripts. They also enable a lot more (they're
> probably closer to a package manager like apt-get) - see this
> description[3] for an overview.
>
> I'm open to other options besides helm, but it seems to be the officially
> supported one.
>
> How the world would look using helm:
>
> * When developing an IO IT, someone (either the developer or one of us)
> would need to create a chart (the name for the helm script) - it's
> basically another set of config files but in theory is as simple as a
> couple metadata files plus a templatized version of a regular k8 script.
> This should be trivial compared to the task of creating a k8 script.
>
> * When creating an instance of a data store, the developer (or the beam CI
> server) would first build the docker image for the data store and push to
> their container registry, then run a command like
> `helm install -f mydb.yaml --set imageRepo=1.2.3.4`
>
> * When done running tests/developing/etc… the developer/beam CI server
> would run `helm delete -f mydb.yaml`
>
> Upsides:
>
> * Something like helm is pretty interesting - we talked about it as an
> upside and something we wanted to do when we talked about using kubernetes
>
> * We pick up a set of working kubernetes scripts this way.
> The full list is at [2], but some ones that stood out: mongodb, memcached,
> mysql, postgres, redis, elasticsearch (incubating), kafka (incubating),
> zookeeper (incubating) - this could speed development
>
> Downsides:
>
> * Adds an additional dependency to run our ITs (helm or another k8
> templating tool)
>
> * Requires people to build their own images and run a container registry if
> they don't already have one (it will not surprise you that there's a docker
> image for running the registry [0] - so it's not crazy. :) I *think* this
> will probably just be a simple one/two line command once we have it
> scripted.
>
> * Helm in particular is kind of heavyweight for what we really need - it
> requires running a service in the k8 cluster and adds additional
> complexity.
>
> * Adds to the complexity of creating a new kubernetes script. Until I've
> tried it, I can't really speak to the complexity, but taking a look at the
> instructions [4], it doesn't seem too bad.
>
>
> 2. Push images to docker hub
> =======================
>
> This requires that users push images that we want to use to docker hub, and
> then our IO ITs will rely on that. I think the developer of the dockerfile
> should be responsible for the image - having the beam project responsible
> for a publicly available artifact (like the docker images) outside of our
> core deliverables doesn't seem like the right move.
>
> We would still retain a copy of the source dockerfiles and could regenerate
> the images at any time, so I'm not concerned about a scenario where docker
> hub went away (it would be pretty simple to switch to another repo - just
> change some config files.)
>
> For someone running the k8 scripts (ie, running the IO ITs), this is pretty
> easy - they just run the k8 script like they do today.
>
> For someone creating the k8 scripts (ie, creating the IO ITs), this is more
> complex - either they or we have to push this to docker hub and make sure
> it's up to date, etc..
>
>
> Upsides:
>
> * No additional complexity for IO IT runners.
>
> Downsides:
>
> * Higher bar for creating the image in the first place - someone has to
> maintain the publicly available docker hub image.
>
> * It seems weird to have a custom docker image up on docker hub - maybe
> that's common, but if we need specific changes to images for our needs, I'd
> prefer it be private.
>
>
> 3. Run our own *public* container registry
> ==============================================
>
> We would run a beam-specific container registry service - it would be used
> by the apache beam CI servers, but it would also be available for use by
> anyone running beam IO ITs on their local dev setup.
>
> From an IO IT creator's perspective, this would look pretty similar to how
> things are now - they just check in a dockerfile. For someone running the
> k8 scripts, they similarly don't need to think about it.
>
> Upsides:
>
> * We're not adding any additional complexity for the end developer
>
> Downsides:
>
> * Have to keep docker registry software up to date
>
> * The service is a single point of failure for any beam devs running IO ITs
>
> * It can incur costs, etc… As an open source project, it doesn't seem great
> for us to be running a public service.
>
>
>
> My thoughts on this
> ===============
>
> In spite of the additional complexity, I think using k8 helm is probably
> the best option. The general goal behind the IO ITs has been to keep
> ourselves self-contained: avoid having centralized infrastructure for those
> running the ITs. Helm is a good match for those criteria. I will admit that
> I find the additional dependencies/complexity to be worrisome. However, I
> really like the idea of picking up additional data store configs for free -
> if we were doing this in 5 years, we'd say "we should just use the
> ecosystem of helm charts" and go from there.
>
> I do think that pushing images to docker hub is a viable option, and if the
> community is more excited to do that/wants to push the images there, I'd
> support it. I can see how folks would be hesitant. I would like for the
> developer of the docker file to do
>
> Of the 3 options, I would strongly push back against running a public
> container registry - I would not want to administer it, and I don't think
> we as a project want to be paying for the costs associated with it.
>
> Next steps
> =========
>
> Let me know what you think! This is definitely a topic where understanding
> what the community of IO devs wants is helpful. As we discuss, I'll
> probably spend a little time exploring helm since I want to play around
> with it and understand if there are other drawbacks. I ran into this
> question while working on getting the HIFIO cassandra cluster running, so I
> might prototype with that.
>
> I'll create a JIRA for this in the next day or so.
>
> Stephen
>
>
> [0] docker registry container - https://hub.docker.com/_/registry/
> [1] kubernetes issue open for supporting templates -
> https://github.com/kubernetes/kubernetes/issues/23896
> [2] set of available charts - https://github.com/kubernetes/charts
> [3] kubernetes helm introduction -
> https://deis.com/blog/2015/introducing-helm-for-kubernetes/
> [4] kubernetes charts instructions -
> https://github.com/kubernetes/helm/blob/master/docs/charts.md
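
For anyone skimming the thread, here is a rough sketch of the problem the templating option is meant to solve: the image registry is baked into a static kubernetes config file, so it has to become a parameter. This is not helm and not anything from the Beam repo; it only uses Python's standard string.Template to show the idea. The manifest fields, the render_manifest helper and the "cassandra-it" image name are all made up for illustration; only the 1.2.3.4 registry address comes from the example command in the quoted mail.

```python
from string import Template

# Hypothetical pod manifest with the registry left as a placeholder.
# The field values are illustrative only, not taken from any Beam config.
MANIFEST_TEMPLATE = Template("""\
apiVersion: v1
kind: Pod
metadata:
  name: $name
spec:
  containers:
  - name: $name
    image: $image_repo/$image_name:$image_tag
""")


def render_manifest(name, image_repo, image_name, image_tag="latest"):
    """Return the manifest text with the registry-specific parts filled in."""
    return MANIFEST_TEMPLATE.substitute(
        name=name,
        image_repo=image_repo,
        image_name=image_name,
        image_tag=image_tag,
    )


if __name__ == "__main__":
    # Each developer (or the CI server) passes their own registry address.
    print(render_manifest("mydb", "1.2.3.4", "cassandra-it"))
```

A tool like helm does the same substitution, plus packaging and release management, which is where the extra dependency and complexity discussed above come from.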
