Hey Ismael,

I definitely agree with you that we want something that developers will
actually be able to/want to use.

In my experience *all* the container orchestration engines are non-trivial
to set up. When I started examining solutions for beam hosting, I did
installs of mesos, kubernetes and docker. Docker is easier in the "run only
on my local machine" case if devs have it set up, but to do anything
interesting (i.e., interact with machines that aren't already yours), they
all involve work to get them set up on each machine you want to use[4].

Kubernetes has some options that make it extremely simple to set up - both
AWS[2] and GCE[3] seem to be straightforward to set up for simple dev
clusters, with scripts to automate the process (I'm assuming docker has
similar setups.)

Once kubernetes is set up, it's also a simple yaml file + command to set up
multiple machines. The kubernetes setup for postgres[5] shows a simple one
machine example, and the kubernetes setups for HIFIO[6] show multi-machine
examples.
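
As a rough illustration of the "file + command" flow (a sketch only, not
taken from the linked scripts): the snippet below builds a one-machine
postgres setup as a Service plus a ReplicationController and prints it as
JSON, which kubectl accepts in place of yaml. The function name, the label,
and the pinned tag postgres:9.6.2 are assumptions made for the example.

```python
import json


def postgres_manifests(image="postgres:9.6.2"):
    """Build a Service and a ReplicationController for a single postgres pod.

    The image tag is pinned (an assumed version) rather than left floating.
    """
    labels = {"name": "postgres"}
    # The Service gives other pods a stable hostname, "postgres".
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "postgres", "labels": labels},
        "spec": {
            "ports": [{"port": 5432, "targetPort": 5432}],
            "selector": labels,
        },
    }
    # The ReplicationController keeps exactly one postgres pod running.
    controller = {
        "apiVersion": "v1",
        "kind": "ReplicationController",
        "metadata": {"name": "postgres"},
        "spec": {
            "replicas": 1,
            "selector": labels,
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "postgres",
                            "image": image,
                            "ports": [{"containerPort": 5432}],
                        }
                    ]
                },
            },
        },
    }
    # Wrap both objects in a List so a single file holds the whole setup.
    return {"apiVersion": "v1", "kind": "List", "items": [service, controller]}


if __name__ == "__main__":
    print(json.dumps(postgres_manifests(), indent=2))
```

Saving the output to a file (say postgres.json, a hypothetical name) and
running "kubectl create -f postgres.json" would be the single command.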

We've spent a lot of time discussing the various options - when we talked
about this earlier [1] we decided we would move forward with investigating
kubernetes, so that's what I used for the IO ITs work I've been doing,
which we've now gotten working.

Do you feel the advantages of docker are such that we should re-open the
discussion and potentially re-do the work we've done so far to get k8
working?

I took a genuine look at docker earlier in the process and it didn't seem
like it was better than the other options in any dimensions (other than
"developers usually have it installed already"), and kubernetes/mesos
seemed to be more stable/have more of the features discussed in [1].
Perhaps that's changed?

I think we are just starting to use container orchestration engines, and so
while I don't want to throw away the work we've done so far, I also don't
want to have to redo it later for reasons we already know about now. :)



[2] k8 AWS - https://kubernetes.io/docs/getting-started-guides/aws/
[3] k8 GKE - https://cloud.google.com/container-engine/docs/quickstart or
[4] docker swarm on GCE -

[5] postgres k8 script -


On Mon, Mar 20, 2017 at 3:25 PM Ismaël Mejía <ieme...@apache.org> wrote:

I have somehow forgotten this one.

> Basically - I'm trying to keep number of tools at a minimum while still
> providing good support for the functionality we need. Does docker-compose
> provide something beyond the functionality that k8 does? I'm not familiar
> with docker-compose, but looking at
> https://docs.docker.com/compose/overview/#compose-documentation it doesn't
> seem to provide anything that k8 doesn't already.

I agree we should keep the set of tools minimal. I mentioned
docker-compose because its installation is trivial compared to
kubernetes (or even minikube for a local install). docker-compose does
not have any significant advantage over kubernetes apart from being
easier to install/use.

But well, better to be consistent and go all in on kubernetes. However,
we need to find a way to help IO authors bootstrap this, because in
my experience creating a cluster with docker-compose is a yaml
file + a command, and I'm not sure the basic installation and run of
kubernetes is that easy.


On Wed, Mar 15, 2017 at 8:09 PM, Stephen Sisk <s...@google.com.invalid>
wrote:
> thanks for the discussion! In general, I agree with the sentiments
> expressed here. I updated
> to
> reflect this discussion. (The plan is still that I will put that on the
> website.)
> Apache Docker Repository - are you talking about
> https://hub.docker.com/u/apache/ ? If not, can you point me at more info? I
> can't seem to find info about this on the publicly visible apache-infra
> mailing lists, and the apache infra website doesn't seem
> to mention a docker repository.
>> However the current Beam Elasticsearch IO does not support Elasticsearch
>> 5, and elastic does not have an image for version 2, so in this particular
>> case following the priority order we should use the official docker image
>> (2) for the tests (assuming that both require the same version). Do you
>> agree with this?
> Yup, that makes sense to me.
>> How do we deal with IOs that require more than one base image, this is a
>> common scenario for projects that depend on Zookeeper?
> Is there a reason not to just run a kubernetes ReplicationController+Service
> for these cases? k8 can easily support having a hostname that pods can
> on having the zookeeper instance. It also uses text config - see
> and sets up the connections/nameservice between the hosts - if other tests
> wanted to rely on postgres, it could just connect to host "postgres" and
> postgres is there.
> Basically - I'm trying to keep number of tools at a minimum while still
> providing good support for the functionality we need. Does docker-compose
> provide something beyond the functionality that k8 does? I'm not familiar
> with docker-compose, but looking at
> https://docs.docker.com/compose/overview/#compose-documentation it doesn't
> seem to provide anything that k8 doesn't already.
> S
> On Wed, Mar 15, 2017 at 7:10 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> Hi, Thanks for bringing this subject to the mailing list.
> +1
> We definitely need a consensus on this, and I agree with your proposal and
> JB’s comments modulo certain clarifications:
> I think we shall go in this priority order if the version of the image we
> want is available:
> 1. Image provided by the creator of the data source/sink (if they
> officially maintain it; this is the case for Elasticsearch, for example),
> or by the Apache projects (if they provide one), as JB mentions.
> 2. Official docker images (because they have security fixes and
> guaranteed maintenance).
> 3. Non-official docker images or images from other providers that have
> maintainers, e.g. quay.io.
> It makes sense to use the same image for all the tests, and to use the
> fixed versions supported by the respective IO to avoid possible issues
> during testing between different versions/naming of env variables, etc.
> The Elasticsearch case is a 'good' example because it shows all the
> issues:
> We should not use one elasticsearch image (elk) for some tests and a
> different one for the other (the quay one), and if we resolve by priority
> we would take the image provided by the creator (1) for both cases.
> However the current Beam Elasticsearch IO does not support Elasticsearch
> 5, and elastic does not have an image for version 2, so in this particular
> case following the priority order we should use the official docker image
> (2) for the tests (assuming that both require the same version).
> Do you agree with this?
> Thinking about the ELK image I came up with a new question. How do we deal
> with IOs that require more than one base image? This is a common scenario
> for projects that depend on Zookeeper, e.g. Kafka/Solr. Usually people
> coordinate those with a docker-compose file that creates an artificial
> network to connect the Zookeeper image and the Kafka/Solr one by
> just executing the 'docker-compose up' command. Will we adopt this for
> such cases?
> I know that Kubernetes does this too, but the docker-compose format is
> quite easy and textual, and it usually comes with the docker
> installation. Additionally, the docker-compose files can easily be
> translated with kompose into Kubernetes resources.
> Ismaël
> On Wed, Mar 15, 2017 at 3:17 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>> Hi Stephen,
>> 1. About the docker repositories, we now have an official Docker repo at
>> Apache. So, for the Apache projects, I would recommend the Apache
>> repo. Anyway, generally speaking, I would recommend the official repo
>> (from the projects).
>> 2. To avoid "unpredictable" breaking changes, I would pin to a particular
>> version, and explicitly update if needed.
>> 3. It's better that docker images are under a single responsibility,
>> as different IOs can use the same resources, so they should use the same
>> provided docker.
>> By the way, I also have a docker coming for RedisIO ;)
>> Regards
>> JB
>> On 03/15/2017 08:01 AM, Stephen Sisk wrote:
>>> hi!
>>> as part of doing the work to enable IO ITs, we decided we want to use
>>> docker. As part of that, we need to run docker images and they'll
>>> probably be pulled from a docker repository.
>>> Questions:
>>> * What docker repositories (and users on docker hub) do we as a group
>>> allow for images we'll run for hosted data stores?
>>>  -> My proposal is we should only use repositories/images that are
>>> regularly updated and that have someone saying that the images we depend
>>> on are secure. In the set of images currently linked to by checked in
>>> code/in PR code, quay.io and official docker images seem fine. They both
>>> have security scans (for what that's worth) and generally seem okay.
>>> * Do we pin to particular docker images or allow our version to float?
>>>  -> I have seen docker images change in insecure ways (e.g. switching the
>>> name of the password parameter, meaning that the data store was secure
>>> when set up, and became insecure because no password was set after the
>>> image update), so I'd prefer to pin to particular versions, and update on
>>> a periodic basis.
>>> I'm relatively new to docker best practices, so I'm open to suggestions
>>> on this.
>>> Current ITs with docker images:
>>> * Jdbc - https://hub.docker.com/_/postgres/  (official image)
>>> * Elasticsearch - https://hub.docker.com/r/sebp/elk/ (semi-official
>>> looking image)
>>> * (PR in-flight
>>> <https://github.com/apache/beam/pull/2193/files#diff-a630b5fff9aebc9e99a3f324c9cf75a9R52>)
>>> HadoopInputFormat's elasticsearch and cassandra tests -
>>> https://hub.docker.com/_/cassandra/ and
>>> https://quay.io/repository/pires/docker-elasticsearch-kubernetes?tag=5.2.2&tab=tags
>>> (official image, and image from quay.io, which provides security audits
>>> of their images)
>>> The more I think about it, the less I'm excited about the sebp/elk image -
>>> I'm sure it's fine, but I'd prefer using images from a source that we
>>> know is trying to check for security problems.
>>> There's a secondary problem that we're using two different elasticsearch
>>> images - I'd like to use only one image. I'll follow up on that -
>>> https://issues.apache.org/jira/browse/BEAM-1644
>>> S
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
