Thanks for the discussion! In general, I agree with the sentiments expressed here. I updated https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.hlirex1vus1a to reflect this discussion. (I still plan to put that on the website.)

Apache Docker Repository - are you talking about https://hub.docker.com/u/apache/ ? If not, can you point me at more info? I couldn't find anything about this on the publicly visible apache-infra mailing lists, and the Apache infra website doesn't seem to mention a docker repository.

> However the current Beam Elasticsearch IO does not support Elasticsearch 5, and elastic does not have an image for version 2, so in this particular case following the priority order we should use the official docker image (2) for the tests (assuming that both require the same version). Do you agree with this ?

Yup, that makes sense to me.

> How do we deal with IOs that require more than one base image, this is a
> common scenario for projects that depend on Zookeeper?

Is there a reason not to just run a kubernetes ReplicationController+Service for these cases? k8s can easily provide a stable hostname that pods can rely on to reach the zookeeper instance. It also uses text config (see https://github.com/apache/beam/tree/master/sdks/java/io/jdbc/src/test/resources/kubernetes) and sets up the connections/name service between the hosts: if other tests wanted to rely on postgres, they could just connect to host "postgres" and postgres is there.

Basically, I'm trying to keep the number of tools at a minimum while still providing good support for the functionality we need. Does docker-compose provide something beyond the functionality that k8s does? I'm not familiar with docker-compose, but looking at https://docs.docker.com/compose/overview/#compose-documentation it doesn't seem to provide anything that k8s doesn't already.

S

On Wed, Mar 15, 2017 at 7:10 AM Ismaël Mejía <ieme...@gmail.com> wrote:

Hi,

Thanks for bringing this subject to the mailing list.

+1 We definitely need a consensus on this, and I agree with your proposal and JB’s comments modulo certain clarifications:

I think we should go in this priority order, if the version of the image we want is available:

1. Image provided by the creator of the data source/sink, if they officially maintain it (this is the case for Elasticsearch, for example), or by the Apache project, if it provides one, as JB mentions.
2. Official docker images (because they have security fixes and guaranteed maintenance).
3. Non-official docker images or images from other providers that have good maintainers, e.g. quay.io

It makes sense to use the same image for all the tests, and to use the fixed versions supported by the respective IO to avoid possible issues during testing between different versions/naming of env variables, etc.

The Elasticsearch case is a 'good' example because it shows all the current issues:

We should not use one elasticsearch image (elk) for some tests and a different one for the others (the quay one), and if we resolve by priority we would take the image provided by the creator (1) for both cases.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html

However, the current Beam Elasticsearch IO does not support Elasticsearch 5, and elastic does not have an image for version 2, so in this particular case, following the priority order, we should use the official docker image (2) for the tests (assuming that both require the same version). Do you agree with this?

Thinking about the ELK image, I came up with a new question. How do we deal with IOs that require more than one base image? This is a common scenario for projects that depend on Zookeeper, e.g. Kafka/Solr. Usually people coordinate those with a docker-compose file that creates an artificial network to connect the Zookeeper image and the Kafka/Solr one just by executing the 'docker-compose up' command. Will we adopt this for such cases?

I know that Kubernetes does this too, but the docker-compose format is quite easy and textual, and it usually comes with the docker installation. Additionally, the docker-compose files can easily be translated with kompose into Kubernetes resources.

Ismaël

On Wed, Mar 15, 2017 at 3:17 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Stephen,
>
> 1. About the docker repositories, we now have an official Docker repo at
> Apache. So, for the Apache projects, I would recommend the Apache official
> repo. Anyway, generally speaking, I would recommend the official repo (from
> the projects).
>
> 2. To avoid "unpredictable" breaking changes, I would pin to a particular
> version, and explicitly update if needed.
>
> 3. It's better that docker images are under a unique responsibility scope
> as different IOs can use the same resources, so they should use the same
> provided docker.
>
> By the way, I also have a docker coming for RedisIO ;)
>
> Regards
> JB
>
>
> On 03/15/2017 08:01 AM, Stephen Sisk wrote:
>
>> hi!
>>
>> as part of doing the work to enable IO ITs, we decided we want to use
>> docker. As part of that, we need to run docker images and they'll probably
>> be pulled from a docker repository.
>>
>> Questions:
>> * What docker repositories (and users on docker hub) do we as a group
>> allow for images we'll run for hosted data stores?
>> -> My proposal is we should only use repositories/images that are
>> regularly updated and that have someone saying that the images we depend
>> on are secure. In the set of images currently linked to by checked in
>> code/in PR code, quay.io and official docker images seem fine. They both
>> have security scans (for what that's worth) and generally seem okay.
>>
>> * Do we pin to particular docker images or allow our version to float?
>> -> I have seen docker images change in insecure ways (e.g. switching the
>> name of the password parameter, meaning that the data store was secure
>> when set up, and became insecure because no password was set after the
>> image update), so I'd prefer to pin to particular versions, and update on
>> a periodic basis.
>>
>> I'm relatively new to docker best practices, so I'm open to suggestions on
>> this.
>>
>> Current ITs with docker images:
>> * Jdbc - https://hub.docker.com/_/postgres/ (official image)
>> * Elasticsearch - https://hub.docker.com/r/sebp/elk/ (semi-official
>> looking image)
>> * (PR in-flight
>> <https://github.com/apache/beam/pull/2193/files#diff-a630b5fff9aebc9e99a3f324c9cf75a9R52>)
>> HadoopInputFormat's elasticsearch and cassandra tests -
>> https://hub.docker.com/_/cassandra/ and
>> https://quay.io/repository/pires/docker-elasticsearch-kubernetes?tag=5.2.2&tab=tags
>> (official image, and image from quay.io, which provides security audits
>> of their images)
>>
>> The more I think about it, the less I'm excited about the sebp/elk image -
>> I'm sure it's fine, but I'd prefer using images from a source that we know
>> is trying to check for security problems.
>>
>> There's a secondary problem that we're using two different elasticsearch
>> images - I'd like to use only one image. I'll follow up on that -
>> https://issues.apache.org/jira/browse/BEAM-1644
>>
>> S
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com