Re: Docker image dependencies

2017-03-22 Thread Stephen Sisk
hey Ismael,
I appreciate you asking questions to make sure we're doing the right things
to help developers out and making sure it's easy to add more IO ITs.

I strongly agree that we need to make sure we have documentation that
clearly lays out how to get it working.

The setup & teardown scripts for postgres in jdbc's
src/test/resources/kubernetes directory *should* work on a vanilla
kubernetes cluster (it's how I set them up) - I deliberately did not do
anything fancy when creating my kubernetes cluster. Probably the only thing
that I know of that might be tricky is that the script is currently set up
so that it only exposes the postgres service on Node (VM) IPs - that probably
needs documentation on how to use it with the tests. (Basically, you should
be able to take the IP address of any of the k8 VMs and use that as the IP
address of postgres - k8 will proxy that over to the correct container.)
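
A minimal sketch of that last step, not taken from the Beam scripts: it
assumes the postgres service is exposed as a NodePort, and the function
names, sample addresses and port 30432 are all hypothetical. The
items[].status.addresses[] layout is what `kubectl get nodes -o json` prints.

```python
import json


def node_ips(nodes_json, address_type="ExternalIP"):
    """Return node addresses of one type from `kubectl get nodes -o json` output."""
    nodes = json.loads(nodes_json)
    return [
        addr["address"]
        for node in nodes.get("items", [])
        for addr in node.get("status", {}).get("addresses", [])
        if addr.get("type") == address_type
    ]


def postgres_jdbc_url(node_ip, node_port, database="postgres"):
    """Build a PostgreSQL JDBC URL pointing at a node IP and NodePort."""
    return f"jdbc:postgresql://{node_ip}:{node_port}/{database}"


# Hypothetical sample, shaped like the kubectl output (addresses are made up).
SAMPLE = json.dumps({
    "items": [
        {"status": {"addresses": [
            {"type": "InternalIP", "address": "10.0.0.4"},
            {"type": "ExternalIP", "address": "203.0.113.10"},
        ]}},
        {"status": {"addresses": [
            {"type": "InternalIP", "address": "10.0.0.5"},
            {"type": "ExternalIP", "address": "203.0.113.11"},
        ]}},
    ]
})

if __name__ == "__main__":
    # Any node IP should do, since k8 proxies to the right container.
    ip = node_ips(SAMPLE)[0]
    print(postgres_jdbc_url(ip, 30432))  # 30432 is an assumed NodePort
```

On a real cluster the JSON would come from kubectl rather than SAMPLE, and
the port from the service definition.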

I added a few rough notes here in the testing doc:
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.l06g9u1ejw4l


I'm definitely interested to hear what questions you have.

S

On Wed, Mar 22, 2017 at 3:56 PM Ismaël Mejía  wrote:

> You have really good points, I agree 100%, docker is easier if it is
> local, once we talk about distributions all of them have their
> pros/cons. I don’t intend to reopen the discussion and of course it
> would be silly to go back and remake all the work you have already
> done.
>
> We already agreed on kubernetes and this is it. My point in mentioning
> docker-compose was more that we need to make the life of IT test
> contributors easier, and maybe adding an extra tool is not the way,
> but at least we will need better documentation or references to help
> developers bootstrap their Kubernetes so they can contribute and
> validate the tests on their own.
>
> On Wed, Mar 22, 2017 at 12:14 AM, Stephen Sisk wrote:
> > Hey Ismael,
> >
> > I definitely agree with you that we want something that developers will
> > actually be able to/want to use.
> >
> > In my experience *all* the container orchestration engines are
> > non-trivial to set up. When I started examining solutions for beam
> > hosting, I did installs of mesos, kubernetes and docker. Docker is easier
> > in the "run only on my local machine" case if devs have it set up, but to
> > do anything interesting (i.e., interact with machines that aren't already
> > yours), they all involve work to get them set up on each machine you want
> > to use[4].
> >
> > Kubernetes has some options that make it extremely simple to set up - both
> > AWS[2] and GCE[3] seem to be straightforward to set up for simple dev
> > clusters, with scripts to automate the process (I'm assuming docker has
> > similar setups.)
> >
> > Once kubernetes is set up, it's also a simple yaml file + command to set
> > up multiple machines. The kubernetes setup for postgres[5] shows a simple
> > one machine example, and the kubernetes setups for HIFIO[6] show
> > multi-machine examples.
> >
> > We've spent a lot of time discussing the various options - when we talked
> > about this earlier [1] we decided we would move forward with
> > investigating kubernetes, so that's what I used for the IO ITs work I've
> > been doing, which we've now gotten working.
> >
> > Do you feel the advantages of docker are such that we should re-open the
> > discussion and potentially re-do the work we've done so far to get k8
> > working?
> >
> > I took a genuine look at docker earlier in the process and it didn't seem
> > like it was better than the other options in any dimensions (other than
> > "developers usually have it installed already"), and kubernetes/mesos
> > seemed to be more stable/have more of the features discussed in [1].
> > Perhaps that's changed?
> >
> > I think we are just starting to use container orchestration engines, and
> > so while I don't want to throw away the work we've done so far, I also
> > don't want to have to redo it later for reasons we could have known about
> > now. :)
> >
> > S
> >
> > [1] https://lists.apache.org/thread.html/9fd3c51cb679706efa4d0df2111a6ac438b851818b639aba644607af@%3Cdev.beam.apache.org%3E
> > [2] k8 AWS - https://kubernetes.io/docs/getting-started-guides/aws/
> > [3] k8 GKE - https://cloud.google.com/container-engine/docs/quickstart
> > or https://kubernetes.io/docs/getting-started-guides/gce/
> > [4] docker swarm on GCE - https://rominirani.com/docker-swarm-on-google-compute-engine-364765b400ed#.gzvruzis9
> > [5] postgres k8 script - https://github.com/apache/beam/tree/master/sdks/java/io/jdbc/src/test/resources/kubernetes
> > [6] https://github.com/diptikul/incubator-beam/tree/HIFIO-CS-ES/sdks/java/io/hadoop/jdk1.8-tests/src/test/resources/kubernetes
> >
> > On Mon, Mar 20, 2017 at 3:25 PM Ismaël Mejía  wrote:
> >
> > I have somehow forgotten this one.
> >
> >> Basically - 

Re: Docker image dependencies

2017-03-15 Thread Stephen Sisk
thanks for the discussion! In general, I agree with the sentiments
expressed here. I updated
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.hlirex1vus1a
to
reflect this discussion. (The plan is still that I will put that on the
website.)

Apache Docker Repository - are you talking about
https://hub.docker.com/u/apache/ ? If not, can you point me at more info? I
can't seem to find anything about this on the publicly visible apache-infra
mailing lists, and the apache infra website doesn't seem
to mention a docker repository.



> However the current Beam Elasticsearch IO does not support Elasticsearch
> 5, and elastic does not have an image for version 2, so in this particular
> case following the priority order we should use the official docker image
> (2) for the tests (assuming that both require the same version). Do you
> agree with this?

Yup, that makes sense to me.



> How do we deal with IOs that require more than one base image, this is a
> common scenario for projects that depend on Zookeeper?

Is there a reason not to just run a kubernetes ReplicationController+Service
for these cases? k8 can easily provide a stable hostname that pods can rely
on to reach the zookeeper instance. It also uses text config - see
https://github.com/apache/beam/tree/master/sdks/java/io/jdbc/src/test/resources/kubernetes,
and sets up the connections/nameservice between the hosts - if other tests
wanted to rely on postgres, they could just connect to host "postgres" and
postgres is there.
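
As a rough sketch of what that pairing could look like for zookeeper, the
snippet below builds the two manifests as plain dicts and prints them as a
JSON List, which kubectl accepts alongside YAML. The helper names are
hypothetical, the image tag zookeeper:3.4.9 is only an illustrative pinned
version, and resolving the service by its name assumes the cluster DNS
add-on is running and the client pod is in the same namespace.

```python
import json


def service_with_controller(name, image, port, replicas=1):
    """Return [ReplicationController, Service] manifests as plain dicts."""
    labels = {"name": name}
    controller = {
        "apiVersion": "v1",
        "kind": "ReplicationController",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": labels,
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": port}],
                    }]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        # The service name is the hostname other pods connect to.
        "metadata": {"name": name, "labels": labels},
        "spec": {"selector": labels, "ports": [{"port": port}]},
    }
    return [controller, service]


def as_kubectl_list(manifests):
    """Wrap manifests in a v1 List so a single JSON file holds them all."""
    wrapped = {"apiVersion": "v1", "kind": "List", "items": manifests}
    return json.dumps(wrapped, indent=2)


if __name__ == "__main__":
    # Image tag is an assumption for illustration, not a recommendation.
    manifests = service_with_controller("zookeeper", "zookeeper:3.4.9", 2181)
    print(as_kubectl_list(manifests))
```

A Kafka or Solr pod would then be configured with "zookeeper:2181" as its
zookeeper address, the same way a test would use host "postgres".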

Basically - I'm trying to keep the number of tools to a minimum while still
providing good support for the functionality we need. Does docker-compose
provide something beyond the functionality that k8 does? I'm not familiar
with docker-compose, but looking at
https://docs.docker.com/compose/overview/#compose-documentation it doesn't
seem to provide anything that k8 doesn't already.


S

On Wed, Mar 15, 2017 at 7:10 AM Ismaël Mejía  wrote:

Hi, Thanks for bringing this subject to the mailing list.

+1
We definitely need a consensus on this, and I agree with your proposal and
JB’s comments modulo certain clarifications:

I think we shall go in this priority order if the version of the image we
want is available:

1. Image provided by the creator of the data source/sink (if they
officially maintain it). (This is the case of Elasticsearch for example) or
the Apache projects (if they provide one) as JB mentions.
2. Official docker images (because they have security fixes and have
guaranteed maintenance).
3. Non-official docker images or images from other providers that have good
maintainers e.g. quay.io

It makes sense to use the same image for all the tests, and to use the
fixed versions supported by the respective IO to avoid possible issues
during testing between different versions/naming of env variables, etc.

The Elasticsearch case is a 'good' example because it shows all the current
issues:

We should not use one elasticsearch image (elk) for some tests and a
different one for the other (the quay one), and if we resolve by priority
we would take the image provided by the creator (1) for both cases.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
However the current Beam Elasticsearch IO does not support Elasticsearch 5,
and elastic does not have an image for version 2, so in this particular
case following the priority order we should use the official docker image
(2) for the tests (assuming that both require the same version).
Do you agree with this?


Thinking about the ELK image I came up with a new question. How do we deal
with IOs that require more than one base image? This is a common scenario
for projects that depend on Zookeeper, e.g. Kafka/Solr. Usually people
coordinate those with a docker-compose file that creates an artificial
network to connect the Zookeeper image and the Kafka/Solr one,
just by executing the 'docker-compose up' command.
Will we adopt this for such cases?

I know that Kubernetes does this too, but the docker-compose format is
quite easy and textual, and it usually comes with the docker installation.
Additionally, the docker-compose files can easily be translated with kompose
into Kubernetes resources.

Ismaël

On Wed, Mar 15, 2017 at 3:17 AM, Jean-Baptiste Onofré 
wrote:

> Hi Stephen,
>
> 1. About the docker repositories, we now have an official Docker repo at
> Apache. So, for the Apache projects, I would recommend the Apache official
> repo. Anyway, generally speaking, I would recommend the official repo
> (from the projects).
>
> 2. To avoid "unpredictable" breaking changes, I would pin to a particular
> version, and explicitly update if needed.
>
> 3. It's better that docker images are under a single responsibility scope,
> as different IOs can use the same resources, so they should use the same
> provided docker image.
>
> By the way, I also have a docker coming for RedisIO ;)
>
> Regards
> 

Re: Docker image dependencies

2017-03-15 Thread Ismaël Mejía
Hi, Thanks for bringing this subject to the mailing list.

+1
We definitely need a consensus on this, and I agree with your proposal and
JB’s comments modulo certain clarifications:

I think we shall go in this priority order if the version of the image we
want is available:

1. Image provided by the creator of the data source/sink (if they
officially maintain it). (This is the case of Elasticsearch for example) or
the Apache projects (if they provide one) as JB mentions.
2. Official docker images (because they have security fixes and have
guaranteed maintenance).
3. Non-official docker images or images from other providers that have good
maintainers e.g. quay.io
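
The priority order can be read as a small selection rule: drop the
candidates that lack the wanted version, then take the one with the best
source. A minimal sketch of that rule follows; the Candidate type, the
constants and the version sets are hypothetical and only mirror the
Elasticsearch situation discussed in this thread.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Lower number = higher priority, mirroring the numbered list above.
CREATOR, OFFICIAL, THIRD_PARTY = 1, 2, 3


@dataclass(frozen=True)
class Candidate:
    image: str            # repository name
    source: int           # CREATOR, OFFICIAL or THIRD_PARTY
    versions: frozenset   # versions this repository offers


def choose_image(candidates: Sequence[Candidate], wanted: str) -> Optional[Candidate]:
    """Pick the highest-priority candidate that offers the wanted version."""
    usable = [c for c in candidates if wanted in c.versions]
    return min(usable, key=lambda c: c.source, default=None)


# Illustrative data only: the version sets are assumptions, not a survey.
ELASTICSEARCH = [
    Candidate("docker.elastic.co/elasticsearch/elasticsearch", CREATOR,
              frozenset({"5.2.2"})),
    Candidate("elasticsearch", OFFICIAL, frozenset({"2.4.4", "5.2.2"})),
    Candidate("quay.io/pires/docker-elasticsearch-kubernetes", THIRD_PARTY,
              frozenset({"5.2.2"})),
]

if __name__ == "__main__":
    picked = choose_image(ELASTICSEARCH, "2.4.4")
    print(picked.image if picked else "no image offers that version")
```

With these sample sets, asking for a 2.x version falls through to the
official image because the creator's repository does not offer it.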

It makes sense to use the same image for all the tests, and to use the
fixed versions supported by the respective IO to avoid possible issues
during testing between different versions/naming of env variables, etc.

The Elasticsearch case is a 'good' example because it shows all the current
issues:

We should not use one elasticsearch image (elk) for some tests and a
different one for the other (the quay one), and if we resolve by priority
we would take the image provided by the creator (1) for both cases.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
However the current Beam Elasticsearch IO does not support Elasticsearch 5,
and elastic does not have an image for version 2, so in this particular
case following the priority order we should use the official docker image
(2) for the tests (assuming that both require the same version).
Do you agree with this?


Thinking about the ELK image I came up with a new question. How do we deal
with IOs that require more than one base image? This is a common scenario
for projects that depend on Zookeeper, e.g. Kafka/Solr. Usually people
coordinate those with a docker-compose file that creates an artificial
network to connect the Zookeeper image and the Kafka/Solr one,
just by executing the 'docker-compose up' command.
Will we adopt this for such cases?

I know that Kubernetes does this too, but the docker-compose format is
quite easy and textual, and it usually comes with the docker installation.
Additionally, the docker-compose files can easily be translated with kompose
into Kubernetes resources.

Ismaël

On Wed, Mar 15, 2017 at 3:17 AM, Jean-Baptiste Onofré 
wrote:

> Hi Stephen,
>
> 1. About the docker repositories, we now have an official Docker repo at
> Apache. So, for the Apache projects, I would recommend the Apache official
> repo. Anyway, generally speaking, I would recommend the official repo (from
> the projects).
>
> 2. To avoid "unpredictable" breaking changes, I would pin to a particular
> version, and explicitly update if needed.
>
> 3. It's better that docker images are under a single responsibility scope,
> as different IOs can use the same resources, so they should use the same
> provided docker image.
>
> By the way, I also have a docker coming for RedisIO ;)
>
> Regards
> JB
>
>
> On 03/15/2017 08:01 AM, Stephen Sisk wrote:
>
>> hi!
>>
>> as part of doing the work to enable IO ITs, we decided we want to use
>> docker. As part of that, we need to run docker images and they'll probably
>> be pulled from a docker repository.
>>
>> Questions:
>> * What docker repositories (and users on docker hub) do we as a group
>> allow for images we'll run for hosted data stores?
>>  -> My proposal is we should only use repositories/images that are
>> regularly updated and that have someone saying that the images we depend
>> on are secure. In the set of images currently linked to by checked in
>> code/in PR code, quay.io and official docker images seem fine. They both
>> have security scans (for what that's worth) and generally seem okay.
>>
>> * Do we pin to particular docker images or allow our version to float?
>>  -> I have seen docker images change in insecure ways (e.g. switching the
>> name of the password parameter, meaning that the data store was secure
>> when set up, and became insecure because no password was set after the
>> image update), so I'd prefer to pin to particular versions, and update on
>> a periodic basis.
>>
>> I'm relatively new to docker best practices, so I'm open to suggestions on
>> this.
>>
>> Current ITs with docker images:
>> * Jdbc - https://hub.docker.com/_/postgres/  (official image)
>> * Elasticsearch - https://hub.docker.com/r/sebp/elk/ (semi-official
>> looking image)
>> * (PR in-flight
>> > ff9aebc9e99a3f324c9cf75a9R52>)
>> HadoopInputFormat's elasticsearch and cassandra tests -
>> https://hub.docker.com/_/cassandra/ and
>> https://quay.io/repository/pires/docker-elasticsearch-kubernetes?tag=5.2.2=tags
>> (official image, and image from quay.io, which provides security audits
>> of their images)
>>
>> The more I think about it, the less I'm excited about the sebp/elk image -
>> I'm sure it's fine, but I'd prefer using images from a source 

Re: Docker image dependencies

2017-03-14 Thread Jean-Baptiste Onofré

Hi Stephen,

1. About the docker repositories, we now have an official Docker repo at Apache.
So, for the Apache projects, I would recommend the Apache official repo. Anyway,
generally speaking, I would recommend the official repo (from the projects).


2. To avoid "unpredictable" breaking changes, I would pin to a particular
version, and explicitly update if needed.


3. It's better that docker images are under a single responsibility scope, as
different IOs can use the same resources, so they should use the same provided
docker image.


By the way, I also have a docker coming for RedisIO ;)

Regards
JB

On 03/15/2017 08:01 AM, Stephen Sisk wrote:

hi!

as part of doing the work to enable IO ITs, we decided we want to use
docker. As part of that, we need to run docker images and they'll probably
be pulled from a docker repository.

Questions:
* What docker repositories (and users on docker hub) do we as a group allow
for images we'll run for hosted data stores?
 -> My proposal is we should only use repositories/images that are
regularly updated and that have someone saying that the images we depend on
are secure. In the set of images currently linked to by checked in code/in
PR code, quay.io and official docker images seem fine. They both have
security scans (for what that's worth) and generally seem okay.

* Do we pin to particular docker images or allow our version to float?
 -> I have seen docker images change in insecure ways (e.g. switching the
name of the password parameter, meaning that the data store was secure when
set up, and became insecure because no password was set after the image
update), so I'd prefer to pin to particular versions, and update on a
periodic basis.

I'm relatively new to docker best practices, so I'm open to suggestions on
this.

Current ITs with docker images:
* Jdbc - https://hub.docker.com/_/postgres/  (official image)
* Elasticsearch - https://hub.docker.com/r/sebp/elk/ (semi-official looking
image)
* (PR in-flight) HadoopInputFormat's elasticsearch and cassandra tests -
https://hub.docker.com/_/cassandra/ and
https://quay.io/repository/pires/docker-elasticsearch-kubernetes?tag=5.2.2=tags
(official image, and image from quay.io, which provides security audits of
their images)

The more I think about it, the less I'm excited about the sebp/elk image -
I'm sure it's fine, but I'd prefer using images from a source that we know
is trying to check for security problems.

There's a secondary problem that we're using two different elasticsearch
images - I'd like to use only one image. I'll follow up on that -
https://issues.apache.org/jira/browse/BEAM-1644

S



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Docker image dependencies

2017-03-14 Thread Stephen Sisk
hi!

as part of doing the work to enable IO ITs, we decided we want to use
docker. As part of that, we need to run docker images and they'll probably
be pulled from a docker repository.

Questions:
* What docker repositories (and users on docker hub) do we as a group allow
for images we'll run for hosted data stores?
 -> My proposal is we should only use repositories/images that are
regularly updated and that have someone saying that the images we depend on
are secure. In the set of images currently linked to by checked in code/in
PR code, quay.io and official docker images seem fine. They both have
security scans (for what that's worth) and generally seem okay.

* Do we pin to particular docker images or allow our version to float?
 -> I have seen docker images change in insecure ways (e.g. switching the
name of the password parameter, meaning that the data store was secure when
set up, and became insecure because no password was set after the image
update), so I'd prefer to pin to particular versions, and update on a
periodic basis.

I'm relatively new to docker best practices, so I'm open to suggestions on
this.

Current ITs with docker images:
* Jdbc - https://hub.docker.com/_/postgres/  (official image)
* Elasticsearch - https://hub.docker.com/r/sebp/elk/ (semi-official looking
image)
* (PR in-flight) HadoopInputFormat's elasticsearch and cassandra tests -
https://hub.docker.com/_/cassandra/ and
https://quay.io/repository/pires/docker-elasticsearch-kubernetes?tag=5.2.2=tags
(official image, and image from quay.io, which provides security audits of
their images)

The more I think about it, the less I'm excited about the sebp/elk image -
I'm sure it's fine, but I'd prefer using images from a source that we know
is trying to check for security problems.

There's a secondary problem that we're using two different elasticsearch
images - I'd like to use only one image. I'll follow up on that -
https://issues.apache.org/jira/browse/BEAM-1644

S