Hi Ismaël,

FYI, we also test the IOs on small Spark and Flink clusters (not yet Apex): that's where I'm using Mesos/Marathon.

They're not large clusters, but the integration tests are performed (by hand) on real clusters.

We have already discussed with Stephen and Jason using Marathon JSON descriptors and Mesos Docker images bootstrapped by Jenkins for the itests.

Regards
JB

On 11/22/2016 04:58 PM, Ismaël Mejía wrote:
Hello,

@Stephen Thanks for your proposal, it is really interesting, and I would
really like to help with this. I have never played with Kubernetes, but this
seems like a really nice chance to do something useful with it.

We (at Talend) are testing most of the IOs using simple container images,
and in some particular cases ‘clusters’ of containers using docker-compose
(a little bit like Amit’s (2) proposal). It would be really nice to have
this at the Beam level, in particular to try to test more complex
semantics. I don’t know how programmable Kubernetes is, but it would be
great to achieve, for example, something like this:

Suppose we have a cluster of Cassandra or Kafka nodes: I would like to
have programmatic tests that simulate failures (e.g. kill a node) or
simulate a really slow node, to ensure that the IO behaves as expected in
the Beam pipeline for the given runner.
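
To make this concrete, here is a rough sketch of the kind of test I imagine
(everything below is hypothetical: the container name would come from
whatever setup scripts we end up with, and it assumes the Docker CLI is
reachable from the test host):

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch: kill one node of a containerized Cassandra cluster mid-pipeline. */
public class CassandraFailureInjectionTest {

  // Hypothetical container name, created by our setup scripts.
  private static final String NODE_TO_KILL = "cassandra-node-2";

  public void testWriteSurvivesNodeFailure() throws Exception {
    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();
    // Kill one node 30 seconds into the pipeline run.
    executor.schedule(() -> killContainer(NODE_TO_KILL), 30, TimeUnit.SECONDS);

    // ... run the Beam pipeline that writes to Cassandra here, then assert
    // that every record arrived despite the node failure ...

    executor.shutdown();
  }

  private static void killContainer(String name) {
    try {
      // Assumes the Docker CLI is available on the test host.
      new ProcessBuilder("docker", "kill", name).inheritIO().start().waitFor();
    } catch (IOException | InterruptedException e) {
      throw new RuntimeException("Could not kill container " + name, e);
    }
  }
}

A really slow node could be simulated the same way, e.g. by running tc/netem
inside the container to add network latency instead of killing it.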

Another related idea is to improve IO consistency: today the different IOs
have small differences in their failure behavior, and I would really like
to be able to predict with more precision what will happen in case of
errors. For example, what is the correct behavior if I am writing to a
Kafka node and there is a network partition? Does the Kafka sink retry or
not? And what about JdbcIO, will it behave the same way, e.g. assuming
checkpointing? Or do we somehow guarantee exactly-once writes? Today I am
not sure what happens (or whether the expected behavior depends on the
runner), but maybe it is just that I don’t know and we already have tests
to ensure this.

Of course both are really hard problems, but I think with your proposal we
can try to tackle them, as well as the performance ones. And apart from the
data stores, I think it would also be really nice to be able to test the
runners in a distributed manner.

So what is the next step? How do you imagine such integration tests? Who
can provide the test machines so we can set up the cluster?

Maybe my ideas are a bit too far-reaching for an initial setup, but it
would be really nice to start working on this.

Ismaël


On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <amitsel...@gmail.com> wrote:

Hi Stephen,

I was wondering how we plan to use the data stores across executions.

Clearly, it's best to set up a new instance (container) for every test,
running a "standalone" store (say HBase or Cassandra, for example), and once
the test is done, tear down the instance. It should also be agnostic to the
runtime environment (e.g., Docker on Kubernetes).
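
To illustrate, a minimal sketch of that lifecycle with plain docker commands
(the image name is just a placeholder; a shared JUnit rule would make this
reusable across IOs):

import java.util.Scanner;
import org.junit.AfterClass;
import org.junit.BeforeClass;

/** Sketch: one throwaway store container per test class. */
public class CassandraIOIT {

  private static String containerId;

  @BeforeClass
  public static void startStore() throws Exception {
    // Start a standalone store in a fresh container; -P publishes its ports.
    Process p =
        new ProcessBuilder("docker", "run", "-d", "-P", "cassandra:3").start();
    p.waitFor();
    try (Scanner out = new Scanner(p.getInputStream())) {
      containerId = out.next(); // `docker run -d` prints the container id
    }
  }

  @AfterClass
  public static void stopStore() throws Exception {
    // Tear the instance down so every run starts from a clean slate.
    new ProcessBuilder("docker", "rm", "-f", containerId).start().waitFor();
  }

  // ... test methods run the IO against the container's published ports ...
}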
I'm wondering, though, what the overhead of managing such a deployment
would be; it could become heavy and complicated as more IOs are supported
and more test cases are introduced.

Another way to go would be to have small clusters of different data stores
and run against new "namespaces" (while lazily evicting old ones), but I
think this is less likely, as maintaining a distributed instance (even a
small one) of each data store sounds even more complex.

A third approach would be to simply have an "embedded" in-memory instance
of a data store as part of the test that runs against it (such as an
embedded Kafka, though that is not exactly a data store).
This is probably the simplest solution in terms of orchestration, but it
depends on having a proper "embedded" implementation for each IO.
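
For a JDBC-based IO, for example, this could be as light as an in-memory
Derby database (a sketch, assuming Derby is on the test classpath) - the
store lives and dies inside the test JVM, so there is nothing to
orchestrate:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Sketch: an "embedded" data store instance, in-process and in-memory. */
public class EmbeddedJdbcIOTest {

  public void testWrite() throws Exception {
    // Creates a throwaway in-memory database inside this JVM.
    try (Connection conn = DriverManager.getConnection(
            "jdbc:derby:memory:iotest;create=true");
        Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE TABLE beam_test (id INT, content VARCHAR(64))");
      // ... run the pipeline writing through the IO against this JDBC URL,
      // then SELECT from beam_test and assert on the results ...
    }
  }
}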

Does this make sense to you? Have you considered it?

Thanks,
Amit

On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Hi Stephen,

As we already discussed a bit, it sounds great! I especially like it as
both an integration test platform and good coverage for the IOs.

I'm very late on this, but as I said, I will share my Marathon JSON and
Mesos Docker images with you.

By the way, I started to experiment a bit with Kubernetes and Swarm, but
it's not yet complete. I will share what I have on the same GitHub repo.

Thanks !
Regards
JB

On 11/16/2016 11:36 PM, Stephen Sisk wrote:
Hi everyone!

Currently we have a good set of unit tests for our IO Transforms - those
tend to run against in-memory versions of the data stores. However, we'd
like to further increase our test coverage to include running them against
real instances of the data stores that the IO Transforms work against
(e.g. Cassandra, MongoDB, Kafka, etc.), which means we'll need to have real
instances of various data stores.

Additionally, if we want to do performance regression detection, it's
important to have instances of the services that behave realistically,
which isn't true of in-memory or dev versions of the services.


Proposed solution
-------------------------
If we accept this proposal, we would create an infrastructure for running
real instances of data stores inside of containers, using container
management software like Mesos/Marathon, Kubernetes, Docker Swarm, etc. to
manage the instances.

This would enable us to build both integration tests and performance tests
that run against those real instances (like the performance tests Jason
Kuster is proposing elsewhere).
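
As a purely illustrative sketch of the shape this could take with
Kubernetes, tests could drive checked-in manifests through a thin helper
(the manifest path and the kubectl calls here are assumptions, not a
worked-out design):

import java.io.IOException;

/** Sketch: integration tests drive checked-in manifests through kubectl. */
public class DataStoreCluster {

  // A checked-in manifest describing the data store, e.g. a Cassandra
  // deployment + service (hypothetical path like src/test/kubernetes/cassandra.yaml).
  private final String manifest;

  public DataStoreCluster(String manifest) {
    this.manifest = manifest;
  }

  public void create() throws IOException, InterruptedException {
    run("kubectl", "apply", "-f", manifest);
  }

  public void delete() throws IOException, InterruptedException {
    run("kubectl", "delete", "-f", manifest);
  }

  private static void run(String... cmd) throws IOException, InterruptedException {
    int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
    if (exit != 0) {
      throw new IOException("command failed: " + String.join(" ", cmd));
    }
  }
}

An equivalent helper could call the Marathon REST API or docker swarm
instead; the point is that creating and destroying a store becomes a single
scripted call that the test framework owns.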


Why do we need one centralized set of instances vs. just having various people host their own instances?
-------------------------
Reducing flakiness of tests is key. By not having dependencies from the
core project on external services/instances of data stores, we have
guaranteed access to the services, and the group can fix issues that arise.

An exception would be something that has an ops team supporting it (e.g.,
AWS, Google Cloud or another professionally managed service) - those we
trust will be stable.


There may be a lot of different data stores needed - how will we maintain them?
-------------------------
It will take work above and beyond that of a normal set of unit tests to
build and maintain integration/performance tests & their data store
instances.

Setup & maintenance of the data store containers and the data store
instances on them must be automated. The setup also has to be as simple as
possible, and we should avoid hand-tweaking the containers - checked-in
scripts/dockerfiles are key.

Aligned with the community ownership approach of Apache, as members of the
community are excited to contribute to & maintain the
integration/performance tests and their data store instances, people will
be able to step up and do that. If there is no longer support for
maintaining a particular set of integration & performance tests and their
data store instances, then we can disable those tests. We may document on
the website which IO Transforms have current integration/performance tests
so users know what level of testing the various IO Transforms have.


What about requirements for the container management software itself?
-------------------------
* We should have the data store instances themselves in Docker. Docker
allows new instances to be spun up in a quick, reproducible way and is
fairly platform independent. It has wide support from a variety of
different container management services.
* As little admin work required as possible. Crashing instances should be
restarted, setup should be simple, and everything possible should be
scripted/scriptable.
* Logs and test output should be on a publicly available website, without
needing to log into the test execution machine. Centralized capture of
monitoring info/logs from instances running in the containers would support
this. Ideally, this would just be supported by the container software out
of the box.
* It'd be useful to have good persistent volume support in the container
management software so that databases don't have to reload large data sets
every time.
* The containers may be a place to execute runners themselves if we need
larger runner instances, so it should play well with Spark, Flink, etc.

As I discussed earlier on the mailing list, it looks like hosting Docker
containers on Kubernetes, Docker Swarm, or Mesos+Marathon would be a good
solution.

Thanks,
Stephen Sisk


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
