Yes, for both DC/OS (Mesos+Marathon) and Kubernetes, I think we may find single-node configs, but I'm not sure about multi-node setups. And even if we do find a multi-node configuration, I'm not sure it would cover our needs.

Regards
JB

On 01/18/2017 12:52 PM, Stephen Sisk wrote:
ah! I looked around a bit more and found the dcos package repo -
https://github.com/mesosphere/universe/tree/version-3.x/repo/packages

poking around a bit, I can find a lot of packages for single node
instances, but not many packages for multi-node instances. Single node
instance packages are kind of useful, but I don't think they're *too* helpful.
The multi-node instance packages that run the data store's high
availability mode are where the real work is, and it seems like both
kubernetes helm and dcos' package universe don't have a lot of those.

S

On Wed, Jan 18, 2017 at 9:56 AM Stephen Sisk <[email protected]> wrote:

Hi Ishmael,

these are good questions, thanks for raising them.

Ability to modify network/compute resources to simulate failures
=================================================
I see two real questions here:
1. Is this something we want to do?
2. Is it possible with both/either?

So far, the test strategy I've been advocating is that we test problems
like this in unit tests rather than do this in ITs/Perf tests. Otherwise,
it's hard to re-create the same conditions.

I can investigate whether it's possible, but I want to clarify whether
this is something that we care about. I know both support killing
individual nodes. I haven't seen a lot of network control in either, but
haven't tried to look for it.

Availability of ready to play packages
============================
I did look at this, and as far as I could tell, mesos didn't have any
pre-built packages for multi-node clusters of data stores. If there's a
good repository of them that we trust, that would definitely save us time.
Can you point me at the mesos repository?

S



On Wed, Jan 18, 2017 at 8:37 AM Jean-Baptiste Onofré <[email protected]>
wrote:

Hi Ismael

Stephen will reply with details but I know he did a comparison and
evaluated different options.

He tested with the JdbcIO itests.

Regards
JB

On Jan 18, 2017, at 08:26, "Ismaël Mejía" <[email protected]> wrote:
Thanks for your analysis Stephen, good arguments / references.

One quick question. Have you checked the APIs of both (Mesos/Kubernetes)
to see if we can programmatically do more complex tests (I suppose so, but
you don't mention how easy or whether those are possible), for example to
simulate a slow networking slave (to test stragglers), or to arbitrarily
kill one slave (e.g. if I want to test the correct behavior of a runner/IO
that is reading from it)?

Another missing point in the review is the availability of ready-to-play
packages; I think in this area mesos/dcos seems more advanced, no? I
haven't looked recently, but at least 6 months ago there were not many
helm packages ready, for example to test kafka or the hadoop ecosystem
stuff (hdfs, hbase, etc). Has this been improved? Because preparing this
is also a considerable amount of work; on the other hand this could also
be a chance to contribute to kubernetes.

Regards,
Ismaël



On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <[email protected]>
wrote:

hi!

I've been continuing this investigation, and have some more info to
report,
and hopefully we can start making some decisions.

To support performance testing, I've been investigating
mesos+marathon and
kubernetes for running data stores in their high availability mode. I
have
been examining features that kubernetes/mesos+marathon use to support
this.

Setting up a multi-node cluster in a high availability mode tends to
be
more expensive time-wise than the single node instances I've played
around
with in the past. Rather than do a full build out with both
kubernetes and
mesos, I'd like to pick one of the two options to build the prototype
cluster with. If the prototype doesn't go well, we could still go
back to
the other option, but I'd like to change us from a mode of "let's
look at
all the options" to one of "here's the favorite, let's prove that
works for
us".

Below are the features that I've seen are important to multi-node
instances
of data stores. I'm sure other folks on the list have done this
before, so
feel free to pipe up if I'm missing a good solution to a problem.

DNS/Discovery

--------------------

Necessary for talking between nodes (eg, cassandra nodes all need to
be
able to talk to a set of seed nodes.)

* Kubernetes has built-in DNS/discovery between nodes.

* Mesos supports this via mesos-dns, which isn't a part of core mesos,
but is in dcos, which is the mesos distribution I've been using and that I
would expect us to use. (A short connection sketch after this feature list
shows how tests could rely on these stable names.)

Instances properly distributed across nodes

------------------------------------------------------------

If multiple instances of a data source end up on the same underlying
VM, we
may not get good performance out of those instances since the
underlying VM
may be more taxed than other VMs.

* Kubernetes has a beta feature, StatefulSets [1], which allows containers
to be distributed so that there's one container per underlying machine (as
well as a lot of other useful features like easy, stable dns names.)

* Mesos can support this via the built-in UNIQUE constraint [2]

Load balancing

--------------------

Incoming requests from users need to be distributed to the various
machines
- this is important for many data stores' high availability modes.

* Kubernetes supports easily hooking up to an external load balancer
when
on a cloud (and can be configured to work with a built-in load
balancer if
not)

* Mesos supports this via marathon-lb [3], which is an install-able
package
in DC/OS

Persistent Volumes tied to specific instances

------------------------------------------------------------

Databases often need persistent state (for example to store the data
:), so
it's an important part of running our service.

* Kubernetes StatefulSets supports this

* Mesos+marathon apps with persistent volumes support this [4] [5]
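To make the DNS/discovery and stable-naming points above concrete (this is
the sketch referenced in the DNS/Discovery item): a minimal Java smoke test
that reaches a Cassandra cluster through the stable per-pod names a
Kubernetes StatefulSet plus headless service would give it. The service/pod
names (cassandra-0.cassandra, etc.) and the use of the DataStax Java driver
are assumptions for illustration only, not something we've set up yet.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraDiscoverySmokeTest {
  public static void main(String[] args) {
    // Assumed names: a StatefulSet named "cassandra" behind a headless service of
    // the same name gives pods stable DNS names like cassandra-0.cassandra.
    Cluster cluster = Cluster.builder()
        .addContactPoints("cassandra-0.cassandra", "cassandra-1.cassandra")
        .build();
    try (Session session = cluster.connect()) {
      // If discovery works, the driver can enumerate the rest of the cluster from the seeds.
      System.out.println("Connected to: " + cluster.getMetadata().getClusterName());
    } finally {
      cluster.close();
    }
  }
}
```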

As I mentioned above, I'd like to focus on either kubernetes or mesos
for
my investigation, and as I go further along, I'm seeing kubernetes as
better suited to our needs.

(1) It supports more of the features we want out of the box and with
StatefulSets, Kubernetes handles them all together neatly - eg. DC/OS
requires marathon-lb to be installed and mesos-dns to be configured.

(2) I'm also finding that there seem to be more examples of using
kubernetes to solve the types of problems we're working on. This is
somewhat subjective, but in my experience as I've tried to learn both
kubernetes and mesos, I personally found it generally easier to get
kubernetes running than mesos due to the tutorials/examples available
for
kubernetes.

(3) Lower cost of initial setup - as I discussed in a previous
mail[6],
kubernetes was far easier to get set up even when I knew the exact
steps.
Mesos took me around 27 steps [7], which involved a lot of config
that was
easy to get wrong (it took me about 5 tries to get all the steps
correct in
one go.) Kubernetes took me around 8 steps and very little config.

Given that, I'd like to focus my investigation/prototyping on
Kubernetes.
To
be clear, it's fairly close and I think both Mesos and Kubernetes
could
support what we need, so if we run into issues with kubernetes, Mesos
still
seems like a viable option that we could fall back to.

Thanks,
Stephen


[1] Kubernetes StatefulSets


https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/

[2] mesos unique constraint -
https://mesosphere.github.io/marathon/docs/constraints.html

[3] https://mesosphere.github.io/marathon/docs/service-discovery-load-balancing.html
and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/

[4]
https://mesosphere.github.io/marathon/docs/persistent-volumes.html

[5]
https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/

[6] Container Orchestration software for hosting data stores
https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E

[7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md


On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <[email protected]>
wrote:

Just a quick drive-by comment: how tests are laid out has
non-trivial
tradeoffs on how/where continuous integration runs, and how results
are
integrated into the tooling. The current state is certainly not
ideal
(e.g., due to multiple test executions some links in Jenkins point
where
they shouldn't), but most other alternatives had even bigger
drawbacks at
the time. If someone has great ideas that don't explode the number
of
modules, please share ;-)

On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot
<[email protected]>
wrote:

Hi Stephen,

Thanks for taking the time to comment.

My comments are below in the email:


On 24/12/2016 at 00:07, Stephen Sisk wrote:

hey Etienne -

thanks for your thoughts and thanks for sharing your
experiences. I
generally agree with what you're saying. Quick comments below:

IT are stored alongside with UT in src/test directory of the IO
but
they

might go to dedicated module, waiting for a consensus
I don't have a strong opinion or feel that I've worked enough
with
maven
to
understand all the consequences - I'd love for someone with more
maven
experience to weigh in. If this becomes blocking, I'd say check
it in,
and
we can refactor later if it proves problematic.

Sure, not a blocking point; it could be refactored afterwards. Just as a
reminder, JB mentioned that storing IT in a separate module allows more
coherence between all IT (same behavior) and makes cross-IO integration
tests possible. JB, have you experienced any long-term drawbacks of
storing IT in a separate module, like, for example, more difficult
maintenance due to the "distance" from production code?


  Also IMHO, it is better that tests load/clean data than making
assumptions about the running order of the tests.
I definitely agree that we don't want to make assumptions about
the
running
order of the tests - that way lies pain. :) It will be
interesting to
see
how the performance tests work out since they will need more
data (and
thus
loading data can take much longer.)

Yes, performance testing might push in the direction of data
loading
from
outside the tests due to loading time.

  This should also be an easier problem
for read tests than for write tests - if we have long running
instances,
read tests don't really need cleanup. And if write tests only
write a
small
amount of data, as long as we are sure we're writing to uniquely
identifiable locations (ie, new table per test or something
similar),
we
can clean up the write test data on a slower schedule.

I agree


this will tend to go to the direction of long running data store

instances rather than data store instances started (and
optionally
loaded)
before tests.
It may be easiest to start with a "data stores stay running"
implementation, and then if we see issues with that move towards
tests
that
start/stop the data stores on each run. One thing I'd like to
make
sure
is
that we're not manually tweaking the configurations for data
stores.
One
way we could do that is to destroy/recreate the data stores on a
slower
schedule - maybe once per week. That way if the script is
changed or
the
data store instances are changed, we'd be able to detect it
relatively
soon
while still removing the need for the tests to manage the data
stores.

I agree. In addition to manual configuration tweaking, there might be
cases in which a data store re-partitions data during a test or after
some tests as the dataset changes. The IO must be tolerant to that, but
the asserts in the tests (the number of bundles, for example) must not
fail in that case (a small assertion sketch follows below).
I would also prefer, if possible, that the tests do not manage the data
stores (not set them up, not start them, not stop them).
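To illustrate the re-partitioning point just mentioned, here is a minimal
sketch of asserting on the total element count rather than on how the
source happened to split into bundles. The read is stubbed with Create.of
so the sketch is self-contained; a real IT would apply the IO's read
transform against the running instance instead.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Test;

public class CountBasedAssertExample {

  private static final long EXPECTED_COUNT = 3L;

  // Stand-in for the IO under test; a real IT would apply the IO's read transform here.
  private static PCollection<String> readFromDataStore(Pipeline pipeline) {
    return pipeline.apply(Create.of(Arrays.asList("a", "b", "c")));
  }

  @Test
  public void totalCountIsStableAcrossRepartitioning() {
    Pipeline pipeline = TestPipeline.create();
    PCollection<String> records = readFromDataStore(pipeline);
    // Assert on the number of elements, not on bundle/split structure, so the test
    // still passes if the backing data store re-partitions between runs.
    PAssert.thatSingleton(records.apply(Count.<String>globally())).isEqualTo(EXPECTED_COUNT);
    pipeline.run();
  }
}
```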


as a general note, I suspect many of the folks in the states
will be
on
holiday until Jan 2nd/3rd.

S

On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot <[email protected]> wrote:

Hi,

Recently we had a discussion about integration tests of IOs. I'm
preparing a PR for integration tests of the Elasticsearch IO
(https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO
as a first shot) which are very important IMHO because they helped catch
some bugs that UT could not (volume, data store instance sharing, real
data store instance...)

I would like to have your thoughts/remarks about the points below. Some
of these points are also discussed here:
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a

- UT and IT have a similar architecture, but while UT focus on testing
the correct behavior of the code, including corner cases, and use an
embedded in-memory data store, IT assume that the behavior is correct
(strong UT) and focus on higher-volume testing and testing against real
data store instance(s)

- For now, IT are stored alongside the UT in the src/test directory of
the IO, but they might go to a dedicated module, waiting for a consensus.
Maven is not configured to run them automatically because the data store
is not available on the jenkins server yet

- For now, they only use the DirectRunner, but they will be run against
each runner.

- IT do not set up the data store instance (as stated in the above
document); they assume that one is already running (hardcoded
configuration in the test for now, waiting for a common solution to pass
configuration to IT). A docker container script is provided in the
contrib directory as a starting point for whatever orchestration software
will be chosen.

- IT load and clean test data before and after each test if needed. It
is simpler to do so because some tests need an empty data store (write
tests) and because, as discussed in the document, tests might not be the
only users of the data store. Also IMHO, it is better that tests
load/clean data than making assumptions about the running order of the
tests (a small sketch of this pattern follows below).

If we generalize this pattern to all IT tests, this will tend to go in
the direction of long-running data store instances rather than data
store instances started (and optionally loaded) before tests.

Besides, if we were to change our minds and load data from outside the
tests, a logstash script is provided.
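As a concrete illustration of the load/clean-before-and-after pattern
mentioned in the list above, here is a minimal JUnit sketch. It uses plain
JDBC for brevity (JB mentioned the JdbcIO itests earlier in the thread);
the URL, credentials, table name and system property are placeholders, and
in practice they would come from whatever common configuration mechanism
we agree on.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class JdbcLoadCleanExampleIT {

  // Placeholder connection details, overridable via a system property.
  private static final String JDBC_URL =
      System.getProperty("it.jdbcUrl", "jdbc:postgresql://localhost:5432/beam_it");

  private Connection connection;

  @Before
  public void loadTestData() throws Exception {
    connection = DriverManager.getConnection(JDBC_URL, "beam", "beam");
    try (Statement stmt = connection.createStatement()) {
      stmt.executeUpdate("CREATE TABLE IF NOT EXISTS it_data (id INT, name VARCHAR(100))");
      stmt.executeUpdate("INSERT INTO it_data VALUES (1, 'a'), (2, 'b'), (3, 'c')");
    }
  }

  @After
  public void cleanTestData() throws Exception {
    // Each test cleans what it loaded, so no assumptions are made about test ordering
    // and write tests always start from an empty table.
    try (Statement stmt = connection.createStatement()) {
      stmt.executeUpdate("DROP TABLE IF EXISTS it_data");
    }
    connection.close();
  }

  @Test
  public void readSeesTheLoadedRows() throws Exception {
    // The actual test would run a Beam pipeline against the already-running instance;
    // here we only touch the fixture itself to keep the sketch self-contained.
    try (Statement stmt = connection.createStatement()) {
      stmt.executeQuery("SELECT COUNT(*) FROM it_data");
    }
  }
}
```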

If you have any thoughts or remarks I'm all ears :)

Regards,

Etienne

On 14/12/2016 at 17:07, Jean-Baptiste Onofré wrote:

Hi Stephen,

the purpose of having them in a specific module is to share resources,
apply the same behavior from an IT perspective, and be able to have
"cross-IO" ITs (for instance, reading from JMS and sending to Kafka; I
think that's the key idea for integration tests).

For instance, in Karaf, we have:
- utest in each module
- itest module containing itests for all modules all together

Regards
JB

On 12/14/2016 04:59 PM, Stephen Sisk wrote:

Hi Etienne,

thanks for following up and answering my questions.

re: where to store integration tests - having them all in a
separate
module
is an interesting idea. I couldn't find JB's comments about
moving
them
into a separate module in the PR - can you share the reasons
for
doing so?
The IO integration/perf tests do seem like they'll need to be treated in
a special manner, but given that there is already an IO-specific module,
it may just be that we need to treat all the ITs in the IO module the
same way. I don't have strong opinions either way right now.

S

On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <[email protected]> wrote:

Hi guys,

@Stephen: I addressed all your comments directly in the PR,
thanks!
I just wanted to comment here about the docker image I used:
the
only
official Elastic image contains only ElasticSearch. But for
testing I
needed logstash (for ingestion) and kibana (not for
integration
tests,
but to easily test REST requests to ES using sense). This is
why I
use
an ELK (Elasticsearch+Logstash+Kibana) image. This one is released
under the Apache 2 license.


Besides, there is also a point about where to store integration tests:
JB proposed in the PR to store integration tests in a dedicated module
rather than directly in the IO module (like I did).



Etienne

On 01/12/2016 at 20:14, Stephen Sisk wrote:

hey!

thanks for sending this. I'm very excited to see this
change. I
added some
detail-oriented code review comments in addition to what
I've
discussed
here.

The general goal is to allow for re-usable instantiation of
particular

data

store instances and this seems like a good start. Looks like
you
also have
a script to generate test data for your tests - that's
great.

The next steps (definitely not blocking your work) will be
to have
ways to
create instances from the docker images you have here, and
use
them
in the
tests. We'll need support in the test framework for that
since
it'll
be
different on developer machines and in the beam jenkins
cluster,
but
your
scripts here allow someone running these tests locally to
not have
to

worry

about getting the instance set up and can manually adjust,
so this
is
a
good incremental step.

I have some thoughts now that I'm reviewing your scripts
(that I
didn't
have previously, so we are learning this together):
* It may be useful to try and document why we chose a
particular
docker
image as the base (ie, "this is the official supported
elastic
search
docker image" or "this image has several data stores
together that
can be
used for a couple different tests")  - I'm curious as to
whether
the
community thinks that is important

One thing that I called out in the comment that's worth
mentioning
on the
larger list - if you want to specify which specific runners
a test
uses,
that can be controlled in the pom for the module. I updated
the
testing

doc

mentioned previously in this thread with a TODO to talk
about this
more. I
think we should also make it so that IO modules have that
automatically,

so

developers don't have to worry about it.

S

On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <[email protected]> wrote:

Stephen,

As discussed, I added an injection script, docker container scripts and
integration tests to the sdks/java/io/elasticsearch/contrib
<https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9>
directory in that PR: https://github.com/apache/incubator-beam/pull/1439.

These work well but they are a first shot. Do you have any comments
about them?

Besides, I am not very sure that these files should be in the IO itself
(even in the contrib directory, outside the maven source directories).
Any thoughts?

Thanks,

Etienne



On 23/11/2016 at 19:03, Stephen Sisk wrote:

It's great to hear more experiences.

I'm also glad to hear that people see real value in the
high
volume/performance benchmark tests. I tried to capture that
in
the

Testing

doc I shared, under "Reasons for Beam Test Strategy". [1]

It does generally sound like we're in agreement here. Areas
of
discussion

I

see:
1.  People like the idea of bringing up fresh instances for
each
test
rather than keeping instances running all the time, since
that
ensures no
contamination between tests. That seems reasonable to me.
If we
see
flakiness in the tests or we note that setting up/tearing
down
instances

is

taking a lot of time, we can revisit that decision.
2. Deciding on cluster management software/orchestration
software
- I

want

to make sure we land on the right tool here since choosing
the
wrong tool
could result in administration of the instances taking more
work. I

suspect

that's a good place for a follow up discussion, so I'll
start a
separate
thread on that. I'm happy with whatever tool we choose, but
I
want
to

make

sure we take a moment to consider different options and have
a
reason for
choosing one.

Etienne - thanks for being willing to port your
creation/other
scripts
over. You might be a good early tester of whether this
system
works
well
for everyone.

Stephen

[1] Reasons for Beam Test Strategy -
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#


On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré
<[email protected]>
wrote:

I second Etienne there.

We worked together on the ElasticsearchIO and definitely, the highly
valuable tests we did were integration tests with ES on docker and with
high volume.

I think we have to distinguish the two kinds of tests:
1. utests are located in the IO itself and basically they
should
cover
the core behaviors of the IO
2. itests are located as contrib in the IO (they could be part of the IO
but executed by the integration-test plugin or a specific profile); they
deal with a "real" backend and high volumes (one possible way of
separating the two kinds is sketched below). The resources required by
the itests can be bootstrapped by Jenkins (for instance using
Mesos/Marathon and docker images as already discussed, and it's what I'm
doing on my own "server").

It's basically what Stephen described.

We should not rely only on itests: utests are very important and they
validate the core behavior.
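One possible way (hypothetical, not an existing convention in the codebase)
to keep the two kinds of tests selectable by the build is a JUnit category:
utests run everywhere, while itests carry a marker that the integration-test
plugin or a Maven profile can include or exclude.

```java
import org.junit.Test;
import org.junit.experimental.categories.Category;

public class ElasticsearchIOTestSketch {

  /** Hypothetical marker interface the build can filter on. */
  public interface IntegrationTest {}

  @Test
  public void coreBehaviorWithEmbeddedBackend() {
    // utest: runs against an embedded/in-memory backend, always executed.
  }

  @Test
  @Category(IntegrationTest.class)
  public void highVolumeAgainstRealBackend() {
    // itest: talks to a real, already-provisioned backend with high volume;
    // only executed when the integration profile/plugin is activated.
  }
}
```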

My $0.01 ;)

Regards
JB

On 11/23/2016 09:27 AM, Etienne Chauchot wrote:

Hi Stephen,

I like your proposition very much and I also agree that
docker
+
some
orchestration software would be great !

On the ElasticsearchIO (PR to be created this week) there are docker
container creation scripts and a logstash data ingestion script for the
IT environment, available in the contrib directory alongside the
integration tests themselves. I'll be happy to make them compliant with
the new IT environment.

What you say below about the need for an external IT environment is
particularly true. As an example with ES, what came out in the first
implementation was that there were problems starting at some high volume
of data (timeouts, ES windowing overflow...) that could not have been
seen on the embedded ES version. Also there were some particularities of
an external instance, like secondary (replica) shards, that were not
visible on the embedded instance.

Besides, I also favor bringing up instances before the tests because it
allows (amongst other things) being sure to start on a fresh dataset, for
the test to be deterministic.

Etienne


On 23/11/2016 at 02:00, Stephen Sisk wrote:

Hi,

I'm excited we're getting lots of discussion going.
There are
many
threads
of conversation here, we may choose to split some of
them off
into a
different email thread. I'm also betting I missed some
of the
questions in
this thread, so apologies ahead of time for that. Also
apologies
for

the

amount of text, I provided some quick summaries at the top
of
each
section.

Amit - thanks for your thoughts. I've responded in
detail
below.
Ismael - thanks for offering to help. There's plenty of
work
here to

go

around. I'll try and think about how we can divide up some
next
steps
(probably in a separate thread.) The main next step I
see is
deciding
between kubernetes/mesos+marathon/docker swarm - I'm
working
on
that,

but

having lots of different thoughts on what the
advantages/disadvantages

of

those are would be helpful (I'm not entirely sure of the
protocol for
collaborating on sub-projects like this.)

These issues are all related to what kind of tests we
want to
write. I
think a kubernetes/mesos/swarm cluster could support all
the
use
cases
we've discussed here (and thus should not block moving
forward
with
this),
but understanding what we want to test will help us
understand
how the
cluster will be used. I'm working on a proposed user
guide for
testing

IO

Transforms, and I'm going to send out a link to that + a
short
summary

to

the list shortly so folks can get a better sense of where
I'm
coming
from.



Here's my thinking on the questions we've raised here -

Embedded versions of data stores for testing
--------------------
Summary: yes! But we still need real data stores to test
against.

I am a gigantic fan of using embedded versions of the
various
data
stores.
I think we should test everything we possibly can using
them,
and do

the

majority of our correctness testing using embedded versions
+ the

direct

runner. However, it's also important to have at least one
test
that
actually connects to an actual instance, so we can get
coverage
for
things
like credentials, real connection strings, etc...
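As a sketch of what that "at least one test against an actual instance"
could look like, the connection details below are read from system
properties (the property names are hypothetical) and the test is skipped
when they are not provided, so it only runs where a real instance is
available.

```java
import org.junit.Assume;
import org.junit.Before;
import org.junit.Test;

public class RealInstanceConnectivityIT {

  private String host;
  private String password;

  @Before
  public void readConnectionInfo() {
    // Hypothetical property names; exercising real credentials and connection
    // strings is exactly what embedded versions cannot cover.
    host = System.getProperty("it.datastore.host");
    password = System.getProperty("it.datastore.password");
    Assume.assumeTrue("Real-instance connection info not provided; skipping.",
        host != null && password != null);
  }

  @Test
  public void canConnectWithRealCredentials() {
    // Open a client against (host, password) and read/write a handful of records here.
  }
}
```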

The key point is that embedded versions definitely can't
cover
the
performance tests, so we need to host instances if we
want to
test

that.

I consider the integration tests/performance benchmarks to
be
costly
things
that we do only for the IO transforms with large amounts
of
community
support/usage. A random IO transform used by a few users
doesn't
necessarily need integration & perf tests, but for
heavily
used
IO
transforms, there's a lot of community value in these
tests.
The
maintenance proposal below scales with the amount of
community
support
for
a particular IO transform.



Reusing data stores ("use the data stores across
executions.")
------------------
Summary: I favor a hybrid approach: some frequently
used, very
small
instances that we keep up all the time + larger
multi-container
data
store
instances that we spin up for perf tests.

I don't think we need to have a strong answer to this
question,
but I
think
we do need to know what range of capabilities we need,
and use
that to
inform our requirements on the hosting infrastructure. I
think
kubernetes/mesos + docker can support all the scenarios
I
discuss

below.

I had been thinking of a hybrid approach - reuse some
instances
and

don't

reuse others. Some tests require isolation from other
tests
(eg.
performance benchmarking), while others can easily
re-use the
same
database/data store instance over time, provided they
are
written in

the

correct manner (eg. a simple read or write correctness
integration

tests)

To me, the question of whether to use one instance over
time
for
a
test vs
spin up an instance for each test comes down to a trade-off between

these

factors:
1. Flakiness of spin-up of an instance - if it's super
flaky,
we'll
want to
keep more instances up and running rather than bring
them
up/down.

(this

may also vary by the data store in question)
2. Frequency of testing - if we are running tests every
5
minutes, it

may

be wasteful to bring machines up/down every time. If we
run
tests once

a

day or week, it seems wasteful to keep the machines up the
whole
time.
3. Isolation requirements - If tests must be isolated,
it
means
we

either

have to bring up the instances for each test, or we have
to
have
some
sort
of signaling mechanism to indicate that a given instance
is in
use. I
strongly favor bringing up an instance per test.
4. Number/size of containers - if we need a large number
of
machines
for a
particular test, keeping them running all the time will
use
more
resources.


The major unknown to me is how flaky it'll be to spin
these
up.
I'm
hopeful/assuming they'll be pretty stable to bring up,
but I
think the
best
way to test that is to start doing it.

I suspect the sweet spot is the following: have a set of
very
small

data

store instances that stay up to support small-data-size
post-commit
end to
end tests (post-commits run frequently and the data size
means
the
instances would not use many resources), combined with
the
ability to
spin
up larger instances for once a day/week performance
benchmarks
(these

use

up more resources and are used less frequently.) That's
the mix
I'll
propose in my docs on testing IO transforms.  If
spinning up
new
instances
is cheap/non-flaky, I'd be fine with the idea of
spinning up
instances
for
each test.



Management ("what's the overhead of managing such a
deployment")
--------------------
Summary: I propose that anyone can contribute scripts
for
setting up

data

store instances + integration/perf tests, but if the
community
doesn't
maintain a particular data store's tests, we disable the
tests
and
turn off
the data store instances.

Management of these instances is a crucial question.
First,
let's

break

down what tasks we'll need to do on a recurring basis:
1. Ongoing maintenance (update to new versions, both
instance
&
dependencies) - we don't want to have a lot of old
versions
vulnerable

to

attacks/buggy
2. Investigate breakages/regressions
(I'm betting there will be more things we'll discover -
let me
know if
you
have suggestions)

There's a couple goals I see:
1. We should only do sys admin work for things that give
us a
lot of
benefit. (ie, don't build IT/perf/data store set up
scripts
for
data
stores
without a large community)
2. We should do as much as possible of testing via
in-memory/embedded
testing (as you brought up).
3. Reduce the amount of manual administration overhead

As I discussed above, I think that integration
tests/performance
benchmarks
are costly things that we should do only for the IO
transforms
with

large

amounts of community support/usage. Thus, I propose that
we
limit the

IO

transforms that get integration tests & performance
benchmarks to

those

that have community support for maintaining the data store
instances.

We can enforce this organically using some simple rules:
1. Investigating breakages/regressions: if a given
integration/perf

test

starts failing and no one investigates it within a set
period of
time

(a

week?), we disable the tests and shut off the data store
instances if

we

have instances running. When someone wants to step up and
support it
again,
they can fix the test, check it in, and re-enable the
test.
2. Ongoing maintenance: every N months, file a jira
issue that
is just
"is
the IO Transform X data store up to date?" - if the jira
is
not
resolved in
a set period of time (1 month?), the perf/integration
tests
are

disabled,

and the data store instances shut off.

This is pretty flexible -
* If a particular person or organization wants to
support an
IO
transform,
they can. If a group of people all organically organize
to
keep
the

tests

running, they can.
* It can be mostly automated - there's not a lot of
central
organizing
work
that needs to be done.

Exposing the information about what IO transforms
currently
have

running

IT/perf benchmarks on the website will let users know what
IO

transforms

are well supported.

I like this solution, but I also recognize this is a
tricky
problem.

This

is something the community needs to be supportive of, so
I'm
open to
other
thoughts.


Simulating failures in real nodes ("programmatic tests to simulate failure")
-----------------
Summary: 1) Focus our testing on the code in Beam 2) We should encourage a
design pattern separating out network/retry logic from the main IO
transform logic
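As an illustration of the second point (the name and shape below are only a
sketch, not an agreed design), retry/backoff logic can live in a small
helper with no network dependencies, so unit tests can simulate transient
failures by handing it a Callable that fails a few times, while the IO
transform simply delegates to it:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Sketch of a retry helper that unit tests can exercise without any real cluster. */
public class RetryingCaller {

  private final int maxAttempts;
  private final long initialBackoffMillis;

  public RetryingCaller(int maxAttempts, long initialBackoffMillis) {
    this.maxAttempts = maxAttempts;
    this.initialBackoffMillis = initialBackoffMillis;
  }

  public <T> T call(Callable<T> request) throws Exception {
    long backoffMillis = initialBackoffMillis;
    IOException lastTransientFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return request.call();
      } catch (IOException e) {
        // Only transient, network-level failures are retried; other exceptions propagate.
        lastTransientFailure = e;
        Thread.sleep(backoffMillis);
        backoffMillis *= 2;
      }
    }
    throw lastTransientFailure;
  }
}
```

A unit test can then pass a Callable that fails twice and succeeds on the
third call, which makes the "simulate failure" scenario a plain unit test
rather than something that needs cluster-level fault injection.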





--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
