Hi guys,
sorry I'm a bit late on this thread.
I think we can move forward step by step. I started to prepare the
ITests for the Redis and RabbitMQ IOs; I will submit the PRs.
I agree with Stephen: we have an agreement, so it makes sense to move
forward.
Stephen, would you have some time next week to discuss and move things
forward?
Thanks!
Regards
JB
On 01/13/2017 03:53 AM, Stephen Sisk wrote:
Thanks, that helps. I've been thinking of this mostly for connection
options, which I'm hoping should be pretty generic.
Do we have known examples that we will want to use this with today, or is
this a hypothetical? If it's hypothetical for now, would it make sense to
defer making changes to TestPipeline?
I suspect I'm fixating too much on one part - basically, it sounds like
we've got an agreement. I'm excited about this, and I'll plan on having ITs
that assume they get connection info from pipeline options.
S
On Thu, Jan 12, 2017 at 6:04 PM Lukasz Cwik <[email protected]>
wrote:
Imagine you had a KafkaReadIT and a KafkaWriteIT. Both ITs share the same
Kafka configuration options (host, topic, ...) found in
KafkaITPipelineOptions, yet KafkaReadIT also needs an additional parameter
(like the position to start reading the Kafka topic from) found in
KafkaReadITPipelineOptions, which extends KafkaITPipelineOptions.
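A minimal sketch of that hierarchy (the interfaces below are illustrative,
not existing Beam code):

    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;

    /** Shared Kafka options used by all Kafka ITs (illustrative names). */
    public interface KafkaITPipelineOptions extends PipelineOptions {
      @Description("Kafka bootstrap server, host:port")
      String getKafkaBootstrapServer();
      void setKafkaBootstrapServer(String value);

      @Description("Kafka topic the ITs read/write")
      String getKafkaTopic();
      void setKafkaTopic(String value);
    }

    /** Read-specific options inherit all of the shared Kafka options. */
    public interface KafkaReadITPipelineOptions extends KafkaITPipelineOptions {
      @Description("Position in the topic to start reading from")
      Long getStartOffset();
      void setStartOffset(Long value);
    }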
On Thu, Jan 12, 2017 at 5:56 PM, Stephen Sisk <[email protected]>
wrote:
I see the need for / like KinesisIOTestPipelineOptions - it would allow all
ITs/perf tests that need a Kinesis connection to get one.
What problem are we trying to solve with KinesisIOTest1PipelineOptions?
This presumes that there would also be a KinesisIOTest2PipelineOptions?
What would be different between the two? Maybe it would help if we had a
scenario and used "real" names (i.e., what we'd actually call the variables).
S
On Thu, Jan 12, 2017 at 5:47 PM Jason Kuster <[email protected]>
wrote:
Stephen, I agree with you about option 2 -- I think it strikes the right
balance. To your question at the end, Luke essentially answered it, but to
elaborate: we could make e.g. KinesisIOTestPipelineOptions which extends
TestPipelineOptions, and then KinesisIOTest1PipelineOptions which extends
KinesisIOTestPipelineOptions (whew, that's a workout for the fingers). It
requires a change to how we parse PipelineOptions inside of
TestPipeline.java, which means we lose a bit of user-friendliness, but
enables this scenario, which I think is worth the tradeoff.
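For reference, a rough sketch of how an IT could pick such options up --
TestPipeline.testingPipelineOptions() already parses the
-DbeamTestPipelineOptions system property; the KinesisIOTestPipelineOptions
interface itself is hypothetical:

    import org.apache.beam.sdk.testing.TestPipeline;
    import org.junit.Test;

    public class KinesisIOReadIT {
      @Test
      public void testReadFromKinesis() {
        // Parse -DbeamTestPipelineOptions, then view the result through
        // the hypothetical KinesisIOTestPipelineOptions interface.
        KinesisIOTestPipelineOptions options =
            TestPipeline.testingPipelineOptions()
                .as(KinesisIOTestPipelineOptions.class);
        // ... connect to Kinesis using the parsed options ...
      }
    }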
On Tue, Jan 10, 2017 at 4:23 PM, Lukasz Cwik <[email protected]>
wrote:
PipelineOptions allows you to aggregate common options within a parent
interface, with child interfaces extending and inheriting those options.
For example, you can have MyIOOptionsTest1 which extends MyIOOptions and
also MyIOOptionsTest2 which extends MyIOOptions. MyIOOptions can have all
the shared fields such as port or host or whatever, and then
MyIOOptionsTest1 can have specific parameters for Test1 and
MyIOOptionsTest2 can have specific parameters for Test2.
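To make that concrete, a small sketch using the placeholder names above
(PipelineOptionsFactory is the existing parsing entry point):

    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    interface MyIOOptions extends PipelineOptions {
      String getHost();
      void setHost(String value);
      Integer getPort();
      void setPort(Integer value);
    }

    interface MyIOOptionsTest1 extends MyIOOptions {
      String getTest1Param();
      void setTest1Param(String value);
    }

    public class OptionsInheritanceDemo {
      public static void main(String[] args) {
        // Register the child interface so the parser accepts its flags.
        PipelineOptionsFactory.register(MyIOOptionsTest1.class);
        MyIOOptionsTest1 options = PipelineOptionsFactory
            .fromArgs("--host=localhost", "--port=1234", "--test1Param=foo")
            .as(MyIOOptionsTest1.class);
        // host and port come from the shared parent interface.
        System.out.println(options.getHost() + ":" + options.getPort());
      }
    }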
On Tue, Jan 10, 2017 at 4:11 PM, Stephen Sisk <[email protected]>
wrote:
Thanks Jason for doing this research! There are a lot of options.
You mentioned "Steven" - I assume you're talking about me :)
As you mention in your goals, we want this to work for both individual devs
working on a particular store, as well as the Beam CI system. I'm inclined
to favor making this pretty easy for individual devs, as long as it doesn't
come at the cost of making CI config painful. Given that, I suspect we'd
want something that's straightforward to discover how to configure, that
gives an obvious error message if it's misconfigured, and that is easy to
learn.
In that world, I would lean towards options 2 & 3. I think a file that has
to be maintained/configured separately outside of the repository isn't the
most developer-friendly option. That may be somewhat of a workflow thing.
And if it was super-important to keep the passwords secret, then a file
outside of the repository might be the only option that would work -
however, I think we can simplify the problem and secure our test servers
by keeping them on a private network. (I don't think we are an important
target, so I'm not too worried about defense in depth.)
Between options 2 & 3, I would favor option 2 - I think having defaults
pre-set to work in the Beam CI environment will lead to a lot of potential
confusion on the part of developers, since it would let them run the test
and then only give them errors about connecting to a server. I think it'd
be better to get errors about not having pipeline options set, or errors
about obviously wrong/default pipeline option values, rather than
"correct, but not for your environment" settings. For example, if we have
valid defaults and a user only sets some of the values (i.e., sets IP but
not port), it'd be trickier to detect when the value being used is valid.
I'd want the test to be able to show users an error that the option wasn't
set, rather than have it try to connect and time out b/c the port isn't
correct.
I also don't think that having per-IT options is a good idea. If 2
different ITs can share a data store, we should be able to pass them the
same config - that also makes me strongly favor an option where we have a
shared set of options. Do we *have* to make it part of
TestPipelineOptions? Can we make something like a
BeamIOTestPipelineOptions?
This generally leads me to favor option 2.
S
On Tue, Jan 10, 2017 at 3:39 PM Jason Kuster <[email protected]>
wrote:
Hi all,
Following up on some of the discussions already on-list, I wanted to
solicit some more feedback about some implementation details regarding the
IO Integration Tests.
As it currently stands, we mostly have IO ITs for GCP-based IO, which our
GCP-based Jenkins executors handle natively, but as our integration test
coverage begins to expand, we're going to run into several of the problems
relevant to what Steven is doing with hosting data stores for use by ITs.
I wanted to get people's feedback about how to handle passing credentials
to the ITs. We have a couple options, motivated by some goals.
Goals:
* Has to work in the Apache Beam CI environment.
* Has to run on dev machines (w/o access to the Beam CI environment).
* One way of passing datastore config only.
* An individual IT fails fast if it is run without valid config.
* IO performance tests will have a validation component (this implies we
need to run the IO ITs, not just the IO IT pipelines).
* Devs working on an individual IO transform can run integration & perf
tests without recreating the data stores every time.
* Devs working on a runner's IO can run all the IO integration & perf
tests. They may have to recreate the data stores every time (or possibly
have a manual config that helps with this). It's okay if this world is a
bit awkward; it just needs to be possible.
Option 1: IO Configuration File
The first option is to read all credentials from some file stored on disk.
We can define a location for an (XML, JSON, YAML, etc.) file which we can
read in each IT to find the credentials that IT needs. This method has a
couple of upsides and a couple of downsides. (A minimal sketch of this
approach appears after the list below.)
* Upsides
  * Passing credentials to ITs, and adding new credentials, is relatively
    easy.
  * Individual users can spin up their own data stores, put the
    credentials in the file, run their ITs, and have things just work.
* Downsides
  * Relying on a file, especially a file not checked in to the repository
    (to prevent people from accidentally sharing credentials to their data
    store, etc.), is fragile and can lead to some confusing failure cases.
  * ITs are supposed to be self-contained; using a file on disk makes
    things like running them in CI harder.
  * It seems like datastore location, username, and password are things
    that are a better fit for the IT PipelineOptions anyway.
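As a purely illustrative sketch of the file-based approach (the system
property, path, and property names are invented for the example):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    public class ITCredentialsFile {
      // Load data-store credentials from a file kept outside the repo;
      // the default path here is invented for the example.
      public static Properties load() throws IOException {
        String path = System.getProperty("beam.it.credentialsFile",
            System.getProperty("user.home") + "/.beam-it-credentials.properties");
        Properties credentials = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
          credentials.load(in);
        }
        return credentials; // e.g. credentials.getProperty("mystore.host")
      }
    }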
Option 2: TestPipelineOptions
Another option is to specify them as general PipelineOptions on
TestPipelineOptions and then to build the specific IT's options from
there. For example, say we have MyStoreIT1, MyStoreIT2, and MyStoreIT3. We
could specify inside of TestPipelineOptions some options like "MyStoreIP",
"MyStoreUsername", and "MyStorePassword", and then the command for
invoking them would look like (omitting some irrelevant things):
mvn clean verify -DskipITs=false -DbeamTestPipelineOptions='[...,
"--MyStoreIP=1.2.3.4", "--MyStoreUsername=beamuser",
"--MyStorePassword=hunter2"]'
(A sketch of what those options could look like appears after the list
below.)
* Upsides
  * Test is self-contained -- no dependency on an external file, and all
    relevant things can be specified on the command line; easier for users
    and CI.
  * Passing credentials to ITs via PipelineOptions feels better.
* Downsides
  * Harder to pass different credentials to one specific IT; e.g. I want
    MyStoreIT1 and 2 to run against 1.2.3.4, but MyStoreIT3 to run against
    9.8.7.6.
  * Investing in this pattern means a proliferation of
    TestPipelineOptions. Potentially bad, especially for a CI suite
    running a large number of ITs -- the size of command line args may
    become unmanageable with 5+ data stores.
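For what it's worth, a hedged sketch of what those shared options could
look like -- MyStoreTestOptions is invented, and note that Beam would
actually derive the flags as lower-camelCase (e.g. --myStoreIP):

    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.testing.TestPipelineOptions;

    // Invented interface holding the shared MyStore options; in practice
    // these getters could also live directly on TestPipelineOptions.
    public interface MyStoreTestOptions extends TestPipelineOptions {
      @Description("IP of the MyStore instance the ITs should use")
      String getMyStoreIP();
      void setMyStoreIP(String value);

      @Description("Username for the MyStore instance")
      String getMyStoreUsername();
      void setMyStoreUsername(String value);

      @Description("Password for the MyStore instance")
      String getMyStorePassword();
      void setMyStorePassword(String value);
    }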
Option 3: Individual IT Options
The last option I can think of is to specify the options directly on the
IT's options, e.g. MyStoreIT1Options, and set defaults which work well for
CI. This means that CI could run an entire suite of ITs without specifying
any arguments, trusting that the ITs' defaults will work, but it means an
individual developer is potentially able to run only one IT at a time,
since it will be impossible to override all options from the command line.
(A sketch follows the list below.)
* Upsides
  * Test is still self-contained, and even more so -- possible to specify
    args targeted at one IT in particular.
  * Args are specified right where they're used; way smaller chance of
    confusion or mistakes.
  * Easiest for CI -- as long as defaults for data store auth and location
    are correct from the perspective of the Jenkins executor, it can
    essentially just turn all ITs on and run them as is.
* Downsides
  * Hardest for individual developers to run an entire suite of ITs --
    since defaults are configured for running in the CI environment, they
    will likely fail when running on the user's machine, resulting in
    annoyance for the user.
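A sketch of what the per-IT defaults could look like (the interface name
and default values are invented):

    import org.apache.beam.sdk.options.Default;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;

    public interface MyStoreIT1Options extends PipelineOptions {
      @Description("MyStore host; the default targets the CI-managed instance")
      @Default.String("mystore.ci.internal")  // invented hostname
      String getMyStoreHost();
      void setMyStoreHost(String value);

      @Description("MyStore port")
      @Default.Integer(5432)  // invented port
      Integer getMyStorePort();
      void setMyStorePort(Integer value);
    }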
If anyone has thoughts on these, please let me know.
Best,
Jason
--
-------
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow
--
-------
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com