Stephen, I agree with you about option 2 -- I think it strikes the right
balance. To your question at the end, Luke essentially answered it, but to
elaborate: we could make, e.g., a KinesisIOTestPipelineOptions which extends
TestPipelineOptions, and then a KinesisIOTest1PipelineOptions which extends
KinesisIOTestPipelineOptions (whew, that's a workout for the fingers). It
requires a change to how we parse PipelineOptions inside TestPipeline.java,
which means we lose a bit of user-friendliness, but it enables this scenario,
which I think is worth the tradeoff.
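
For concreteness, here is a rough sketch of that hierarchy. The Kinesis-specific getters are invented for illustration, and a minimal stand-in for TestPipelineOptions is inlined (the real interface lives in the Beam SDK) so the snippet compiles on its own:

```java
public class KinesisOptionsSketch {
  // Stand-in for org.apache.beam.sdk.testing.TestPipelineOptions; inlined
  // here so the sketch compiles alone. The real interface is in the Beam SDK.
  interface TestPipelineOptions {
    String getTempRoot();
  }

  // Shared options for every Kinesis IT: one place for stream name and
  // credentials. (These getter names are hypothetical.)
  interface KinesisIOTestPipelineOptions extends TestPipelineOptions {
    String getAwsKinesisStream();
    String getAwsAccessKey();
  }

  // A per-test interface inherits the shared Kinesis fields and adds its own.
  interface KinesisIOTest1PipelineOptions extends KinesisIOTestPipelineOptions {
    int getNumberOfRecords();
  }

  public static void main(String[] args) {
    // The parsing change in TestPipeline.java would resolve args against the
    // leaf interface; inheritance supplies the shared fields.
    System.out.println(TestPipelineOptions.class
        .isAssignableFrom(KinesisIOTest1PipelineOptions.class)); // prints "true"
  }
}
```

With the real SDK, something like PipelineOptionsFactory.fromArgs(args).as(KinesisIOTest1PipelineOptions.class) would build the leaf options.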

On Tue, Jan 10, 2017 at 4:23 PM, Lukasz Cwik <[email protected]>
wrote:

> PipelineOptions allows you to aggregate common options within a parent
> interface with child interfaces extending and inheriting those options.
>
> For example, you can have MyIOOptionsTest1 which extends MyIOOptions
> and also MyIOOptionsTest2 which extends MyIOOptions. MyIOOptions can
> have all the shared fields, such as port or host or whatever, and then
> MyIOOptionsTest1 can have specific parameters for Test1 and
> MyIOOptionsTest2 can have specific parameters for Test2.
>
>
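
The pattern Luke describes can be sketched end-to-end in plain Java; the tiny arg parser below is only a toy stand-in for Beam's PipelineOptionsFactory, and the specific getters are invented for illustration:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

public class OptionsInheritanceSketch {
  // Shared options for all MyStore ITs (interface names from this thread).
  interface MyIOOptions {
    String getMyStoreIP();
    int getMyStorePort();
  }

  // Per-test interfaces inherit the shared fields and add their own.
  interface MyIOOptionsTest1 extends MyIOOptions {
    String getTest1TableName();
  }

  interface MyIOOptionsTest2 extends MyIOOptions {
    int getTest2BatchSize();
  }

  // Toy stand-in for PipelineOptionsFactory: parses "--Name=value" args and
  // serves them through the requested interface via a dynamic proxy.
  static <T> T fromArgs(Class<T> iface, String... args) {
    Map<String, String> values = new HashMap<>();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      values.put(arg.substring(2, eq), arg.substring(eq + 1));
    }
    InvocationHandler handler = (proxy, method, a) -> {
      String key = method.getName().substring(3); // strip "get"
      String raw = values.get(key);
      return method.getReturnType() == int.class ? Integer.parseInt(raw) : raw;
    };
    return iface.cast(Proxy.newProxyInstance(
        iface.getClassLoader(), new Class<?>[] {iface}, handler));
  }

  public static void main(String[] args) {
    MyIOOptionsTest1 opts = fromArgs(MyIOOptionsTest1.class,
        "--MyStoreIP=1.2.3.4", "--MyStorePort=5432", "--Test1TableName=users");
    // Shared fields come from the parent interface; Test1TableName is specific.
    System.out.println(opts.getMyStoreIP() + ":" + opts.getMyStorePort()
        + " table=" + opts.getTest1TableName()); // prints "1.2.3.4:5432 table=users"
  }
}
```

In the real SDK, roughly PipelineOptionsFactory.fromArgs(args).as(MyIOOptionsTest1.class) plays the role of fromArgs here.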
> On Tue, Jan 10, 2017 at 4:11 PM, Stephen Sisk <[email protected]>
> wrote:
>
> > Thanks, Jason, for doing this research! There are a lot of options.
> >
> > You mentioned "Steven" - I assume you're talking about me :)
> >
> > As you mention in your goals, we want this to work for both individual
> > devs working on a particular store, as well as the beam CI system. I'm
> > inclined to favor making this pretty easy for individual devs, as long
> > as it doesn't come at the cost of making CI config painful. Given
> > that, I suspect that we'd want something that's straightforward to
> > discover how to configure, that gives an obvious error message if it's
> > misconfigured, and that is easy to learn.
> > In that world, I would lean towards options 2 & 3. I think a file that
> > has to be maintained/configured separately outside of the repository
> > isn't the most developer-friendly option. That may be somewhat of a
> > workflow thing. And if it were super-important to keep the passwords
> > secret, then a file outside of the repository might be the only option
> > that would work -- however, I think we can simplify the problem and
> > secure our test servers by keeping them on a private network. (I don't
> > think we are an important target, so I'm not too worried about defense
> > in depth.)
> >
> > Between options 2 & 3, I would favor option 2 -- I think having
> > defaults pre-set to work in the beam CI environment will lead to a lot
> > of potential confusion on the part of developers, since it would let
> > them run the test and then only give them errors about connecting to a
> > server. I think it'd be better to get errors about pipeline options
> > not being set, or about obviously wrong/default pipeline option
> > values, rather than settings that are "correct, but not for your
> > environment". For example, if we have valid defaults and a user only
> > sets some of the values (i.e., sets the IP but not the port), it'd be
> > trickier to detect when the value being used is valid. I'd want the
> > test to be able to show users an error that the option wasn't set,
> > rather than have it try to connect and time out b/c the port isn't
> > correct.
> >
> > I also don't think that having per-IT options is a good idea. If 2
> > different ITs can share a data store, we should be able to pass them
> > the same config -- that also makes me strongly favor an option where
> > we have a shared set of options. Do we *have* to make it part of
> > TestPipelineOptions? Can we make something like a
> > BeamIOTestPipelineOption?
> >
> > This generally leads me to favor option 2.
> >
> > S
> >
> > On Tue, Jan 10, 2017 at 3:39 PM Jason Kuster <[email protected]>
> > wrote:
> >
> > > Hi all,
> > >
> > > Following up on some of the discussions already on-list, I wanted to
> > > solicit some more feedback about some implementation details regarding
> > the
> > > IO Integration Tests.
> > >
> > > As it currently stands, we mostly have IO ITs for GCP-based IO, which
> > > our GCP-based Jenkins executors handle natively, but as our
> > > integration test coverage begins to expand, we're going to run into
> > > several of the problems relevant to what Steven is doing with hosting
> > > data stores for use by ITs. I wanted to get people's feedback about
> > > how to handle passing credentials to the ITs. We have a couple of
> > > options, motivated by some goals.
> > >
> > > Goals:
> > >
> > > * Has to work in the Apache Beam CI environment.
> > > * Has to run on dev machines (w/o access to the beam CI environment).
> > > * Only one way of passing data store config.
> > > * An individual IT fails fast if run without valid config.
> > > * IO performance tests will have a validation component (this implies
> > > we need to run the IO ITs, not just the IO IT pipelines).
> > > * Devs working on an individual IO transform can run integration &
> > > perf tests without recreating the data stores every time.
> > > * Devs working on a runner's IO can run all the IO integration & perf
> > > tests. They may have to recreate the data stores every time (or
> > > possibly have a manual config that helps with this). It's okay if
> > > this world is a bit awkward; it just needs to be possible.
> > >
> > >
> > > Option 1: IO Configuration File
> > >
> > > The first option is to read all credentials from some file stored on
> > > disk. We can define a location for an (xml, json, yaml, etc.) file
> > > which we can read in each IT to find the credentials that IT needs.
> > > This method has a couple of upsides and a couple of downsides.
> > >
> > > * Upsides
> > >     * Passing credentials to ITs, and adding new credentials, is
> > > relatively easy.
> > >     * Individual users can spin up their own data stores, put the
> > > credentials in the file, run their ITs, and have things just work.
> > > * Downsides
> > >     * Relying on a file, especially a file not checked in to the
> > > repository (to prevent people from accidentally sharing credentials
> > > to their data store, etc.), is fragile and can lead to some confusing
> > > failure cases.
> > >     * ITs are supposed to be self-contained; using a file on disk
> > > makes things like running them in CI harder.
> > >     * It seems like data store location, username, and password are
> > > things that are a better fit for the IT's PipelineOptions anyway.
> > >
> > >
> > > Option 2: TestPipelineOptions
> > >
> > > Another option is to specify them as general PipelineOptions on
> > > TestPipelineOptions and then to build the specific IT's options from
> > > there. For example, say we have MyStoreIT1, MyStoreIT2, and
> > > MyStoreIT3. We could specify inside TestPipelineOptions some options
> > > like "MyStoreIP", "MyStoreUsername", and "MyStorePassword", and then
> > > the command for invoking them would look like (omitting some
> > > irrelevant things):
> > >
> > > mvn clean verify -DskipITs=false -DbeamTestPipelineOptions='[...,
> > > "--MyStoreIP=1.2.3.4", "--MyStoreUsername=beamuser",
> > > "--MyStorePassword=hunter2"]'
> > >
> > > * Upsides
> > >     * Test is self-contained -- no dependency on an external file,
> > > and all relevant things can be specified on the command line; easier
> > > for users and CI.
> > >     * Passing credentials to ITs via PipelineOptions feels better.
> > > * Downsides
> > >     * Harder to pass different credentials to one specific IT; e.g.,
> > > I want MyStoreIT1 and 2 to run against 1.2.3.4, but MyStoreIT3 to run
> > > against 9.8.7.6.
> > >     * Investing in this pattern means a proliferation of options on
> > > TestPipelineOptions. Potentially bad, especially for a CI suite
> > > running a large number of ITs -- the size of the command line args
> > > may become unmanageable with 5+ data stores.
> > >
> > >
> > > Option 3: Individual IT Options
> > >
> > > The last option I can think of is to specify the options directly on
> > > the IT's own options, e.g. MyStoreIT1Options, and set defaults which
> > > work well for CI. This means that CI could run an entire suite of ITs
> > > without specifying any arguments, trusting that the ITs' defaults
> > > will work, but it means an individual developer is potentially able
> > > to run only one IT at a time, since it will be impossible to override
> > > all the options from the command line.
> > >
> > > * Upsides
> > >     * Test is still self-contained, and even more so -- it's possible
> > > to specify args targeted at one IT in particular.
> > >     * Args are specified right where they're used; much smaller
> > > chance of confusion or mistakes.
> > >     * Easiest for CI -- as long as the defaults for data store auth
> > > and location are correct from the perspective of the Jenkins
> > > executor, it can essentially just turn all the ITs on and run them
> > > as is.
> > > * Downsides
> > >     * Hardest for individual developers to run an entire suite of
> > > ITs -- since the defaults are configured for the CI environment, the
> > > tests will likely fail when run on a user's machine, resulting in
> > > annoyance for the user.
> > >
> > >
> > > If anyone has thoughts on these, please let me know.
> > >
> > > Best,
> > >
> > > Jason
> > >
> > > --
> > > -------
> > > Jason Kuster
> > > Apache Beam (Incubating) / Google Cloud Dataflow
> > >
> >
>



-- 
-------
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow
