Yes. I'll be out Monday, but around the rest of the week to discuss.

S
On Fri, Jan 13, 2017 at 12:14 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi guys,

Sorry I'm a bit late on this thread.

I think we might move forward step by step. I started to prepare the ITests for the Redis and RabbitMQ IOs. I will submit the PRs.

Agree with Stephen — we've got an agreement, so I think it makes sense to move forward.

Stephen, would you have some time next week to discuss and move forward on that?

Thanks!
Regards
JB

On 01/13/2017 03:53 AM, Stephen Sisk wrote:

Thanks, that helps. I've been thinking of this mostly for connection options, which I'm hoping should be pretty generic.

Do we have known examples that we will want to use this with today, or is this hypothetical? If it's hypothetical for now, would it make sense to defer making changes to TestPipeline?

I suspect I'm over-fixating on one part — basically it sounds like we've got an agreement. I'm excited about this, and I'll plan on having ITs that assume they get connection info from pipeline options.

S

On Thu, Jan 12, 2017 at 6:04 PM Lukasz Cwik <lc...@google.com.invalid> wrote:

Imagine you had a KafkaReadIT and a KafkaWriteIT. Both ITs share the same Kafka configuration options (host, topic, ...) found in KafkaITPipelineOptions, yet KafkaReadIT also needs an additional parameter (like the position to start reading the Kafka topic from) found in KafkaReadITPipelineOptions, which extends PipelineOptions.

On Thu, Jan 12, 2017 at 5:56 PM, Stephen Sisk <s...@google.com.invalid> wrote:

I see the need for/like KinesisIOTestPipelineOptions — it would allow all ITs/perf tests that need a Kinesis connection to get one.

What problem are we trying to solve with KinesisIOTest1PipelineOptions? This presumes that there would also be a KinesisIOTest2PipelineOptions? What would be different between the two? Maybe it would help if we had a scenario and used "real" names (i.e., what we'd actually call the variables).

S

On Thu, Jan 12, 2017 at 5:47 PM Jason Kuster <jasonkus...@google.com.invalid> wrote:

Stephen, I agree with you about option 2 — I think it strikes the right balance. To your question at the end, Luke essentially answered it, but to elaborate: we could make e.g. KinesisIOTestPipelineOptions, which extends TestPipelineOptions, and then KinesisIOTest1PipelineOptions, which extends KinesisIOTestPipelineOptions (whew, that's a workout for the fingers). It requires a change to how we parse PipelineOptions inside of TestPipeline.java, which means we lose a bit of user-friendliness, but it enables this scenario, which I think is worth the tradeoff.

On Tue, Jan 10, 2017 at 4:23 PM, Lukasz Cwik <lc...@google.com.invalid> wrote:

PipelineOptions allows you to aggregate common options within a parent interface, with child interfaces extending and inheriting those options. For example, you can have MyIOOptionsTest1 which extends MyIOOptions, and also MyIOOptionsTest2 which extends MyIOOptions. MyIOOptions can have all the shared fields, such as port or host, and then MyIOOptionsTest1 can have specific parameters for Test1 and MyIOOptionsTest2 can have specific parameters for Test2.

On Tue, Jan 10, 2017 at 4:11 PM, Stephen Sisk <s...@google.com.invalid> wrote:

Thanks, Jason, for doing this research! There are a lot of options.
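The inheritance pattern Lukasz describes could be sketched roughly as below. All names here are hypothetical; in Beam itself these interfaces would extend PipelineOptions and be instantiated by PipelineOptionsFactory from command-line args, rather than the hand-written stub used in this sketch.

```java
/** Options shared by all Kafka ITs (hypothetical names, after Lukasz's example). */
interface KafkaITPipelineOptions {
    String getKafkaHost();
    String getKafkaTopic();
}

/** Read-specific options layered on top of the shared ones. */
interface KafkaReadITPipelineOptions extends KafkaITPipelineOptions {
    long getStartPosition();
}

public class OptionsHierarchySketch {
    /** Hand-written stub; Beam would generate an implementation from args instead. */
    static KafkaReadITPipelineOptions readOptions() {
        return new KafkaReadITPipelineOptions() {
            @Override public String getKafkaHost() { return "1.2.3.4:9092"; }
            @Override public String getKafkaTopic() { return "beam-it-topic"; }
            @Override public long getStartPosition() { return 0L; }
        };
    }

    public static void main(String[] args) {
        KafkaReadITPipelineOptions read = readOptions();
        // A KafkaWriteIT would accept the supertype and see only the shared fields,
        // so both ITs can share one Kafka configuration.
        KafkaITPipelineOptions shared = read;
        System.out.println(shared.getKafkaHost() + " " + read.getStartPosition());
    }
}
```

The point of the hierarchy is that tests needing only the shared config accept the parent interface, while a read-specific test downcasts to the child.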
You mentioned "Steven" — I assume you're talking about me :)

As you mention in your goals, we want this to work both for individual devs working on a particular store and for the Beam CI system. I'm inclined to favor making this pretty easy for individual devs, as long as it doesn't come at the cost of making CI config painful. Given that, I suspect we'd want something that's straightforward to discover how to configure, gives an obvious error message if misconfigured, and is easy to learn.

In that world, I would lean towards options 2 & 3. I think a file that has to be maintained/configured separately outside of the repository isn't the most developer-friendly option. That may be somewhat of a workflow thing. And if it were super-important to keep the passwords secret, then a file outside of the repository might be the only option that would work — however, I think we can simplify the problem and secure our test servers by keeping them on a private network. (I don't think we are an important target, so I'm not too worried about defense in depth.)

Between options 2 & 3 I would favor option 2 — I think having defaults pre-set to work in the Beam CI environment will lead to a lot of potential confusion on the part of developers, since it would let them run the test and then only give them errors about connecting to a server. I think it'd be better to get errors about not having pipeline options set, or about obviously wrong/default pipeline option values, rather than "correct, but not for your environment" settings. For example, if we have valid defaults and a user only sets some of the values (i.e., sets the IP but not the port), it'd be trickier to detect whether the value being used is valid. I'd want the test to be able to show users an error that the option wasn't set, rather than have it try to connect and time out because the port isn't correct.

I also don't think that having per-IT options is a good idea. If 2 different ITs can share a data store, we should be able to pass them the same config — that also makes me strongly favor an option where we have a shared set of options. Do we *have* to make it part of TestPipelineOptions? Can we make something like a BeamIOTestPipelineOption?

This generally leads me to favor option 2.

S

On Tue, Jan 10, 2017 at 3:39 PM Jason Kuster <jasonkus...@google.com.invalid> wrote:

Hi all,

Following up on some of the discussions already on-list, I wanted to solicit some more feedback about implementation details regarding the IO integration tests.

As it currently stands, we mostly have IO ITs for GCP-based IO, which our GCP-based Jenkins executors handle natively, but as our integration test coverage begins to expand, we're going to run into several of the problems relevant to what Steven is doing with hosting data stores for use by ITs. I wanted to get people's feedback about how to handle passing credentials to the ITs. We have a couple of options, motivated by some goals.

Goals:

* Has to work in the Apache Beam CI environment.
* Has to run on dev machines (without access to the Beam CI environment).
* Only one way of passing datastore config.
* An individual IT fails fast if it is run without valid config.
* IO performance tests will have a validation component (this implies we need to run the IO ITs, not just the IO IT pipelines).
* Devs working on an individual IO transform can run integration & perf tests without recreating the data stores every time.
* Devs working on a runner's IO can run all the IO integration & perf tests. They may have to recreate the data stores every time (or possibly have a manual config that helps with this). It's okay if this world is a bit awkward; it just needs to be possible.

Option 1: IO Configuration File

The first option is to read all credentials from some file stored on disk. We can define a location for an (XML, JSON, YAML, etc.) file which we read in each IT to find the credentials that IT needs. This method has a couple of upsides and a couple of downsides.

* Upsides
  * Passing credentials to ITs, and adding new credentials, is relatively easy.
  * Individual users can spin up their own data stores, put the credentials in the file, run their ITs, and have things just work.
* Downsides
  * Relying on a file, especially a file not checked in to the repository (to prevent people from accidentally sharing credentials to their data store, etc.), is fragile and can lead to some confusing failure cases.
  * ITs are supposed to be self-contained; using a file on disk makes things like running them in CI harder.
  * It seems like datastore location, username, and password are a better fit for the IT PipelineOptions anyway.

Option 2: TestPipelineOptions

Another option is to specify them as general PipelineOptions on TestPipelineOptions and then to build the specific IT's options from there. For example, say we have MyStoreIT1, MyStoreIT2, and MyStoreIT3. We could specify inside of TestPipelineOptions some options like "MyStoreIP", "MyStoreUsername", and "MyStorePassword", and then the command for invoking them would look like (omitting some irrelevant things):

mvn clean verify -DskipITs=false -DbeamTestPipelineOptions='[..., "--MyStoreIP=1.2.3.4", "--MyStoreUsername=beamuser", "--MyStorePassword=hunter2"]'

* Upsides
  * The test is self-contained — no dependency on an external file, and all relevant things can be specified on the command line; easier for users and CI.
  * Passing credentials to ITs via PipelineOptions feels better.
* Downsides
  * Harder to pass different credentials to one specific IT; e.g., I want MyStoreIT1 and 2 to run against 1.2.3.4, but MyStoreIT3 to run against 9.8.7.6.
  * Investing in this pattern means a proliferation of TestPipelineOptions. Potentially bad, especially for a CI suite running a large number of ITs — the size of the command-line args may become unmanageable with 5+ data stores.
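Stephen's preference for an obvious "option wasn't set" error over a connection timeout — and the fail-fast goal above — could look roughly like this under option 2. This is a hypothetical sketch with hand-rolled parsing, not Beam's actual TestPipeline/PipelineOptionsFactory machinery:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: fail fast when a required IT option is missing (hypothetical names). */
public class FailFastOptions {
    /** Parse "--key=value" args into a map, roughly as test options would arrive. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("--") && eq > 2) {
                opts.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return opts;
    }

    /** Throw immediately, with a clear message, if a required option was not set. */
    static String require(Map<String, String> opts, String key) {
        String value = opts.get(key);
        if (value == null || value.isEmpty()) {
            throw new IllegalArgumentException(
                "Missing required IT option --" + key
                + "; pass it via -DbeamTestPipelineOptions");
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> opts =
            parse(new String[] {"--MyStoreIP=1.2.3.4", "--MyStoreUsername=beamuser"});
        // Options that were set come back as values...
        assert require(opts, "MyStoreIP").equals("1.2.3.4");
        // ...and an unset one fails before any connection attempt can time out.
        boolean failedFast = false;
        try {
            require(opts, "MyStorePassword");
        } catch (IllegalArgumentException expected) {
            failedFast = true;
        }
        assert failedFast;
    }
}
```

The key property is that validation happens at option-parsing time, so a misconfigured dev machine produces "missing --MyStorePassword" rather than a slow, opaque network failure.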
Option 3: Individual IT Options

The last option I can think of is to specify the options directly on the IT's own options, e.g. MyStoreIT1Options, and set defaults which work well for CI. This means CI could run an entire suite of ITs without specifying any arguments, trusting that the ITs' defaults will work, but it means an individual developer is potentially able to run only one IT at a time, since it will be impossible to override all options from the command line.

* Upsides
  * The test is still self-contained, and even more so — it's possible to specify args targeted at one IT in particular.
  * Args are specified right where they're used; much smaller chance of confusion or mistakes.
  * Easiest for CI — as long as the defaults for data store auth and location are correct from the perspective of the Jenkins executor, it can essentially just turn all ITs on and run them as is.
* Downsides
  * Hardest for individual developers to run an entire suite of ITs — since the defaults are configured for the CI environment, they will likely fail when run on the user's machine, resulting in annoyance for the user.

If anyone has thoughts on these, please let me know.
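For contrast, a minimal sketch of option 3's defaults behavior. The property name is hypothetical, and system properties stand in here for what Beam options would express with default-value annotations:

```java
/** Hypothetical sketch of option 3: per-IT options with CI-friendly defaults. */
public class MyStoreIT1OptionsSketch {
    // The default points at the (hypothetical) CI-hosted data store; a dev
    // must override it per IT, e.g. -DMyStoreIT1.ip=10.0.0.5.
    static String storeIp() {
        return System.getProperty("MyStoreIT1.ip", "1.2.3.4");
    }

    public static void main(String[] args) {
        // With nothing set, CI's default wins — exactly the failure mode Stephen
        // flags: a dev's run silently uses a value that is "correct, but not for
        // your environment", and only fails later at connection time.
        assert storeIp().equals("1.2.3.4");
        // An explicit override redirects just this one IT.
        System.setProperty("MyStoreIT1.ip", "10.0.0.5");
        assert storeIp().equals("10.0.0.5");
    }
}
```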
Best,

Jason

--
-------
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com