Hi all,

Following up on some of the discussions already on-list, I wanted to
solicit some more feedback about some implementation details regarding the
IO Integration Tests.

As it currently stands, we mostly have IO ITs for GCP-based IO, which our
GCP-based Jenkins executors handle natively, but as our integration test
coverage expands we're going to run into many of the same problems Steven is
tackling with hosting data stores for use by ITs. I wanted to get people's
feedback on how to handle passing credentials to the ITs. We have a couple of
options, motivated by the goals below.

Goals:

* Has to work in the Apache Beam CI environment.
* Has to run on dev machines (without access to the Beam CI environment).
* Only one way of passing data store config.
* An individual IT fails fast if it is run without valid config (see the
small sketch after this list).
* IO performance tests will have a validation component (this implies we
need to run the IO ITs, not just the IO IT pipelines).
* Devs working on an individual IO transform can run integration & perf
tests without recreating the data stores every time.
* Devs working on a runner's IO can run all the IO integration & perf
tests. They may have to recreate the data stores every time (or possibly
have a manual config that helps with this). It's okay if this world is a
bit awkward, it just needs to be possible.
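
For the fail-fast goal, I'm imagining each IT validates its config up front
and fails with a clear message rather than timing out partway through. A
minimal sketch, assuming a placeholder MyStoreOptions interface (how those
options actually get populated is exactly what the rest of this mail is
about):

  @Before
  public void validateOptions() {
    // MyStoreOptions is a placeholder for whichever options interface we
    // settle on; Preconditions and Strings are the Guava utilities.
    MyStoreOptions options =
        TestPipeline.testingPipelineOptions().as(MyStoreOptions.class);
    Preconditions.checkArgument(
        !Strings.isNullOrEmpty(options.getMyStoreIP()),
        "--MyStoreIP must be set to run this IT");
    Preconditions.checkArgument(
        !Strings.isNullOrEmpty(options.getMyStoreUsername()),
        "--MyStoreUsername must be set to run this IT");
  }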


Option 1: IO Configuration File

The first option is to read all credentials from some file stored on disk.
We can define a location for an (XML, JSON, YAML, etc.) file which we can
read in each IT to find the credentials that IT needs. This method has a
couple of upsides and a couple of downsides.
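
For concreteness, a rough sketch of what reading such a file from an IT
might look like -- the path, JSON layout, and field names here are all made
up, not a proposal for the actual format:

  // Hypothetical ~/.beam/io-it-credentials.json:
  //   {"myStore": {"ip": "1.2.3.4", "username": "beamuser", "password": "hunter2"}}
  ObjectMapper mapper = new ObjectMapper();  // Jackson
  JsonNode config = mapper.readTree(
      new File(System.getProperty("user.home"), ".beam/io-it-credentials.json"));
  JsonNode myStore = config.get("myStore");
  String ip = myStore.get("ip").asText();
  String username = myStore.get("username").asText();
  String password = myStore.get("password").asText();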

* Upsides
    * Passing credentials to ITs, and adding new credentials, is relatively
easy.
    * Individual users can spin up their own data stores, put the
credentials in the file, run their ITs and have things just work.
* Downsides
    * Relying on a file, especially a file not checked into the repository
(to prevent people from accidentally sharing credentials for their data
store, etc.) is fragile and can lead to some confusing failure cases.
    * ITs are supposed to be self-contained; using a file on disk makes
things like running them in CI harder.
    * It seems like data store location, username, and password are things
that are a better fit for the IT PipelineOptions anyway.


Option 2: TestPipelineOptions

Another option is to specify the credentials as general options on
TestPipelineOptions and then to build the specific IT's options from there.
For example, say we have MyStoreIT1, MyStoreIT2 and MyStoreIT3. We could
specify inside of TestPipelineOptions some options like "MyStoreIP",
"MyStoreUsername", and "MyStorePassword", and then the command for invoking
them would look like (omitting some irrelevant things):

mvn clean verify -DskipITs=false -DbeamTestPipelineOptions='[...,
"--MyStoreIP=1.2.3.4", "--MyStoreUsername=beamuser",
"--MyStorePassword=hunter2"]'.
* Upsides
    * Test is self-contained -- no dependency on an external file and all
relevant things can be specified on the command line; easier for users and
CI.
    * Passing credentials to ITs via PipelineOptions feels better.
* Downsides
    * Harder to pass different credentials to one specific IT; e.g. I want
MyStoreIT1 and 2 to run against 1.2.3.4, but MyStoreIT3 to run against
9.8.7.6.
    * Investing in this pattern means a proliferation of options on
TestPipelineOptions. Potentially bad, especially for a CI suite running a
large number of ITs -- the size of the command-line args may become
unmanageable with 5+ data stores.


Option 3: Individual IT Options

The last option I can think of is to specify the options directly on the
IT's options, e.g. MyStoreIT1Options, and set defaults which work well for
CI. This means that CI could run an entire suite of ITs without specifying
any arguments, trusting that the ITs' defaults will work, but it means an
individual developer may only be able to run one IT at a time, since it
won't be possible to override every IT's options from the command line.
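
A rough sketch of what that could look like, with the defaults standing in
for whatever the CI-hosted instance ends up using:

  public interface MyStoreIT1Options extends TestPipelineOptions {
    @Description("IP of the MyStore instance used by MyStoreIT1")
    @Default.String("10.0.0.1")  // placeholder for the CI-hosted instance
    String getMyStoreIP();
    void setMyStoreIP(String value);

    @Description("Username for the MyStore instance")
    @Default.String("jenkins")  // placeholder CI default
    String getMyStoreUsername();
    void setMyStoreUsername(String value);
  }
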
* Upsides
    * Test is still self-contained, and even more so -- possible to specify
args targeted at one IT in particular.
    * Args are specified right where they're used; much smaller chance of
confusion or mistakes.
    * Easiest for CI -- as long as defaults for data store auth and
location are correct from the perspective of the Jenkins executor, it can
essentially just turn all ITs on and run them as is.
* Downsides
    * Hardest for individual developers to run an entire suite of ITs --
since defaults are configured for running in the CI environment, they will
likely fail when run on the user's machine, resulting in annoyance for the
user.


If anyone has thoughts on these, please let me know.

Best,

Jason

-- 
-------
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow
