Hi all,

Following up on some of the discussions already on-list, I wanted to solicit more feedback on some implementation details for the IO integration tests.
As it currently stands, we mostly have IO ITs for GCP-based IOs, which our GCP-based Jenkins executors handle natively. As our integration test coverage expands, though, we're going to run into several of the problems relevant to what Steven is doing with hosting data stores for use by ITs. I wanted to get people's feedback on how to handle passing credentials to the ITs. We have a couple of options, motivated by the following goals.

Goals:
* Has to work in the Apache Beam CI environment.
* Has to run on dev machines (without access to the Beam CI environment).
* Only one way of passing data store config.
* An individual IT fails fast if it is run without valid config.
* IO performance tests will have a validation component (this implies we need to run the IO ITs themselves, not just the IO IT pipelines).
* Devs working on an individual IO transform can run integration and perf tests without recreating the data stores every time.
* Devs working on a runner's IO can run all the IO integration and perf tests. They may have to recreate the data stores every time (or possibly have a manual config that helps with this). It's okay if this workflow is a bit awkward; it just needs to be possible.

Option 1: IO Configuration File

The first option is to read all credentials from a file stored on disk. We would define a location for an (XML, JSON, YAML, etc.) file which each IT reads to find the credentials it needs. This method has a couple of upsides and a couple of downsides.

* Upsides
  * Passing credentials to ITs, and adding new credentials, is relatively easy.
  * Individual users can spin up their own data stores, put the credentials in the file, run their ITs, and have things just work.
* Downsides
  * Relying on a file, especially a file not checked in to the repository (to prevent people from accidentally sharing credentials to their data stores, etc.), is fragile and can lead to some confusing failure cases.
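For concreteness, the file-based lookup in Option 1 could be a small helper that each IT calls during setup. This is only a sketch: the class name, the use of a Java properties file, and the key names are all hypothetical, but it shows the fail-fast behavior described in the goals above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/**
 * Hypothetical sketch of Option 1: an IT loads its credentials from a
 * local file (never checked in) and fails fast if a key is absent.
 */
public class ItCredentials {
  private final Properties props = new Properties();

  public ItCredentials(Path configFile) throws IOException {
    try (InputStream in = Files.newInputStream(configFile)) {
      props.load(in);
    }
  }

  /** Returns the credential for {@code key}, or throws so the IT fails fast. */
  public String require(String key) {
    String value = props.getProperty(key);
    if (value == null) {
      throw new IllegalStateException(
          "No value for '" + key + "' in IT config file; did you create your data store?");
    }
    return value;
  }
}
```

An IT would call require(...) for each credential it needs up front, so a missing or stale config file surfaces immediately rather than partway through a pipeline run.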
  * ITs are supposed to be self-contained; depending on a file on disk makes things like running them in CI harder.
  * Data store location, username, and password seem like a better fit for the IT's PipelineOptions anyway.

Option 2: TestPipelineOptions

Another option is to specify the credentials as general options on TestPipelineOptions and then build the specific IT's options from there. For example, say we have MyStoreIT1, MyStoreIT2, and MyStoreIT3. We could add options like "MyStoreIP", "MyStoreUsername", and "MyStorePassword" to TestPipelineOptions, and the command for invoking the tests would look like (omitting some irrelevant flags):

mvn clean verify -DskipITs=false -DbeamTestPipelineOptions='[..., "--MyStoreIP=1.2.3.4", "--MyStoreUsername=beamuser", "--MyStorePassword=hunter2"]'

* Upsides
  * Tests are self-contained -- no dependency on an external file, and everything relevant can be specified on the command line; easier for users and CI.
  * Passing credentials to ITs via PipelineOptions feels like a better fit.
* Downsides
  * Harder to pass different credentials to one specific IT; e.g., I want MyStoreIT1 and MyStoreIT2 to run against 1.2.3.4, but MyStoreIT3 to run against 9.8.7.6.
  * Investing in this pattern means a proliferation of options on TestPipelineOptions. That's potentially bad, especially for a CI suite running a large number of ITs -- with 5+ data stores, the command-line arguments may become unmanageable.

Option 3: Individual IT Options

The last option I can think of is to specify the options directly on each IT's own options interface, e.g. MyStoreIT1Options, and set defaults which work well for CI. This means CI could run an entire suite of ITs without specifying any arguments, trusting that the ITs' defaults will work, but an individual developer may only be able to run one IT at a time, since it would be impossible to override all the options from the command line at once.
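To make Options 2 and 3 concrete, either shape would likely end up as a Beam options interface along these lines (all of the MyStore names are invented for illustration). Under Option 2 the getters would live on TestPipelineOptions itself and be overridden per run; under Option 3 they would live on a per-IT interface like this sketch, with @Default values tuned for the CI executors:

```java
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.testing.TestPipelineOptions;

/** Hypothetical per-IT options interface for Option 3; all names are illustrative. */
public interface MyStoreIT1Options extends TestPipelineOptions {
  @Description("Address of the MyStore instance under test.")
  @Default.String("1.2.3.4") // Option 3: a default that works from the CI executors
  String getMyStoreIP();
  void setMyStoreIP(String value);

  @Description("User name for the MyStore instance under test.")
  @Default.String("beamuser")
  String getMyStoreUsername();
  void setMyStoreUsername(String value);

  @Description("Password for the MyStore instance under test.")
  String getMyStorePassword();
  void setMyStorePassword(String value);
}
```

An individual developer could still override any one of these on the command line (e.g. "--MyStoreIP=9.8.7.6"); the pain point under Option 3 is doing that for every IT in a suite at once.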
* Upsides
  * Tests are still self-contained, and even more so -- it's possible to specify args targeted at one IT in particular.
  * Args are specified right where they're used; much smaller chance of confusion or mistakes.
  * Easiest for CI -- as long as the defaults for data store auth and location are correct from the perspective of the Jenkins executors, CI can essentially just turn all the ITs on and run them as is.
* Downsides
  * Hardest for individual developers to run an entire suite of ITs -- since the defaults are configured for the CI environment, the tests will likely fail when run on a user's machine, resulting in annoyance for the user.

If anyone has thoughts on these, please let me know.

Best,
Jason

--
-------
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow