Hi all!

We are currently working on the IO IT "test harness" that will let us run
the IOITs on various runners and filesystems and with varying amounts of
data. It is described in a doc [1] that some of you have probably already
seen and commented on (in the context of the BEAM-3060 [2] task).

Part of this work that, in our opinion, could be done is validating the
input parameters that are passed to the tests. By "test input parameters" I
specifically mean:
 - parameters passed as Maven system properties, such as:
"-Dfilesystem=gcs" or "-DintegrationTestRunner=dataflow"
 - parameters passed as PipelineOptions, such as "numberOfRecords" or
"filenamePrefix" in TextIO (see the sketch below)

Imagine a situation in which test parameters are passed in an incompatible
way, e.g. someone wants to use the "dataflow" runner with a filesystem that
is unsupported there (say, s3). Running an IOIT with such a setup will almost
certainly fail. The crux of the idea is to inform the developer early enough
that such errors are avoided and test execution time is saved. It eases
debugging and prevents configuration errors.

The doc [1] specifies in detail what has to be validated. We'd like to
validate (see the sketch after this list):
1. runner + filesystem combination - both passed as system properties
2. filesystem + pipeline options - as some additional options may be
required by a specific filesystem
3. runner + pipeline options - as some additional options may be required
by a specific runner
4. IO test class (e.g. TextIO) + options dedicated to it (e.g.
numberOfRecords)
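
A minimal sketch of what checks 1-3 could boil down to. The rule set and the
option names ("hdfsConfiguration", "project") are examples only - in the real
plugin the rules would come from configuration rather than being hardcoded:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative validation of runner/filesystem/pipeline option combinations. */
public class TestSetupValidator {

  // Example rules; the actual ones would be declared in each IOIT's pom.xml.
  private static final Map<String, List<String>> SUPPORTED_FILESYSTEMS = new HashMap<>();

  static {
    SUPPORTED_FILESYSTEMS.put("dataflow", Arrays.asList("gcs", "local"));
    SUPPORTED_FILESYSTEMS.put("direct", Arrays.asList("local", "gcs", "s3", "hdfs"));
  }

  public static void validate(String runner, String filesystem, Map<String, String> options) {
    // 1. runner + filesystem combination
    List<String> supported = SUPPORTED_FILESYSTEMS.get(runner);
    if (supported == null || !supported.contains(filesystem)) {
      throw new IllegalArgumentException(
          String.format("Filesystem '%s' is not supported by runner '%s'.", filesystem, runner));
    }
    // 2. filesystem + pipeline options (e.g. hdfs could require a configuration option)
    if ("hdfs".equals(filesystem) && !options.containsKey("hdfsConfiguration")) {
      throw new IllegalArgumentException("The hdfs filesystem requires the 'hdfsConfiguration' option.");
    }
    // 3. runner + pipeline options (e.g. dataflow could require a project)
    if ("dataflow".equals(runner) && !options.containsKey("project")) {
      throw new IllegalArgumentException("The dataflow runner requires the 'project' option.");
    }
  }
}

Check 4 would follow the same pattern, keyed by the IO test class instead of
the runner.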

We have an idea to write a small Maven plugin for the validation. An
initial PoC demonstrating the concept is on GitHub [3]. The plugin would
essentially be a small module in the ".test-infra" folder - alongside other
test-related infrastructure such as Kubernetes scripts or Jenkins files. Only
the validation logic would live there - what gets validated would be declared
in the pom.xml of each IOIT. All necessary validation configuration could be
passed in the plugin's <configuration> section [4], [5].
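
To give an idea of the shape of such a plugin (this is only a sketch of the
concept, not the code from [3]; the mojo name and parameter names are
invented for the example): a mojo bound to an early lifecycle phase reads the
rules from <configuration> and fails the build on the first violation.

import java.util.List;
import org.apache.maven.plugin.AbstractMojo;
import org.apache.maven.plugin.MojoFailureException;
import org.apache.maven.plugins.annotations.LifecyclePhase;
import org.apache.maven.plugins.annotations.Mojo;
import org.apache.maven.plugins.annotations.Parameter;

/** Illustrative sketch of a test-properties validation mojo. */
@Mojo(name = "validate-test-props", defaultPhase = LifecyclePhase.VALIDATE)
public class TestPropsValidationMojo extends AbstractMojo {

  // Picked up from -Dfilesystem=... on the command line.
  @Parameter(property = "filesystem")
  private String filesystem;

  // Picked up from -DintegrationTestRunner=... on the command line.
  @Parameter(property = "integrationTestRunner")
  private String integrationTestRunner;

  // Declared in the plugin's <configuration> section of the IOIT's pom.xml.
  @Parameter
  private List<String> allowedFilesystems;

  @Override
  public void execute() throws MojoFailureException {
    if (allowedFilesystems != null && !allowedFilesystems.contains(filesystem)) {
      // Failing here breaks the build before compilation and test execution.
      throw new MojoFailureException(String.format(
          "Filesystem '%s' cannot be used with runner '%s'.", filesystem, integrationTestRunner));
    }
  }
}

Since the mojo binds to the validate phase, the check happens before anything
is compiled or run.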

Pros:
 + input parameter validation runs very early - before compilation - so we
don't waste time waiting for the tests to fail
 + validation can be run depending on the profile - we can append the plugin
to the profile's build section (e.g. to the io-it profile's build section)
 + validation runs once - after the first failure an exception is thrown
which breaks the build (in case more than one test is run)
 + (AFAIK) we could support multiple SDKs with this plugin in the future -
the only condition is being able to run the Maven plugin in some initial
build phase
 + (AFAIK) we could run it with Gradle or adapt it to be a Gradle task too
 + every IO can have its set of validation rules defined in its pom.xml,
which is readable and clearly shows which combinations are allowed and which
are not

Cons:
- every time a new filesystem/runner etc. is supported, we have to update
the rules
- the plugin needs to be a separate module

Concerns:
 - is validating all four of the points above a good idea? Perhaps the scope
of what the plugin validates is too wide? As far as I know, validation of
runner-specific PipelineOptions is already available.

What do you think?

Thanks,
Łukasz

[1] https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#heading=h.84fcpdbcdqu
[2] https://issues.apache.org/jira/browse/BEAM-3060
[3] https://github.com/lgajowy/beam/blob/validator-poc/.test-infra/testprops/src/main/java/org/apache/beam/testprops/TestPropsPlugin.java
[4] https://github.com/lgajowy/beam/blob/validator-poc/sdks/java/io/jdbc/pom.xml#L46
[5] https://github.com/lgajowy/beam/blob/validator-poc/sdks/java/io/file-based-io-tests/pom.xml#L47
