Hi,

Yes, thanks all for these clarifications about testing architecture.

I agree that points 1 and 2 should be shared between tests as much as possible. In particular, sharing data loading between tests is more time- and resource-effective: tests that need data (testRead, testSplit, ...) will save the loading time, the wait for asynchronous indexing, and the cleanup time. Just a small comment:

If we share the data loading between tests, then tests that expect an empty dataset (testWrite, ...) obviously cannot clear the shared dataset.

So they will need to write to a dedicated location (separate from the one used by the read tests) and clean it up afterwards.
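
To make that concrete, here is a minimal sketch of what the write-side isolation could look like, assuming JUnit 4; the class name, index name, and the deleteIndex helper are hypothetical illustrations, not code from the branch:

    import java.util.UUID;
    import org.junit.AfterClass;
    import org.junit.Test;

    public class ElasticsearchIOWriteIT {

      // Hypothetical: a unique per-run index so the write tests never
      // touch the shared dataset that the read tests depend on.
      private static final String WRITE_INDEX =
          "beam-write-it-" + UUID.randomUUID();

      @Test
      public void testWrite() {
        // ... run the write pipeline against WRITE_INDEX and assert
        // on the documents it produced ...
      }

      @AfterClass
      public static void cleanUp() {
        // Hypothetical helper: drops only the dedicated write index,
        // leaving the shared read dataset untouched.
        ElasticsearchTestHelper.deleteIndex(WRITE_INDEX);
      }
    }

This keeps the shared dataset effectively read-only while the write tests stay independent of each other across runs.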

I will update the Elasticsearch read IT (https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT) so that it no longer does data loading/cleaning, and the write IT so that it uses a different location than the read IT.

Etienne

On 18/01/2017 at 13:47, Jean-Baptiste Onofré wrote:
Hi guys,

First, great e-mail, Stephen: a complete and detailed proposal.

Lukasz raised a good point: it makes sense to be able to leverage the same "bootstrap" script.

We discussed providing the following in each IO:
1. code to load data (java, script, whatever)
2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
3. actual integration tests

Only 3 is SDK-specific: 1 and 2 can be the same whether we run integration tests for the Python SDK or for the Java SDK.

However, 3 may depend on 1 and 2 (the integration tests perform assertions based on the loaded data, for instance). Today, correct me if I'm wrong, but 1 and 2 will be executed by hand or by Jenkins using a "description" of where the code and script are located.

So, I think that we can put 1 and 2 in the IO and use a "descriptor" to do the bootstrapping.

Regards
JB

On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
Since docker containers can run a script on startup, can we embed the
initial data set into that script/container build so that the same docker
container and initial data set can be used across multiple ITs? For
example, if Python and Java both have JdbcIO, it would be nice if they
could leverage the same docker container with the same data set to ensure
the same pipeline produces the same results.

This would be different from embedding the data in the specific IT
implementation and would also create a coupling between ITs from
potentially multiple languages.

On Tue, Jan 17, 2017 at 4:27 PM, Stephen Sisk <s...@google.com.invalid>
wrote:

Hi all!

As I've discussed previously on this list[1], ensuring that we have
high-quality IO transforms is important to Beam. We want to do this
without adding too much burden on developers wanting to contribute. Below
is a concrete proposal for what an IO integration test would look like,
and an example integration test[4] that meets those requirements.

Proposal: we should require that an IO transform include a passing
integration test showing the IO can connect to a real instance of the
data store. We still want/expect comprehensive unit tests on an IO
transform, but we would allow check-ins with just some unit tests in the
presence of an IT.

To support that, we'll require the following pieces associated with an IT:

1. Dockerfile that can be used to create a running instance of the data
store. We've previously discussed on this list that we would use docker
images running inside kubernetes or mesos[2], and I'd prefer having a
kubernetes/mesos script to start a given data store, but for a
single-instance data store, we can take a dockerfile and use it to create
a simple kubernetes/mesos app. If you have questions about how
maintaining the containers long term would work, check [2], where I
discussed a detailed plan.

2. Code to load test data on the data store created by #1. This needs to
be self-contained. For now, the easiest way to do this would be to have
the code inside of the IT (a concrete sketch appears further below, with
the JdbcIOIT example).

3. The IT. I propose keeping this inside the same module as the IO
transform itself, since having all the IO transform ITs in one module
would mean there may be conflicts between different data stores'
dependencies. Integration tests will need connection information pointing
to the data store they are testing. As discussed previously on this
list[3], they should receive that connection information via
TestPipelineOptions, as sketched below.
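
For illustration, such an options interface could look like the following
sketch. It only shows the general shape, assuming Beam's usual
PipelineOptions conventions; the option names and defaults are
illustrative, not the actual PostgresTestOptions code in [4]:

    import org.apache.beam.sdk.options.Default;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.testing.TestPipelineOptions;

    public interface PostgresTestOptions extends TestPipelineOptions {

      @Description("Host of the data store instance under test")
      @Default.String("localhost")
      String getPostgresHost();
      void setPostgresHost(String value);

      @Description("Port of the data store instance under test")
      @Default.Integer(5432)
      Integer getPostgresPort();
      void setPostgresPort(Integer value);
    }

The IT can then obtain these options via
TestPipeline.testingPipelineOptions().as(PostgresTestOptions.class), so
the connection information stays outside the test code itself.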

I'd like to get something up and running soon so people checking in new
IO transforms can start taking advantage of an IT framework. Thus, there
are a couple of simplifying assumptions in this plan. Pieces of the plan
that I anticipate will evolve:

1. The test data load script - we would like to write these in a uniform way and especially ensure that the test data is cleaned up after the tests
run.

2. Spinning up/down instances - for now, we'd likely need to do this
manually. It'd be good to get an automated process for this. That's
especially critical for performance tests with multiple nodes - there's no
need to keep instances running for that.

Integrating closer with PKB would be a good way to do both of these things,
but first let's focus on getting some basic ITs running.

As a concrete example of this proposal, I've written the JDBC IO IT[4].
JdbcIOTest already did a lot of test setup, so I heavily reused it. The
key pieces:

* The integration test is in JdbcIOIT.

* JdbcIOIT reads the TestPipelineOptions defined in PostgresTestOptions. We may move the TestOptions files into a common place so they can be shared
between tests.

* Test data is created/cleaned up inside of the IT (see the sketch after
this list).

* kubernetes/mesos scripts - I have provided examples of both under the
"jdbc/src/test/resources" directory, but I'd like us to decide as a project which container orchestration service we want to use - I'll send mail about
that shortly.
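
To illustrate the data creation/cleanup bullet above, here is a minimal
sketch, assuming JUnit 4, the Postgres JDBC driver on the test classpath,
and the hypothetical PostgresTestOptions shape sketched earlier; the
table name, schema, and credentials handling are illustrative, not the
actual JdbcIOIT code:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.junit.AfterClass;
    import org.junit.BeforeClass;

    public class JdbcIOIT {

      // Hypothetical table name; a real IT may want a unique per-run
      // name to avoid collisions between concurrent runs.
      private static final String TABLE = "beam_it_data";

      @BeforeClass
      public static void loadTestData() throws Exception {
        try (Connection conn = connect();
            Statement stmt = conn.createStatement()) {
          stmt.executeUpdate(
              "CREATE TABLE " + TABLE + " (id INT, name VARCHAR)");
          stmt.executeUpdate(
              "INSERT INTO " + TABLE + " VALUES (1, 'a'), (2, 'b')");
        }
      }

      @AfterClass
      public static void cleanUpTestData() throws Exception {
        try (Connection conn = connect();
            Statement stmt = conn.createStatement()) {
          stmt.executeUpdate("DROP TABLE IF EXISTS " + TABLE);
        }
      }

      private static Connection connect() throws Exception {
        // Connection info comes from TestPipelineOptions, as above;
        // credentials are omitted here for brevity.
        PostgresTestOptions options = TestPipeline
            .testingPipelineOptions().as(PostgresTestOptions.class);
        return DriverManager.getConnection(String.format(
            "jdbc:postgresql://%s:%d/postgres",
            options.getPostgresHost(), options.getPostgresPort()));
      }
    }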

thanks!
Stephen

[1] Integration Testing Sources
https://lists.apache.org/thread.html/518d78478ae9b6a56d6a690033071aa6e3b817546499c4f0f18d247d@%3Cdev.beam.apache.org%3E

[2] Container Orchestration software for hosting data stores
https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E

[3] Some Thoughts on IO Integration Tests
https://lists.apache.org/thread.html/637803ccae9c9efc0f4ed01499f1a0658fa73e761ab6ff4e8fa7b469@%3Cdev.beam.apache.org%3E

[4] JDBC IO IT using postgres
https://github.com/ssisk/beam/tree/io-testing/sdks/java/io/jdbc - it has
not been reviewed yet, so it may contain code errors, but it does run &
pass :)



