Hey Stephen,

That seems perfect!

One more thing, more about software design: maybe you could add to the guide what has been discussed on the ML about standardizing the use of:

- IOService interface in UT and IT,

- implementations EmbeddedIOService and MockIOService for UT

- implementation RealIOService for IT (name proposal)

if we all agree on these points. Maybe it requires some more discussion (which methods go in the interface, whether the almost-passthrough implementations - EmbeddedIOService, RealIOService - are really needed, ...).
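
As a starting point for that discussion, here is a minimal sketch of what I have in mind; the method names and signatures are only assumptions, nothing agreed yet:

import java.io.IOException;
import java.util.List;

/** Test-facing abstraction over the data store used by an IO's UT and IT. */
public interface IOService {

  /** Loads the given test documents into the backing store. */
  void insertTestDocuments(List<String> jsonDocuments) throws IOException;

  /** Returns the number of documents currently stored. */
  long countDocuments() throws IOException;

  /** Removes every document, leaving the store empty for the next test. */
  void deleteAll() throws IOException;
}

// UT would run against EmbeddedIOService (in-process store) or MockIOService,
// IT against RealIOService (names as proposed above).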

Etienne


On 24/01/2017 at 06:47, Stephen Sisk wrote:
hey,

thanks - these are good questions/thoughts.

I am more reserved on that one regarding flakiness. IMHO, it is better to
clean in all cases.

I strongly agree that we should attempt to clean in each case, and the
system should support that. I should have stated that more firmly. As I
think about it more, you're also right that we should just not try to do
the data loading inside of the test. I amended the guidelines based on your
comments and put them in the draft "Testing IO transforms in Apache Beam"
doc that I've been working on [1].

Here's that snippet:
"""

For both types of tests (integration and performance), you'll need scripts
that set up your test data - they will be run independently of the tests
themselves.

The Integration and Perf Tests themselves:

1. Can assume the data load script has been run before the test

2. Must work if they are run multiple times without the data load script
being run in between (i.e., they should clean up after themselves or use
namespacing such that tests don't interfere with one another)

3. Read tests must not load data or clean data

4. Write tests must use a different storage location than read tests (via
namespaces/table names/etc., for example) and, if possible, clean it up after
each test.
"""

Any other comments?

Stephen

[1]
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m

On Mon, Jan 23, 2017 at 5:19 AM Etienne Chauchot <echauc...@gmail.com>
wrote:

Hi Stephen,

My comments are inline

On 19/01/2017 at 20:32, Stephen Sisk wrote:
I definitely agree that sharing resources between tests is more efficient.

Etienne - do you think it's necessary to separate the IT from the data
loading script?
Actually, I see the separation between the IT and the loading script more as
an improvement (time- and resource-effective) than as a necessity. Indeed, for
now, loading in the ES IT is done within the IT itself (see
https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)

The postgres/JdbcIOIT can use the natural namespacing of
tables and I feel pretty comfortable that will work well over time.
You mean using the same table name in different namespaces? But IMHO, that is
still the "using another place" approach I mentioned: the read IT and the
write IT could use the same table name in different namespaces.
You haven't explicitly mentioned it, but I'm assuming that Elasticsearch
doesn't allow such namespacing, so that's why you're having to do the
separation?
Actually, in ES there is no notion of a namespace, but there is the index
name. The index is the document-storing entity that gets split (sharded), and
the document type is more like a class definition for the documents. So
basically, the read IT could use readIndex.docType and the write IT
writeIndex.docType.
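
For instance (just a sketch - the index names and the host are made up, and the cleanup goes through the plain Elasticsearch REST API, not through any Beam code):

import java.net.HttpURLConnection;
import java.net.URL;

public class EsTestIndexes {
  static final String READ_INDEX = "beam_read_it";   // loaded once by the external script, never cleaned by tests
  static final String WRITE_INDEX = "beam_write_it"; // owned and cleaned by the write IT

  /** Drops the write index after a write test, leaving the read index untouched. */
  static void dropWriteIndex(String esHost) throws Exception {
    URL url = new URL("http://" + esHost + ":9200/" + WRITE_INDEX);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("DELETE");
    conn.getResponseCode(); // 200 if deleted, 404 if the index was never created
    conn.disconnect();
  }
}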
I'm not trying to discourage separating data load from IT, just
wondering whether it's truly necessary.
IMHO, it is more of an optimization, as I mentioned.
I was trying to consolidate what we've discussed down to a few guidelines.
I think those are that IO ITs:
1. Can assume the data load script has been run before the test (unless
the
data load script is run by the test itself)
I agree.
2. Must work if they are run multiple times without the data load script
being run in between (ie, they should clean up after themselves or use
namespacing such that tests don't interfere with one another)
Yes, sure
3. Tests that generate large amounts of data will attempt to clean up after
themselves (i.e., if you just write 100 rows, don't worry about it - if you
write 5 GB of data, you'd need to clean up). We will not assume this will
always succeed in cleaning up, but my assumption is that if a particular
data store gets into a bad state, we'll just destroy/recreate that
particular data store.
I am more reserved on that one regarding flakiness. IMHO, it is better to
clean in all cases. I mentioned in a thread that sharding in the data store
might change depending on data volume (it is not the case for ES because the
sharding is defined by configuration), or a shard/partition in the data store
can become so big that it will be split further by the IO. Imagine a test that
writes 100 rows, does no cleanup, and is run 1 000 times: the storage entity
becomes bigger and bigger, and it might then be split into more bundles than
asserted in the split tests (either by decision of the data store or because
desiredBundleSize is small).
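
To put rough numbers on it (all values are made up, just to show the mechanism):

public class SplitFlakinessExample {
  public static void main(String[] args) {
    long desiredBundleSizeBytes = 10L * 1024 * 1024;  // 10 MB per bundle
    long loadedDataBytes = 100L * 1024 * 1024;        // the reference data set -> ~10 bundles asserted
    long leakPerRunBytes = 1L * 1024 * 1024;          // ~1 MB left behind by each un-cleaned run

    System.out.println("bundles at first run: "
        + loadedDataBytes / desiredBundleSizeBytes);
    System.out.println("bundles after 1 000 un-cleaned runs: "
        + (loadedDataBytes + 1000 * leakPerRunBytes) / desiredBundleSizeBytes);
    // The second number keeps growing, so any assertion on the split count
    // eventually breaks even though the IO code did not change.
  }
}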
If the tests follow those assumptions, then that should support all the
scenarios I can think of: running data store create + data load script
occasionally (say, once a week or month) all the way up to running them
once per test run (if we decided to go that far.)
Yes, but do we choose to enforce a standard way of coding integration
tests, such as:
- loading data is done by an exterior loading script
- read tests: do not load data, do not clean data
- write tests: use another storage place than read tests (using a
namespace for example) and clean it after each test
?

Etienne
S

On Wed, Jan 18, 2017 at 7:57 AM Etienne Chauchot <echauc...@gmail.com>
wrote:

Hi,

Yes, thanks all for these clarifications about testing architecture.

I agree that points 1 and 2 should be shared between tests as much as
possible. In particular, sharing data loading between tests is more
time- and resource-effective: tests that need data (testRead,
testSplit, ...) save the loading time, the wait for asynchronous
indexing, and the cleaning time. Just a small comment:

If we share the data loading between tests, then tests that expect an
empty dataset (testWrite, ...) obviously cannot clear the shared dataset.

So they will need to write to a dedicated place (other than read tests)
and clean it afterwards.

I will update the Elasticsearch read IT
(https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)
so it does not do data loading/cleaning, and the write IT to use another
location than the read IT.

Etienne

On 18/01/2017 at 13:47, Jean-Baptiste Onofré wrote:
Hi guys,

First, great e-mail Stephen: a complete and detailed proposal.

Lukasz raised a good point: it makes sense to be able to leverage the
same "bootstrap" script.

We discussed providing the following in each IO:
1. code to load data (java, script, whatever)
2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
3. actual integration tests

Only 3 is specific to the IO: 1 and 2 can be the same whether we run
integration tests for the Python SDK or for the Java SDK.

However, 3 may depend on 1 and 2 (the integration tests perform assertions
based on the loaded data, for instance).
Today, correct me if I'm wrong, but 1 and 2 will be executed by hand
or by Jenkins using a "description" of where the code and script are
located.

So, I think we can put 1 and 2 in the IO and use a "descriptor" to
do the bootstrapping.

Regards
JB

On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
Since docker containers can run a script on startup, can we embed the
initial data set into that script/container build so that the same docker
container and initial data set can be used across multiple ITs? For
example, if Python and Java both have JdbcIO, it would be nice if they
could leverage the same docker container with the same data set to ensure
the same pipeline produces the same results.

This would be different from embedding the data in the specific IT
implementation and would also create a coupling between ITs from
potentially multiple languages.

On Tue, Jan 17, 2017 at 4:27 PM, Stephen Sisk <s...@google.com.invalid>
wrote:

Hi all!

As I've discussed previously on this list[1], ensuring that we have
high-quality IO Transforms is important to Beam. We want to do this without
adding too much burden on developers wanting to contribute. Below I have a
concrete proposal for what an IO integration test would look like and an
example integration test[4] that meets those requirements.

Proposal: we should require that an IO transform includes a passing
integration test showing the IO can connect to a real instance of the data
store. We still want/expect comprehensive unit tests on an IO transform,
but we would allow check-ins with just some unit tests in the presence of
an IT.

To support that, we'll require the following pieces associated with
an IT:

1. A Dockerfile that can be used to create a running instance of the data
store. We've previously discussed on this list that we would use docker
images running inside kubernetes or mesos[2], and I'd prefer having a
kubernetes/mesos script to start a given data store, but for a single-instance
data store, we can take a Dockerfile and use it to create a simple
kubernetes/mesos app. If you have questions about how maintaining the
containers would work long term, check [2], where I discussed a detailed plan.

2. Code to load test data into the data store created by #1. This needs to be
self-contained. For now, the easiest way to do this would be to have the code
inside of the IT.

3. The IT. I propose keeping this inside the same module as the IO
transform itself, since having all the IO transform ITs in one module would
mean there may be conflicts between different data stores' dependencies.
Integration tests will need connection information pointing to the data
store they are testing. As discussed previously on this list[3], they should
receive that connection information via TestPipelineOptions.
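
For instance, such an options interface could look roughly like this (the interface name, fields, and defaults are illustrative only):

import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.testing.TestPipelineOptions;

/** Connection information for the data store under test, passed on the command line. */
public interface ExampleIOTestOptions extends TestPipelineOptions {
  @Description("Host name or IP of the data store started by the kubernetes/mesos script")
  @Default.String("localhost")
  String getServerName();
  void setServerName(String value);

  @Description("Port the data store listens on")
  @Default.Integer(5432)
  Integer getPort();
  void setPort(Integer value);
}

The IT would then pick it up with something like
TestPipeline.testingPipelineOptions().as(ExampleIOTestOptions.class).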

I'd like to get something up and running soon so people checking in new IO
transforms can start taking advantage of an IT framework. Thus, there are a
couple of simplifying assumptions in this plan. Pieces of the plan that I
anticipate will evolve:

1. The test data load script - we would like to write these in a
uniform
way and especially ensure that the test data is cleaned up after the
tests
run.

2. Spinning up/down instances - for now, we'd likely need to do this
manually. It'd be good to get an automated process for this. That's
especially critical for performance tests with multiple nodes -
there's no
need to keep instances running for that.

Integrating more closely with PKB would be a good way to do both of these
things, but first let's focus on getting some basic ITs running.

As a concrete example of this proposal, I've written the JDBC IO IT [4].
JdbcIOTest already did a lot of test setup, so I heavily reused it. The
key pieces:

* The integration test is in JdbcIOIT.

* JdbcIOIT reads the TestPipelineOptions defined in
PostgresTestOptions. We
may move the TestOptions files into a common place so they can be
shared
between tests.

* Test data is created/cleaned up inside of the IT.

* kubernetes/mesos scripts - I have provided examples of both under the
"jdbc/src/test/resources" directory, but I'd like us to decide as a
project
which container orchestration service we want to use - I'll send
mail about
that shortly.
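
To give an idea of the "test data is created/cleaned up inside of the IT" piece, here is a simplified sketch (table name, columns, and row count are made up; this is not the actual JdbcIOIT code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class TestDataSetup {
  /** Creates the test table and loads a small, deterministic data set before the tests run. */
  static void loadTestData(String jdbcUrl, String user, String password) throws Exception {
    try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
      try (Statement stmt = conn.createStatement()) {
        stmt.executeUpdate(
            "CREATE TABLE IF NOT EXISTS beam_it_data (id INT, name VARCHAR(64))");
      }
      try (PreparedStatement insert =
              conn.prepareStatement("INSERT INTO beam_it_data VALUES (?, ?)")) {
        for (int i = 0; i < 1000; i++) {
          insert.setInt(1, i);
          insert.setString(2, "row-" + i);
          insert.addBatch();
        }
        insert.executeBatch();
      }
    }
    // Cleanup would be the mirror image (DROP TABLE IF EXISTS beam_it_data)
    // in an @AfterClass method, as in the write-test sketch earlier in the thread.
  }
}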

thanks!
Stephen

[1] Integration Testing Sources
https://lists.apache.org/thread.html/518d78478ae9b6a56d6a690033071aa6e3b817546499c4f0f18d247d@%3Cdev.beam.apache.org%3E

[2] Container Orchestration software for hosting data stores
https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E

[3] Some Thoughts on IO Integration Tests
https://lists.apache.org/thread.html/637803ccae9c9efc0f4ed01499f1a0658fa73e761ab6ff4e8fa7b469@%3Cdev.beam.apache.org%3E

[4] JDBC IO IT using postgres
https://github.com/ssisk/beam/tree/io-testing/sdks/java/io/jdbc - it has not
been reviewed yet, so it may contain code errors, but it does run & pass :)

