hey, thanks - these are good questions/thoughts.
> I am more reserved on that one regarding flakiness. IMHO, it is better to clean in all cases.

I strongly agree that we should attempt to clean in each case, and the system should support that. I should have stated that more firmly. As I think about it more, you're also right that we should just not try to do the data loading inside of the test. I amended the guidelines based on your comments and put them in the draft "Testing IO transforms in Apache Beam" doc that I've been working on [1]. Here's that snippet:

"""
For both types of tests (integration and performance), you'll need to have scripts that set up your test data - they will be run independently of the tests themselves.

The Integration and Perf Tests themselves:
1. Can assume the data load script has been run before the test
2. Must work if they are run multiple times without the data load script being run in between (i.e., they should clean up after themselves or use namespacing such that tests don't interfere with one another)
3. Read tests must not load data or clean data
4. Write tests must use another storage location than read tests (using namespaces/table names/etc., for example) and if possible clean it after each test.
"""
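To make guidelines 1-4 concrete, here's a minimal sketch of a read/write IT pair that follows them. It's illustrative only - the connection string, table names, and row count are hypothetical, and it uses plain JDBC rather than the actual JdbcIOIT:

    import static org.junit.Assert.assertEquals;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.junit.Test;

    public class ExampleDatabaseIOIT {
      // Hypothetical connection string, for illustration only.
      private static final String DB_URL = "jdbc:postgresql://localhost/beam_it";
      // Populated once by the external data load script; read tests never touch it.
      private static final String READ_TABLE = "beam_it_read";
      // Owned by write tests - a separate storage location, per guideline 4.
      private static final String WRITE_TABLE = "beam_it_write";

      @Test
      public void testRead() throws Exception {
        // Guidelines 1 and 3: assume the load script already ran; do not load
        // or clean data here.
        try (Connection conn = DriverManager.getConnection(DB_URL);
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM " + READ_TABLE)) {
          rs.next();
          assertEquals(1000, rs.getLong(1)); // size of the pre-loaded data set
        }
      }

      @Test
      public void testWrite() throws Exception {
        try (Connection conn = DriverManager.getConnection(DB_URL);
            Statement stmt = conn.createStatement()) {
          try {
            // ... run the write pipeline against WRITE_TABLE and verify it ...
          } finally {
            // Guideline 2: clean in all cases, even on failure, so the test
            // can be run repeatedly without leftovers interfering.
            stmt.execute("DELETE FROM " + WRITE_TABLE);
          }
        }
      }
    }

The important property is the finally block: the write test cleans its own location even when an assertion fails, so repeated runs always start from the same state.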
Any other comments?

Stephen

[1] https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m

On Mon, Jan 23, 2017 at 5:19 AM Etienne Chauchot <[email protected]> wrote:

Hi Stephen,

My comments are inline.

Le 19/01/2017 à 20:32, Stephen Sisk a écrit :

> I definitely agree that sharing resources between tests is more efficient.
>
> Etienne - do you think it's necessary to separate the IT from the data loading script?

Actually, I see the separation between the IT and the loading script more as an improvement (time- and resource-effective) than as a necessity. Indeed, for now, loading in the ES IT is done within the IT itself (see https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT).

> The postgres/JdbcIOIT can use the natural namespacing of tables and I feel pretty comfortable that will work well over time.

You mean using the same table name in different namespaces? But IMHO, that is still the "using another place" that I mentioned; the read IT and the write IT could use the same table name in different namespaces.

> You haven't explicitly mentioned it, but I'm assuming that elasticsearch doesn't allow such namespacing, so that's why you're having to do the separation?

Actually, in ES there is no notion of a namespace, but there is the index name. The index is the document-storing entity that gets split, and the document type is more like a class definition for the document. So basically, we could have the read IT use readIndex.docType and the write IT use writeIndex.docType.

> I'm not trying to discourage separating data load from IT, just wondering whether it's truly necessary.

IMHO, it's more of an optimization, as I mentioned.

> I was trying to consolidate what we've discussed down to a few guidelines. I think those are that IO ITs:
>
> 1. Can assume the data load script has been run before the test (unless the data load script is run by the test itself)

I agree.

> 2. Must work if they are run multiple times without the data load script being run in between (i.e., they should clean up after themselves or use namespacing such that tests don't interfere with one another)

Yes, sure.

> 3. Tests that generate large amounts of data will attempt to clean up after themselves. (i.e., if you just write 100 rows, don't worry about it - if you write 5 GB of data, you'd need to clean up.) We will not assume this will always succeed in cleaning up, but my assumption is that if a particular data store gets into a bad state, we'll just destroy/recreate that particular data store.

I am more reserved on that one regarding flakiness. IMHO, it is better to clean in all cases. I mentioned in a thread that sharding in the datastore might change depending on data volume (it is not the case for ES, where the sharding is defined by configuration), or a shard/partition in the datastore can become so big that it will be split further by the IO. Imagine that a test that writes 100 rows does not do cleanup and is run 1,000 times: the storage entity becomes bigger and bigger, and it might then be split into more bundles than asserted in split tests (either by decision of the datastore or because desiredBundleSize is small).
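To sketch the failure mode described above (the createSource() hook is hypothetical, and this assumes Beam's current BoundedSource.splitIntoBundles signature):

    import static org.junit.Assert.assertEquals;

    import java.util.List;
    import org.apache.beam.sdk.io.BoundedSource;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.junit.Test;

    public abstract class ExampleSplitIT {
      /** Hypothetical hook: builds the IO's source, pointed at the shared read location. */
      protected abstract BoundedSource<String> createSource();

      @Test
      public void testSplit() throws Exception {
        PipelineOptions options = PipelineOptionsFactory.create();
        List<? extends BoundedSource<String>> bundles =
            createSource().splitIntoBundles(64 * 1024 * 1024, options);
        // If earlier runs leaked data into this location, the stored data grows
        // with every run, and eventually the datastore (or the IO) produces more
        // bundles than expected - a flaky failure unrelated to the code under test.
        assertEquals(10, bundles.size());
      }
    }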
> If the tests follow those assumptions, then that should support all the scenarios I can think of: running data store create + data load script occasionally (say, once a week or month) all the way up to running them once per test run (if we decided to go that far.)

Yes, but do we choose to enforce a standard way of coding integration tests such as:
- loading data is done by an exterior loading script
- read tests: do not load data, do not clean data
- write tests: use another storage place than read tests (using a namespace, for example) and clean it after each test
?

Etienne

> S
>
> On Wed, Jan 18, 2017 at 7:57 AM Etienne Chauchot <[email protected]> wrote:
>
> Hi,
>
> Yes, thanks all for these clarifications about the testing architecture.
>
> I agree that points 1 and 2 should be shared between tests as much as possible. Especially sharing data loading between tests is more time-effective and resource-effective: tests that need data (testRead, testSplit, ...) will save the loading time, the wait for asynchronous indexation, and the cleaning time. Just a small comment:
>
> If we share the data loading between tests, then tests that expect an empty dataset (testWrite, ...) obviously cannot clear the shared dataset.
>
> So they will need to write to a dedicated place (other than the read tests) and clean it afterwards.
>
> I will update the ElasticSearch read IT (https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT) so that it does not do data loading/cleaning, and the write IT to use another location than the read IT.
>
> Etienne
>
> Le 18/01/2017 à 13:47, Jean-Baptiste Onofré a écrit :
>> Hi guys,
>>
>> First, great e-mail Stephen: complete and detailed proposal.
>>
>> Lukasz raised a good point: it makes sense to be able to leverage the same "bootstrap" script.
>>
>> We discussed providing the following in each IO:
>> 1. code to load data (java, script, whatever)
>> 2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
>> 3. actual integration tests
>>
>> Only 3 is specific to the IO: 1 and 2 can be the same whether we run integration tests for the Python or the Java SDK.
>>
>> However, 3 may depend on 1 and 2 (the integration tests perform some assertions based on the loaded data, for instance).
>>
>> Today, correct me if I'm wrong, but 1 and 2 will be executed by hand or by Jenkins using a "description" of where the code and script are located.
>>
>> So, I think that we can put 1 and 2 in the IO and use a "descriptor" to do the bootstrapping.
>>
>> Regards
>> JB
>>
>> On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
>>> Since docker containers can run a script on startup, can we embed the initial data set into that script/container build so that the same docker container and initial data set can be used across multiple ITs? For example, if Python and Java both have JdbcIO, it would be nice if they could leverage the same docker container with the same data set to ensure the same pipeline produces the same results.
>>>
>>> This would be different from embedding the data in the specific IT implementation and would also create a coupling between ITs from potentially multiple languages.
>>>
>>> On Tue, Jan 17, 2017 at 4:27 PM, Stephen Sisk <[email protected]> wrote:
>>>
>>>> Hi all!
>>>>
>>>> As I've discussed previously on this list [1], ensuring that we have high-quality IO transforms is important to Beam. We want to do this without adding too much burden on developers wanting to contribute. Below I have a concrete proposal for what an IO integration test would look like, and an example integration test [4] that meets those requirements.
>>>>
>>>> Proposal: we should require that an IO transform include a passing integration test showing the IO can connect to a real instance of the data store. We still want/expect comprehensive unit tests on an IO transform, but we would allow check-ins with just some unit tests in the presence of an IT.
>>>>
>>>> To support that, we'll require the following pieces associated with an IT:
>>>>
>>>> 1. A Dockerfile that can be used to create a running instance of the data store. We've previously discussed on this list that we would use docker images running inside kubernetes or mesos [2], and I'd prefer having a kubernetes/mesos script to start a given data store, but for a single-instance data store, we can take a dockerfile and use it to create a simple kubernetes/mesos app. If you have questions about how maintaining the containers long term would work, check [2], as I discussed a detailed plan there.
>>>>
>>>> 2. Code to load test data on the data store created by #1. Needs to be self-contained. For now, the easiest way to do this would be to have code inside of the IT.
>>>>
>>>> 3. The IT itself. I propose keeping this inside the same module as the IO transform, since having all the IO transform ITs in one module would mean there may be conflicts between different data stores' dependencies. Integration tests will need connection information pointing to the data store being tested. As discussed previously on this list [3], they should receive that connection information via TestPipelineOptions (see the sketch below).
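As a sketch of what such options could look like - the interface and property names are illustrative, not the actual PostgresTestOptions, while TestPipelineOptions, @Description, and @Default are Beam's existing options machinery:

    import org.apache.beam.sdk.options.Default;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.testing.TestPipelineOptions;

    /** Illustrative only - connection options for a Postgres-backed IT. */
    public interface ExamplePostgresTestOptions extends TestPipelineOptions {
      @Description("Host name of the Postgres instance under test")
      @Default.String("localhost")
      String getPostgresServerName();
      void setPostgresServerName(String value);

      @Description("Database containing the test tables")
      String getPostgresDatabaseName();
      void setPostgresDatabaseName(String value);
    }

PipelineOptionsFactory generates the implementation, so the IT only declares getters/setters and reads the values at runtime.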
>>>> I'd like to get something up and running soon so people checking in new IO transforms can start taking advantage of an IT framework. Thus, there are a couple of simplifying assumptions in this plan. Pieces of the plan that I anticipate will evolve:
>>>>
>>>> 1. The test data load script - we would like to write these in a uniform way and especially ensure that the test data is cleaned up after the tests run.
>>>>
>>>> 2. Spinning up/down instances - for now, we'd likely need to do this manually. It'd be good to get an automated process for this. That's especially critical for performance tests with multiple nodes - there's no need to keep instances running for that.
>>>>
>>>> Integrating more closely with PKB would be a good way to do both of these things, but first let's focus on getting some basic ITs running.
>>>>
>>>> As a concrete example of this proposal, I've written the JDBC IO IT [4]. JdbcIOTest already did a lot of test setup, so I heavily re-used it. The key pieces:
>>>>
>>>> * The integration test is in JdbcIOIT.
>>>> * JdbcIOIT reads the TestPipelineOptions defined in PostgresTestOptions. We may move the TestOptions files into a common place so they can be shared between tests.
>>>> * Test data is created/cleaned up inside of the IT (sketched below).
>>>> * kubernetes/mesos scripts - I have provided examples of both under the "jdbc/src/test/resources" directory, but I'd like us to decide as a project which container orchestration service we want to use - I'll send mail about that shortly.
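A sketch of that create/clean-up-inside-the-IT pattern, reusing the hypothetical options interface sketched earlier (the table schema and class name are illustrative, not JdbcIOIT's actual code):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.junit.AfterClass;
    import org.junit.BeforeClass;

    public class ExampleJdbcIOITSetup {
      private static Connection connection;

      @BeforeClass
      public static void setup() throws Exception {
        // Read connection information from the (hypothetical) options interface;
        // the real IT reads PostgresTestOptions instead.
        PipelineOptionsFactory.register(ExamplePostgresTestOptions.class);
        ExamplePostgresTestOptions options =
            TestPipeline.testingPipelineOptions().as(ExamplePostgresTestOptions.class);
        connection = DriverManager.getConnection(
            "jdbc:postgresql://" + options.getPostgresServerName()
                + "/" + options.getPostgresDatabaseName());
        try (Statement stmt = connection.createStatement()) {
          // Create and populate the table the pipeline under test will read.
          stmt.executeUpdate(
              "CREATE TABLE IF NOT EXISTS beam_it_data (id INT, name VARCHAR(50))");
          stmt.executeUpdate("INSERT INTO beam_it_data VALUES (1, 'alpha'), (2, 'beta')");
        }
      }

      @AfterClass
      public static void teardown() throws Exception {
        // Drop the table in all cases so repeated runs start from the same state.
        try (Statement stmt = connection.createStatement()) {
          stmt.executeUpdate("DROP TABLE IF EXISTS beam_it_data");
        } finally {
          connection.close();
        }
      }
    }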
>>>> thanks!
>>>> Stephen
>>>>
>>>> [1] Integration Testing Sources
>>>> https://lists.apache.org/thread.html/518d78478ae9b6a56d6a690033071aa6e3b817546499c4f0f18d247d@%3Cdev.beam.apache.org%3E
>>>>
>>>> [2] Container Orchestration software for hosting data stores
>>>> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>>>>
>>>> [3] Some Thoughts on IO Integration Tests
>>>> https://lists.apache.org/thread.html/637803ccae9c9efc0f4ed01499f1a0658fa73e761ab6ff4e8fa7b469@%3Cdev.beam.apache.org%3E
>>>>
>>>> [4] JDBC IO IT using postgres
>>>> https://github.com/ssisk/beam/tree/io-testing/sdks/java/io/jdbc - it has not been reviewed yet, so it may contain code errors, but it does run & pass :)