hi JB! "IO Writing Guide" sounds like BEAM-1025 (user guide: "How to create Beam IO Transforms"), which I've been working on. Let me pull together what I have into a draft that folks can take a look at. I had an earlier draft that was more focused on sources/sinks, but since we're moving away from those, I started a rewrite. I'll aim to share a draft by the end of the week.
There's also a section about fakes in the testing doc:
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.cykbne9o4iv

Sorry the testing doc and the "how to create" user guide have sat in draft form for a while; I've wanted to finish up the integration testing environment for IOs first.

S

On Wed, Jan 25, 2017 at 8:52 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi

It's what I mentioned in a previous email, yup. It should refer to an "IO Writing Guide" describing the purpose of the service interface, fakes/mocks, etc. I will tackle that in a PR.

Regards
JB

On Jan 25, 2017, at 09:54, Etienne Chauchot <echauc...@gmail.com> wrote:
>Hey Stephen,
>
>That seems perfect!
>
>Another thing, more about software design: maybe you could add to the
>guide what has been discussed on the ML about standardizing the use of:
>
>- an IOService interface in UT and IT,
>- implementations EmbeddedIOService and MockIOService for UT,
>- an implementation RealIOService for IT (name proposal),
>
>if we all agree on these points. Maybe it requires some more
>discussion (methods in the interface; whether the almost-passthrough
>implementations - EmbeddedIOService, RealIOService - are needed, ...).
>
>Etienne
>
>
>On 24/01/2017 at 06:47, Stephen Sisk wrote:
>> hey,
>>
>> thanks - these are good questions/thoughts.
>>
>>> I am more reserved on that one regarding flakiness. IMHO, it is better to
>>> clean in all cases.
>>
>> I strongly agree that we should attempt to clean in each case, and the
>> system should support that. I should have stated that more firmly. As I
>> think about it more, you're also right that we should just not try to do
>> the data loading inside of the test. I amended the guidelines based on your
>> comments and put them in the draft "Testing IO transforms in Apache Beam"
>> doc that I've been working on [1].
>>
>> Here's that snippet:
>> """
>> For both types of tests (integration and performance), you'll need to have
>> scripts that set up your test data - they will be run independently of the
>> tests themselves.
>>
>> The integration and perf tests themselves:
>>
>> 1. Can assume the data load script has been run before the test
>>
>> 2. Must work if they are run multiple times without the data load script
>> being run in between (i.e., they should clean up after themselves or use
>> namespacing such that tests don't interfere with one another)
>>
>> 3. Read tests must not load data or clean data
>>
>> 4. Write tests must use a storage location other than the read tests'
>> (using namespaces/table names/etc., for example) and, if possible, clean it
>> after each test.
>> """
>>
>> Any other comments?
>>
>> Stephen
>>
>> [1]
>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m
>>
>> On Mon, Jan 23, 2017 at 5:19 AM Etienne Chauchot <echauc...@gmail.com>
>> wrote:
>>
>> Hi Stephen,
>>
>> My comments are inline.
>>
>> On 19/01/2017 at 20:32, Stephen Sisk wrote:
>>> I definitely agree that sharing resources between tests is more efficient.
>>>
>>> Etienne - do you think it's necessary to separate the IT from the data
>>> loading script?
>> Actually, I see separation between the IT and the loading script more as an
>> improvement (time- and resource-effective) than as a necessity. Indeed,
>> for now, for example, loading in the ES IT is done within the IT (see
>> https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT).
>>
>>> The postgres/JdbcIOIT can use the natural namespacing of
>>> tables and I feel pretty comfortable that will work well over time.
>> You mean using the same table name in different namespaces? But IMHO,
>> that is still the "using another place" that I mentioned: the read IT and
>> write IT could use the same table name in different namespaces.
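[Editor's note] The namespacing discussed above (read and write ITs sharing one logical table name in separate locations) can be sketched in plain Java. The helper and naming conventions below are hypothetical, purely for illustration; Beam's actual ITs pick their own conventions:

```java
import java.util.UUID;

// Sketch: derive separate, collision-free storage locations for read and
// write ITs that share one logical table name. All names here are made up.
public class TestNamespaces {

  // Read tests use a stable, pre-loaded location: the data load script
  // populates it once, and read tests never modify it (guideline 3).
  public static String readTable(String table) {
    return "read_it." + table;
  }

  // Write tests get a unique location per run, so re-running the suite
  // without the load script still works (guideline 2), and a failed
  // cleanup never pollutes the read dataset (guideline 4).
  public static String writeTable(String table) {
    return "write_it." + table + "_" + UUID.randomUUID().toString().replace("-", "");
  }
}
```

With this shape, a read test and a write test can both refer to, say, `events` without ever touching the same physical location.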
>>> You
>>> haven't explicitly mentioned it, but I'm assuming that elasticsearch
>>> doesn't allow such namespacing, so that's why you're having to do the
>>> separation?
>> Actually, in ES there is no namespace notion, but there is the index name.
>> The index is the document-storing entity that gets split, and there is
>> the document type, which is more like a class definition for the document.
>> So basically, we could have the read IT using readIndex.docType and the
>> write IT using writeIndex.docType.
>>> I'm not trying to discourage separating data load from the IT, just
>>> wondering whether it's truly necessary.
>> IMHO, it's more of an optimization, as I mentioned.
>>> I was trying to consolidate what we've discussed down to a few guidelines.
>>> I think those are that IO ITs:
>>> 1. Can assume the data load script has been run before the test (unless
>>> the data load script is run by the test itself)
>> I agree.
>>> 2. Must work if they are run multiple times without the data load script
>>> being run in between (i.e., they should clean up after themselves or use
>>> namespacing such that tests don't interfere with one another)
>> Yes, sure.
>>> 3. Tests that generate large amounts of data will attempt to clean up
>>> after themselves. (i.e., if you just write 100 rows, don't worry about it -
>>> if you write 5 GB of data, you'd need to clean up.) We will not assume this
>>> will always succeed in cleaning up, but my assumption is that if a
>>> particular data store gets into a bad state, we'll just destroy/recreate
>>> that particular data store.
>> I am more reserved on that one regarding flakiness. IMHO, it is better
>> to clean in all cases. I mentioned in a thread that sharding in the
>> datastore might change depending on data volume (it is not the case for
>> ES because the sharding is defined by configuration), or a
>> shard/partition in the datastore can become so big that it will be split
>> more by the IO.
>> Imagine that a test that writes 100 rows does not do
>> cleanup and is run 1,000 times; then the storage entity becomes bigger
>> and bigger, and it might then be split into more bundles than asserted in
>> split tests (either by decision of the datastore or because
>> desiredBundleSize is small).
>>> If the tests follow those assumptions, then that should support all the
>>> scenarios I can think of: running data store create + data load script
>>> occasionally (say, once a week or month) all the way up to running them
>>> once per test run (if we decided to go that far).
>> Yes, but do we choose to enforce a standard way of coding integration
>> tests, such as:
>> - loading data is done by an exterior loading script
>> - read tests: do not load data, do not clean data
>> - write tests: use another storage place than read tests (using a
>> namespace, for example) and clean it after each test
>> ?
>>
>> Etienne
>>> S
>>>
>>> On Wed, Jan 18, 2017 at 7:57 AM Etienne Chauchot <echauc...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> Yes, thanks all for these clarifications about the testing architecture.
>>>
>>> I agree that points 1 and 2 should be shared between tests as much as
>>> possible. Especially, sharing data loading between tests is more
>>> time-effective and resource-effective: tests that need data (testRead,
>>> testSplit, ...) will save the loading time, the wait for asynchronous
>>> indexation, and the cleaning time. Just a small comment:
>>>
>>> If we share the data loading between tests, then tests that expect an
>>> empty dataset (testWrite, ...) obviously cannot clear the shared dataset.
>>>
>>> So they will need to write to a dedicated place (other than the read
>>> tests') and clean it afterwards.
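[Editor's note] The "clean in all cases" stance argued above is usually enforced by making cleanup unconditional in the test itself. A minimal stdlib sketch of the pattern, where a `Map` stands in for a real datastore client (nothing here is Beam API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "write tests clean their own location, even on failure".
// The Map is a stand-in for a real datastore; names are illustrative.
public class WriteItPattern {

  static final Map<String, Integer> store = new HashMap<>();

  public static int runWriteTest(String location, int rows) {
    try {
      store.put(location, rows);                   // the write under test
      if (rows < 0) {
        throw new AssertionError("bad row count"); // a failing assertion
      }
      return store.get(location);
    } finally {
      store.remove(location);                      // cleanup runs on pass AND fail
    }
  }
}
```

Because cleanup sits in `finally`, a test that fails 1,000 times still leaves the storage entity empty, avoiding the ever-growing-shards problem described above.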
>>>
>>> I will update the ElasticSearch read IT
>>> (https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)
>>> to not do data loading/cleaning, and the write IT to use another location
>>> than the read IT.
>>>
>>> Etienne
>>>
>>> On 18/01/2017 at 13:47, Jean-Baptiste Onofré wrote:
>>>> Hi guys,
>>>>
>>>> First, great e-mail Stephen: a complete and detailed proposal.
>>>>
>>>> Lukasz raised a good point: it makes sense to be able to leverage the
>>>> same "bootstrap" script.
>>>>
>>>> We discussed providing the following in each IO:
>>>> 1. code to load data (java, script, whatever)
>>>> 2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
>>>> 3. actual integration tests
>>>>
>>>> Only 3 is specific to the IO: 1 and 2 can be the same whether we run
>>>> integration tests for the Python or the Java SDK.
>>>>
>>>> However, 3 may depend on 1 and 2 (the integration tests perform some
>>>> assertions based on the loaded data, for instance).
>>>> Today, correct me if I'm wrong, but 1 and 2 will be executed by hand
>>>> or by Jenkins using a "description" of where the code and script are
>>>> located.
>>>>
>>>> So, I think that we can put 1 and 2 in the IO and use a "descriptor" to
>>>> do the bootstrapping.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
>>>>> Since docker containers can run a script on startup, can we embed the
>>>>> initial data set into that script/container build so that the same
>>>>> docker container and initial data set can be used across multiple ITs?
>>>>> For example, if Python and Java both have JdbcIO, it would be nice if
>>>>> they could leverage the same docker container with the same data set to
>>>>> ensure the same pipeline produces the same results.
>>>>>
>>>>> This would be different from embedding the data in the specific IT
>>>>> implementation and would also create a coupling between ITs from
>>>>> potentially multiple languages.
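[Editor's note] Lukasz's cross-language point above is easiest to satisfy when the shared data set is generated deterministically (no randomness, no checked-in files), so the Java and Python ITs can each regenerate identical rows and assert on the same values. A hedged sketch; the row format is invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: deterministic test-data generation. Whether this runs in a
// container startup script or in the data load code (#1 in JB's list),
// every language's IT sees exactly the same rows.
public class SharedTestData {

  public static List<String> rows(int count) {
    List<String> rows = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      // Deterministic id/value pairs - reproducible in any language.
      rows.add(i + ",value-" + (i % 10));
    }
    return rows;
  }
}
```

A Python load script implementing the same formula would produce a byte-identical data set, which is what lets one docker container serve ITs from multiple SDKs.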
>>>>>
>>>>> On Tue, Jan 17, 2017 at 4:27 PM, Stephen Sisk <s...@google.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi all!
>>>>>>
>>>>>> As I've discussed previously on this list [1], ensuring that we have
>>>>>> high quality IO transforms is important to Beam. We want to do this
>>>>>> without adding too much burden on developers wanting to contribute.
>>>>>> Below I have a concrete proposal for what an IO integration test would
>>>>>> look like, and an example integration test [4] that meets those
>>>>>> requirements.
>>>>>>
>>>>>> Proposal: we should require that an IO transform include a passing
>>>>>> integration test showing the IO can connect to a real instance of the
>>>>>> data store. We still want/expect comprehensive unit tests on an IO
>>>>>> transform, but we would allow check-ins with just some unit tests in
>>>>>> the presence of an IT.
>>>>>>
>>>>>> To support that, we'll require the following pieces associated with an
>>>>>> IT:
>>>>>>
>>>>>> 1. A Dockerfile that can be used to create a running instance of the
>>>>>> data store. We've previously discussed on this list that we would use
>>>>>> docker images running inside kubernetes or mesos [2], and I'd prefer
>>>>>> having a kubernetes/mesos script to start a given data store, but for a
>>>>>> single-instance data store, we can take a dockerfile and use it to
>>>>>> create a simple kubernetes/mesos app. If you have questions about how
>>>>>> maintaining the containers long term would work, check [2], where I
>>>>>> discussed a detailed plan.
>>>>>>
>>>>>> 2. Code to load test data on the data store created by #1. It needs to
>>>>>> be self-contained. For now, the easiest way to do this would be to have
>>>>>> the code inside of the IT.
>>>>>>
>>>>>> 3. The IT.
>>>>>> I propose keeping this inside of the same module as the IO
>>>>>> transform itself, since having all the IO transform ITs in one module
>>>>>> would mean there may be conflicts between different data stores'
>>>>>> dependencies.
>>>>>> Integration tests will need connection information pointing to the
>>>>>> data store they are testing. As discussed previously on this list [3],
>>>>>> they should receive that connection information via TestPipelineOptions.
>>>>>>
>>>>>> I'd like to get something up and running soon so people checking in
>>>>>> new IO transforms can start taking advantage of an IT framework. Thus,
>>>>>> there are a couple of simplifying assumptions in this plan. Pieces of
>>>>>> the plan that I anticipate will evolve:
>>>>>>
>>>>>> 1. The test data load script - we would like to write these in a
>>>>>> uniform way, and especially ensure that the test data is cleaned up
>>>>>> after the tests run.
>>>>>>
>>>>>> 2. Spinning up/down instances - for now, we'd likely need to do this
>>>>>> manually. It'd be good to get an automated process for this. That's
>>>>>> especially critical for performance tests with multiple nodes -
>>>>>> there's no need to keep instances running for that.
>>>>>>
>>>>>> Integrating closer with PKB would be a good way to do both of these
>>>>>> things, but first let's focus on getting some basic ITs running.
>>>>>>
>>>>>> As a concrete example of this proposal, I've written the JDBC IO IT
>>>>>> [4]. JdbcIOTest already did a lot of test setup, so I heavily re-used
>>>>>> it. The key pieces:
>>>>>>
>>>>>> * The integration test is in JdbcIOIT.
>>>>>>
>>>>>> * JdbcIOIT reads the TestPipelineOptions defined in PostgresTestOptions.
>>>>>> We may move the TestOptions files into a common place so they can be
>>>>>> shared between tests.
>>>>>>
>>>>>> * Test data is created/cleaned up inside of the IT.
>>>>>>
>>>>>> * kubernetes/mesos scripts - I have provided examples of both under
>>>>>> the "jdbc/src/test/resources" directory, but I'd like us to decide as a
>>>>>> project which container orchestration service we want to use - I'll
>>>>>> send mail about that shortly.
>>>>>>
>>>>>> thanks!
>>>>>> Stephen
>>>>>>
>>>>>> [1] Integration Testing Sources
>>>>>> https://lists.apache.org/thread.html/518d78478ae9b6a56d6a690033071aa6e3b817546499c4f0f18d247d@%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> [2] Container Orchestration software for hosting data stores
>>>>>> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> [3] Some Thoughts on IO Integration Tests
>>>>>> https://lists.apache.org/thread.html/637803ccae9c9efc0f4ed01499f1a0658fa73e761ab6ff4e8fa7b469@%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> [4] JDBC IO IT using postgres
>>>>>> https://github.com/ssisk/beam/tree/io-testing/sdks/java/io/jdbc - this
>>>>>> has not been reviewed yet, so it may contain code errors, but it does
>>>>>> run and pass :)
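[Editor's note] The TestPipelineOptions pattern referenced in the thread boils down to: the IT declares the connection parameters it needs, and the test harness supplies them on the command line. Beam builds these from annotated options interfaces via PipelineOptionsFactory; the sketch below is only a stripped-down, stdlib-only illustration of that idea, and the option names are hypothetical (in the spirit of the PostgresTestOptions mentioned above):

```java
import java.util.HashMap;
import java.util.Map;

// Stripped-down sketch of "connection info arrives as pipeline options".
// Beam's real mechanism uses annotated interfaces and generated proxies;
// this only shows the --key=value shape for illustration.
public class ConnectionOptions {

  private final Map<String, String> values = new HashMap<>();

  public ConnectionOptions(String[] args) {
    for (String arg : args) {
      if (arg.startsWith("--") && arg.contains("=")) {
        int eq = arg.indexOf('=');
        values.put(arg.substring(2, eq), arg.substring(eq + 1));
      }
    }
  }

  // Hypothetical option names with local defaults, so the same IT can run
  // against a kubernetes-hosted instance or a developer's local container.
  public String serverName() {
    return values.getOrDefault("postgresServerName", "localhost");
  }

  public int port() {
    return Integer.parseInt(values.getOrDefault("postgresPort", "5432"));
  }
}
```

Keeping connection details out of the IT body is what allows the same test to target whichever data store instance the bootstrap scripts (pieces 1 and 2 above) have spun up.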