hi JB! "IO Writing Guide" sounds like BEAM-1025 (user guide: "How to create Beam IO Transforms"), which I've been working on. Let me pull together what I have into a draft that folks can take a look at. I had an earlier draft that was more focused on sources/sinks, but since we're moving away from those, I started a rewrite. I'll aim to share a draft by the end of the week.
There's also a section about fakes in the testing doc:
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.cykbne9o4iv

Sorry the testing doc and the "how to create" user guide have sat in draft form for a while; I've wanted to finish up the integration testing environment for IOs first.

S

On Wed, Jan 25, 2017 at 8:52 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi

It's what I mentioned in a previous email, yup. It should refer to an "IO Writing Guide" describing the purpose of the service interface, fakes/mocks, etc. I will tackle that in a PR.

Regards
JB

On Jan 25, 2017, at 09:54, Etienne Chauchot <echauc...@gmail.com> wrote:
>Hey Stephen,
>
>That seems perfect!
>
>Another thing, more about software design: maybe you could add to the
>guide what has been discussed on the ML about standardizing the use of:
>
>- an IOService interface in UT and IT,
>- implementations EmbeddedIOService and MockIOService for UT,
>- an implementation RealIOService for IT (name proposal),
>
>if we all agree on these points. Maybe it requires some more
>discussion (methods in the interface; whether the almost-passthrough
>implementations - EmbeddedIOService, RealIOService - are needed, ...).
>
>Etienne
>
>
>On 24/01/2017 at 06:47, Stephen Sisk wrote:
>> hey,
>>
>> thanks - these are good questions/thoughts.
>>
>>> I am more reserved on that one regarding flakiness. IMHO, it is better to
>>> clean in all cases.
>>
>> I strongly agree that we should attempt to clean in each case, and the
>> system should support that. I should have stated that more firmly. As I
>> think about it more, you're also right that we should just not try to do
>> the data loading inside of the test. I amended the guidelines based on your
>> comments and put them in the draft "Testing IO transforms in Apache Beam"
>> doc that I've been working on [1].
>>
>> Here's that snippet:
>> """
>> For both types of tests (integration and performance), you'll need to have
>> scripts that set up your test data - they will be run independently of the
>> tests themselves.
>>
>> The integration and perf tests themselves:
>>
>> 1. Can assume the data load script has been run before the test
>>
>> 2. Must work if they are run multiple times without the data load script
>> being run in between (i.e., they should clean up after themselves or use
>> namespacing such that tests don't interfere with one another)
>>
>> 3. Read tests must not load data or clean data
>>
>> 4. Write tests must use a storage location other than the read tests'
>> (using namespaces/table names/etc., for example) and, if possible, clean it
>> after each test.
>> """
>>
>> Any other comments?
>>
>> Stephen
>>
>> [1]
>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m
>>
>> On Mon, Jan 23, 2017 at 5:19 AM Etienne Chauchot <echauc...@gmail.com>
>> wrote:
>>
>> Hi Stephen,
>>
>> My comments are inline.
>>
>> On 19/01/2017 at 20:32, Stephen Sisk wrote:
>>> I definitely agree that sharing resources between tests is more efficient.
>>>
>>> Etienne - do you think it's necessary to separate the IT from the data
>>> loading script?
>> Actually, I see separation between the IT and the loading script more as an
>> improvement (time- and resource-effective) than as a necessity. Indeed,
>> for now, for example, loading in the ES IT is done within the IT (see
>> https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT).
>>
>>> The postgres/JdbcIOIT can use the natural namespacing of
>>> tables and I feel pretty comfortable that will work well over time.
>> You mean using the same table name in different namespaces? But IMHO,
>> that is still the "using another place" that I mentioned: the read IT and
>> write IT could use the same table name in different namespaces.
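[Editor's note] The namespacing discussed above (read and write ITs sharing one logical table name in separate locations) can be sketched in plain Java. The helper and naming conventions below are hypothetical, purely for illustration; Beam's actual ITs pick their own conventions:

```java
import java.util.UUID;

// Sketch: derive separate, collision-free storage locations for read and
// write ITs that share one logical table name. All names here are made up.
public class TestNamespaces {

  // Read tests use a stable, pre-loaded location: the data load script
  // populates it once, and read tests never modify it (guideline 3).
  public static String readTable(String table) {
    return "read_it." + table;
  }

  // Write tests get a unique location per run, so re-running the suite
  // without the load script still works (guideline 2), and a failed
  // cleanup never pollutes the read dataset (guideline 4).
  public static String writeTable(String table) {
    return "write_it." + table + "_" + UUID.randomUUID().toString().replace("-", "");
  }
}
```

With this shape, a read test and a write test can both refer to, say, `events` without ever touching the same physical location.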
>>> You
>>> haven't explicitly mentioned it, but I'm assuming that elasticsearch
>>> doesn't allow such namespacing, so that's why you're having to do the
>>> separation?
>> Actually, in ES there is no namespace notion, but there is the index name.
>> The index is the document-storing entity that gets split, and there is
>> the document type, which is more like a class definition for the document.
>> So basically, we could have the read IT using readIndex.docType and the
>> write IT using writeIndex.docType.
>>> I'm not trying to discourage separating data load from the IT, just
>>> wondering whether it's truly necessary.
>> IMHO, it's more of an optimization, as I mentioned.
>>> I was trying to consolidate what we've discussed down to a few guidelines.
>>> I think those are that IO ITs:
>>> 1. Can assume the data load script has been run before the test (unless
>>> the data load script is run by the test itself)
>> I agree.
>>> 2. Must work if they are run multiple times without the data load script
>>> being run in between (i.e., they should clean up after themselves or use
>>> namespacing such that tests don't interfere with one another)
>> Yes, sure.
>>> 3. Tests that generate large amounts of data will attempt to clean up
>>> after themselves. (i.e., if you just write 100 rows, don't worry about it -
>>> if you write 5 GB of data, you'd need to clean up.) We will not assume this
>>> will always succeed in cleaning up, but my assumption is that if a
>>> particular data store gets into a bad state, we'll just destroy/recreate
>>> that particular data store.
>> I am more reserved on that one regarding flakiness. IMHO, it is better
>> to clean in all cases. I mentioned in a thread that sharding in the
>> datastore might change depending on data volume (it is not the case for
>> ES because the sharding is defined by configuration), or a
>> shard/partition in the datastore can become so big that it will be split
>> more by the IO.
>> Imagine that a test that writes 100 rows does not do
>> cleanup and is run 1,000 times; then the storage entity becomes bigger
>> and bigger, and it might then be split into more bundles than asserted in
>> split tests (either by decision of the datastore or because
>> desiredBundleSize is small).
>>> If the tests follow those assumptions, then that should support all the
>>> scenarios I can think of: running data store create + data load script
>>> occasionally (say, once a week or month) all the way up to running them
>>> once per test run (if we decided to go that far).
>> Yes, but do we choose to enforce a standard way of coding integration
>> tests, such as:
>> - loading data is done by an exterior loading script
>> - read tests: do not load data, do not clean data
>> - write tests: use another storage place than read tests (using a
>> namespace, for example) and clean it after each test
>> ?
>>
>> Etienne
>>> S
>>>
>>> On Wed, Jan 18, 2017 at 7:57 AM Etienne Chauchot <echauc...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> Yes, thanks all for these clarifications about the testing architecture.
>>>
>>> I agree that points 1 and 2 should be shared between tests as much as
>>> possible. Especially, sharing data loading between tests is more
>>> time-effective and resource-effective: tests that need data (testRead,
>>> testSplit, ...) will save the loading time, the wait for asynchronous
>>> indexation, and the cleaning time. Just a small comment:
>>>
>>> If we share the data loading between tests, then tests that expect an
>>> empty dataset (testWrite, ...) obviously cannot clear the shared dataset.
>>>
>>> So they will need to write to a dedicated place (other than the read
>>> tests') and clean it afterwards.
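[Editor's note] The "clean in all cases" stance argued above is usually enforced by making cleanup unconditional in the test itself. A minimal stdlib sketch of the pattern, where a `Map` stands in for a real datastore client (nothing here is Beam API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "write tests clean their own location, even on failure".
// The Map is a stand-in for a real datastore; names are illustrative.
public class WriteItPattern {

  static final Map<String, Integer> store = new HashMap<>();

  public static int runWriteTest(String location, int rows) {
    try {
      store.put(location, rows);                   // the write under test
      if (rows < 0) {
        throw new AssertionError("bad row count"); // a failing assertion
      }
      return store.get(location);
    } finally {
      store.remove(location);                      // cleanup runs on pass AND fail
    }
  }
}
```

Because cleanup sits in `finally`, a test that fails 1,000 times still leaves the storage entity empty, avoiding the ever-growing-shards problem described above.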
>>>
>>> I will update the ElasticSearch read IT
>>> (https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)
>>> to not do data loading/cleaning, and the write IT to use another location
>>> than the read IT.
>>>
>>> Etienne
>>>
>>> On 18/01/2017 at 13:47, Jean-Baptiste Onofré wrote:
>>>> Hi guys,
>>>>
>>>> First, great e-mail Stephen: a complete and detailed proposal.
>>>>
>>>> Lukasz raised a good point: it makes sense to be able to leverage the
>>>> same "bootstrap" script.
>>>>
>>>> We discussed providing the following in each IO:
>>>> 1. code to load data (java, script, whatever)
>>>> 2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
>>>> 3. actual integration tests
>>>>
>>>> Only 3 is specific to the IO: 1 and 2 can be the same whether we run
>>>> integration tests for the Python or the Java SDK.
>>>>
>>>> However, 3 may depend on 1 and 2 (the integration tests perform some
>>>> assertions based on the loaded data, for instance).
>>>> Today, correct me if I'm wrong, but 1 and 2 will be executed by hand
>>>> or by Jenkins using a "description" of where the code and script are
>>>> located.
>>>>
>>>> So, I think that we can put 1 and 2 in the IO and use a "descriptor" to
>>>> do the bootstrapping.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
>>>>> Since docker containers can run a script on startup, can we embed the
>>>>> initial data set into that script/container build so that the same
>>>>> docker container and initial data set can be used across multiple ITs?
>>>>> For example, if Python and Java both have JdbcIO, it would be nice if
>>>>> they could leverage the same docker container with the same data set to
>>>>> ensure the same pipeline produces the same results.
>>>>>
>>>>> This would be different from embedding the data in the specific IT
>>>>> implementation and would also create a coupling between ITs from
>>>>> potentially multiple languages.
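[Editor's note] Lukasz's cross-language point above is easiest to satisfy when the shared data set is generated deterministically (no randomness, no checked-in files), so the Java and Python ITs can each regenerate identical rows and assert on the same values. A hedged sketch; the row format is invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: deterministic test-data generation. Whether this runs in a
// container startup script or in the data load code (#1 in JB's list),
// every language's IT sees exactly the same rows.
public class SharedTestData {

  public static List<String> rows(int count) {
    List<String> rows = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      // Deterministic id/value pairs - reproducible in any language.
      rows.add(i + ",value-" + (i % 10));
    }
    return rows;
  }
}
```

A Python load script implementing the same formula would produce a byte-identical data set, which is what lets one docker container serve ITs from multiple SDKs.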
>>>>>
>>>>> On Tue, Jan 17, 2017 at 4:27 PM, Stephen Sisk <s...@google.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi all!
>>>>>>
>>>>>> As I've discussed previously on this list [1], ensuring that we have
>>>>>> high quality IO transforms is important to Beam. We want to do this
>>>>>> without adding too much burden on developers wanting to contribute.
>>>>>> Below I have a concrete proposal for what an IO integration test would
>>>>>> look like, and an example integration test [4] that meets those
>>>>>> requirements.
>>>>>>
>>>>>> Proposal: we should require that an IO transform include a passing
>>>>>> integration test showing the IO can connect to a real instance of the
>>>>>> data store. We still want/expect comprehensive unit tests on an IO
>>>>>> transform, but we would allow check-ins with just some unit tests in
>>>>>> the presence of an IT.
>>>>>>
>>>>>> To support that, we'll require the following pieces associated with an
>>>>>> IT:
>>>>>>
>>>>>> 1. A Dockerfile that can be used to create a running instance of the
>>>>>> data store. We've previously discussed on this list that we would use
>>>>>> docker images running inside kubernetes or mesos [2], and I'd prefer
>>>>>> having a kubernetes/mesos script to start a given data store, but for a
>>>>>> single-instance data store, we can take a dockerfile and use it to
>>>>>> create a simple kubernetes/mesos app. If you have questions about how
>>>>>> maintaining the containers long term would work, check [2], where I
>>>>>> discussed a detailed plan.
>>>>>>
>>>>>> 2. Code to load test data on the data store created by #1. It needs to
>>>>>> be self-contained. For now, the easiest way to do this would be to have
>>>>>> the code inside of the IT.
>>>>>>
>>>>>> 3. The IT.
>>>>>> I propose keeping this inside of the same module as the IO
>>>>>> transform itself, since having all the IO transform ITs in one module
>>>>>> would mean there may be conflicts between different data stores'
>>>>>> dependencies.
>>>>>> Integration tests will need connection information pointing to the
>>>>>> data store they are testing. As discussed previously on this list [3],
>>>>>> they should receive that connection information via TestPipelineOptions.
>>>>>>
>>>>>> I'd like to get something up and running soon so people checking in
>>>>>> new IO transforms can start taking advantage of an IT framework. Thus,
>>>>>> there are a couple of simplifying assumptions in this plan. Pieces of
>>>>>> the plan that I anticipate will evolve:
>>>>>>
>>>>>> 1. The test data load script - we would like to write these in a
>>>>>> uniform way, and especially ensure that the test data is cleaned up
>>>>>> after the tests run.
>>>>>>
>>>>>> 2. Spinning up/down instances - for now, we'd likely need to do this
>>>>>> manually. It'd be good to get an automated process for this. That's
>>>>>> especially critical for performance tests with multiple nodes -
>>>>>> there's no need to keep instances running for that.
>>>>>>
>>>>>> Integrating closer with PKB would be a good way to do both of these
>>>>>> things, but first let's focus on getting some basic ITs running.
>>>>>>
>>>>>> As a concrete example of this proposal, I've written the JDBC IO IT
>>>>>> [4]. JdbcIOTest already did a lot of test setup, so I heavily re-used
>>>>>> it. The key pieces:
>>>>>>
>>>>>> * The integration test is in JdbcIOIT.
>>>>>>
>>>>>> * JdbcIOIT reads the TestPipelineOptions defined in PostgresTestOptions.
>>>>>> We may move the TestOptions files into a common place so they can be
>>>>>> shared between tests.
>>>>>>
>>>>>> * Test data is created/cleaned up inside of the IT.
>>>>>>
>>>>>> * kubernetes/mesos scripts - I have provided examples of both under
>>>>>> the "jdbc/src/test/resources" directory, but I'd like us to decide as a
>>>>>> project which container orchestration service we want to use - I'll
>>>>>> send mail about that shortly.
>>>>>>
>>>>>> thanks!
>>>>>> Stephen
>>>>>>
>>>>>> [1] Integration Testing Sources
>>>>>> https://lists.apache.org/thread.html/518d78478ae9b6a56d6a690033071aa6e3b817546499c4f0f18d247d@%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> [2] Container Orchestration software for hosting data stores
>>>>>> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> [3] Some Thoughts on IO Integration Tests
>>>>>> https://lists.apache.org/thread.html/637803ccae9c9efc0f4ed01499f1a0658fa73e761ab6ff4e8fa7b469@%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> [4] JDBC IO IT using postgres
>>>>>> https://github.com/ssisk/beam/tree/io-testing/sdks/java/io/jdbc - this
>>>>>> has not been reviewed yet, so it may contain code errors, but it does
>>>>>> run and pass :)
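[Editor's note] The TestPipelineOptions pattern referenced in the thread boils down to: the IT declares the connection parameters it needs, and the test harness supplies them on the command line. Beam builds these from annotated options interfaces via PipelineOptionsFactory; the sketch below is only a stripped-down, stdlib-only illustration of that idea, and the option names are hypothetical (in the spirit of the PostgresTestOptions mentioned above):

```java
import java.util.HashMap;
import java.util.Map;

// Stripped-down sketch of "connection info arrives as pipeline options".
// Beam's real mechanism uses annotated interfaces and generated proxies;
// this only shows the --key=value shape for illustration.
public class ConnectionOptions {

  private final Map<String, String> values = new HashMap<>();

  public ConnectionOptions(String[] args) {
    for (String arg : args) {
      if (arg.startsWith("--") && arg.contains("=")) {
        int eq = arg.indexOf('=');
        values.put(arg.substring(2, eq), arg.substring(eq + 1));
      }
    }
  }

  // Hypothetical option names with local defaults, so the same IT can run
  // against a kubernetes-hosted instance or a developer's local container.
  public String serverName() {
    return values.getOrDefault("postgresServerName", "localhost");
  }

  public int port() {
    return Integer.parseInt(values.getOrDefault("postgresPort", "5432"));
  }
}
```

Keeping connection details out of the IT body is what allows the same test to target whichever data store instance the bootstrap scripts (pieces 1 and 2 above) have spun up.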