Hi,

One tangential question I had around the proposal: how do we currently deal with versioning in IO sources/sinks?
For example, Cassandra 1.2 and 2.1 have some differences between them, so the checked-in source and sink probably support a particular version right now. If so, follow-up questions would be how we handle updating, deprecating, and documenting the supported versions.

I can move this to a new thread if it seems like a different discussion. Also, if this has already been answered, please feel free to direct me to a doc or past thread.

Thanks,
Sourabh

On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <ieme...@gmail.com> wrote:

> Hello,
>
> @Stephen Thanks for your proposal, it is really interesting, I would really
> like to help with this. I have never played with Kubernetes but this seems
> a really nice chance to do something useful with it.
>
> We (at Talend) are testing most of the IOs using simple container images
> and in some particular cases ‘clusters’ of containers using docker-compose
> (a little bit like Amit’s (2) proposal). It would be really nice to have
> this at the Beam level, in particular to try to test more complex
> semantics. I don’t know how programmable Kubernetes is to achieve this, for
> example:
>
> Let’s say we have a cluster of Cassandra or Kafka nodes. I would like to
> have programmatic tests to simulate failure (e.g. kill a node), or simulate
> a really slow node, to ensure that the IO behaves as expected in the Beam
> pipeline for the given runner.
>
> Another related idea is to improve IO consistency: today the different IOs
> have small differences in their failure behavior. I really would like to be
> able to predict with more precision what will happen in case of errors,
> e.g. what is the correct behavior if I am writing to a Kafka node and there
> is a network partition? Does the Kafka sink retry or not? And what about
> JdbcIO, will it work the same, e.g. assuming checkpointing? Or do we
> guarantee exactly-once writes somehow? Today I am not sure what happens
> (or whether the expected behavior depends on the runner), but maybe it is
> just that I don’t know and we already have tests to ensure this.
>
> Of course both are really hard problems, but I think with your proposal we
> can try to tackle them, as well as the performance ones. And apart from the
> data stores, I think it will also be really nice to be able to test the
> runners in a distributed manner.
>
> So what is the next step? How do you imagine such integration tests? Who
> can provide the test machines so we can mount the cluster?
>
> Maybe my ideas are a bit too far away for an initial setup, but it will be
> really nice to start working on this.
>
> Ismaël
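To make the failure-simulation idea above a bit more concrete, here is a minimal sketch, illustrative only: it assumes the data store nodes run as Kubernetes pods, that kubectl is already configured against the test cluster, and the pod/namespace names are made up. An integration test could "kill a node" mid-pipeline like this:

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    /**
     * Illustrative sketch: simulate a node crash by deleting one pod of a
     * containerized data store cluster. Kubernetes reschedules the pod, so
     * the test can also observe recovery behavior afterwards.
     */
    public class NodeKiller {

      /** Deletes a pod via kubectl (assumed to be on the PATH and configured). */
      public static void killPod(String namespace, String podName)
          throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "kubectl", "delete", "pod", podName, "--namespace", namespace)
            .inheritIO()
            .start();
        if (!p.waitFor(60, TimeUnit.SECONDS) || p.exitValue() != 0) {
          throw new IOException("Failed to delete pod " + podName);
        }
      }

      public static void main(String[] args) throws Exception {
        // e.g. kill the second Cassandra node while a pipeline is writing to it
        killPod("io-testing", "cassandra-1");
      }
    }

Simulating the "really slow node" case would need something beyond a plain delete (e.g. resource limits or traffic shaping), so it is not shown here.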
> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <amitsel...@gmail.com> wrote:
>
> > Hi Stephen,
> >
> > I was wondering about how we plan to use the data stores across
> > executions.
> >
> > Clearly, it's best to set up a new instance (container) for every test,
> > running a "standalone" store (say HBase/Cassandra for example), and once
> > the test is done, tear down the instance. It should also be agnostic to
> > the runtime environment (e.g., Docker on Kubernetes).
> > I'm wondering though what the overhead of managing such a deployment
> > would be, as it could become heavy and complicated as more IOs are
> > supported and more test cases are introduced.
> >
> > Another way to go would be to have small clusters of different data
> > stores and run against new "namespaces" (while lazily evicting old ones),
> > but I think this is less likely, as maintaining a distributed instance
> > (even a small one) for each data store sounds even more complex.
> >
> > A third approach would be to simply have an "embedded" in-memory instance
> > of a data store as part of a test that runs against it (such as an
> > embedded Kafka, though not a data store).
> > This is probably the simplest solution in terms of orchestration, but it
> > depends on having a proper "embedded" implementation for an IO.
> >
> > Does this make sense to you? Have you considered it?
> >
> > Thanks,
> > Amit
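For what it's worth, the "embedded" option is roughly what the current unit tests already rely on. A minimal sketch of that flavor, assuming only the Apache Derby jar on the test classpath (the database and table names are just examples), of the kind of throwaway in-memory store a JdbcIO test can create and discard per test:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    /**
     * Illustrative sketch of an "embedded" in-memory data store that lives and
     * dies with the test process, so there is nothing to orchestrate or tear down.
     */
    public class EmbeddedStoreExample {
      public static void main(String[] args) throws Exception {
        // ';create=true' creates the throwaway in-memory database on first connect.
        try (Connection conn =
                DriverManager.getConnection("jdbc:derby:memory:iotest;create=true");
            Statement stmt = conn.createStatement()) {
          stmt.executeUpdate("CREATE TABLE words (word VARCHAR(64))");
          stmt.executeUpdate("INSERT INTO words VALUES ('beam')");
          try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM words")) {
            rs.next();
            System.out.println("rows written: " + rs.getInt(1));
          }
        }
        // The database vanishes when the JVM exits.
      }
    }

The obvious limitation is the one noted above: a proper embedded stand-in exists for some stores but not all, and it says little about realistic performance.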
> >
> > On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > Hi Stephen,
> > >
> > > as already discussed a bit together, it sounds great! Especially I like
> > > it as both an integration test platform and good coverage for IOs.
> > >
> > > I'm very late on this but, as said, I will share with you my Marathon
> > > JSON and Mesos Docker images.
> > >
> > > By the way, I started to experiment a bit with Kubernetes and Swarm, but
> > > it's not yet complete. I will share what I have on the same GitHub repo.
> > >
> > > Thanks!
> > > Regards
> > > JB
> > >
> > > On 11/16/2016 11:36 PM, Stephen Sisk wrote:
> > > > Hi everyone!
> > > >
> > > > Currently we have a good set of unit tests for our IO Transforms -
> > > > those tend to run against in-memory versions of the data stores.
> > > > However, we'd like to further increase our test coverage to include
> > > > running them against real instances of the data stores that the IO
> > > > Transforms work against (e.g. Cassandra, MongoDB, Kafka, etc.), which
> > > > means we'll need to have real instances of various data stores.
> > > >
> > > > Additionally, if we want to do performance regression detection, it's
> > > > important to have instances of the services that behave realistically,
> > > > which isn't true of in-memory or dev versions of the services.
> > > >
> > > > Proposed solution
> > > > -------------------------
> > > > If we accept this proposal, we would create an infrastructure for
> > > > running real instances of data stores inside of containers, using
> > > > container management software like Mesos/Marathon, Kubernetes, Docker
> > > > Swarm, etc. to manage the instances.
> > > >
> > > > This would enable us to build integration tests that run against those
> > > > real instances and performance tests that run against those real
> > > > instances (like those that Jason Kuster is proposing elsewhere).
> > > >
> > > > Why do we need one centralized set of instances vs just having various
> > > > people host their own instances?
> > > > -------------------------
> > > > Reducing flakiness of tests is key. By not having dependencies from
> > > > the core project on external services/instances of data stores, we
> > > > have guaranteed access to the services and the group can fix issues
> > > > that arise.
> > > >
> > > > An exception would be something that has an ops team supporting it
> > > > (e.g. AWS, Google Cloud or another professionally managed service) -
> > > > those we trust will be stable.
> > > >
> > > > There may be a lot of different data stores needed - how will we
> > > > maintain them?
> > > > -------------------------
> > > > It will take work above and beyond that of a normal set of unit tests
> > > > to build and maintain integration/performance tests & their data store
> > > > instances.
> > > >
> > > > Setup & maintenance of the data store containers and the data store
> > > > instances on them must be automated. It also has to be as simple a
> > > > setup as possible, and we should avoid hand-tweaking the containers -
> > > > expecting checked-in scripts/Dockerfiles is key.
> > > >
> > > > Aligned with the community ownership approach of Apache, as members of
> > > > the community are excited to contribute & maintain the
> > > > integration/performance tests and their data store instances, people
> > > > will be able to step up and do that. If there is no longer support for
> > > > maintaining a particular set of integration & performance tests and
> > > > their data store instances, then we can disable those tests. We may
> > > > document on the website which IO Transforms have current
> > > > integration/performance tests so users know what level of testing the
> > > > various IO Transforms have.
> > > >
> > > > What about requirements for the container management software itself?
> > > > -------------------------
> > > > * We should have the data store instances themselves in Docker. Docker
> > > > allows new instances to be spun up in a quick, reproducible way and is
> > > > fairly platform independent. It has wide support from a variety of
> > > > different container management services.
> > > > * As little admin work required as possible. Crashing instances should
> > > > be restarted, setup should be simple, and everything possible should
> > > > be scripted/scriptable.
> > > > * Logs and test output should be on a publicly available website,
> > > > without needing to log into the test execution machine. Centralized
> > > > capture of monitoring info/logs from instances running in the
> > > > containers would support this. Ideally, this would just be supported
> > > > by the container software out of the box.
> > > > * It'd be useful to have good persistent volume support in the
> > > > container management software so that databases don't have to reload
> > > > large data sets every time.
> > > > * The containers may be a place to execute runners themselves if we
> > > > need larger runner instances, so it should play well with Spark,
> > > > Flink, etc.
> > > >
> > > > As I discussed earlier on the mailing list, it looks like hosting
> > > > Docker containers on Kubernetes, Docker Swarm or Mesos+Marathon would
> > > > be a good solution.
> > > >
> > > > Thanks,
> > > > Stephen Sisk
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> >
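Finally, on the per-test containerized option (Amit's first approach, and the checked-in-scripts point in Stephen's proposal), here is a minimal sketch of what a test run against a real dockerized store could look like. It is illustrative only - the Testcontainers library, the cassandra:3.9 image and the port are assumptions on my side, not part of the proposal, and it presumes a local Docker daemon:

    import org.testcontainers.containers.GenericContainer;

    /**
     * Illustrative sketch: start a real, dockerized Cassandra for one test run
     * and tear it down afterwards, so no hand-managed instance is needed.
     * Assumes a local Docker daemon and the org.testcontainers:testcontainers
     * dependency; the image name and port are examples only.
     */
    public class DockerizedStoreExample {
      public static void main(String[] args) {
        GenericContainer cassandra =
            new GenericContainer("cassandra:3.9").withExposedPorts(9042);
        cassandra.start();
        try {
          String host = cassandra.getContainerIpAddress();
          int port = cassandra.getMappedPort(9042);
          // Point the IO transform under test at host:port here.
          System.out.println("Cassandra for this test run at " + host + ":" + port);
        } finally {
          cassandra.stop(); // The container is removed; nothing to clean up by hand.
        }
      }
    }

Whether this per-test approach or a shared, centrally managed cluster is the better fit is exactly the trade-off discussed above.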