Makes sense, thanks for answering.

On Tue, Nov 22, 2016 at 11:24 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi Sourabh,
>
> We raised the IO versioning point a couple of months ago on the mailing list.
>
> Basically, we have two options:
>
> 1. Same module (for example sdks/java/io/kafka) with one branch per
> version (kafka-0.8, kafka-0.10)
> 2. Several modules: sdks/java/io/kafka-0.8, sdks/java/io/kafka-0.10
>
> My preference is option 2:
> Pros:
> - the IO can still be part of the main Beam release
> - it's more visible for contribution
> Cons:
> - we might have code duplication
>
> Regards
> JB
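To make option 2 concrete, here's a minimal sketch of the shape it implies - a
small shared API that each version-specific module implements against its own
kafka-clients dependency. All names here (KafkaClientFactory,
Kafka010ClientFactory, the kafka-common module) are hypothetical, not actual
Beam code:

    // Hypothetical shared module, e.g. sdks/java/io/kafka-common:
    public interface KafkaClientFactory {
      org.apache.kafka.clients.consumer.Consumer<byte[], byte[]> newConsumer(
          String bootstrapServers);
    }

    // Hypothetical sdks/java/io/kafka-0.10 module, compiled against
    // kafka-clients 0.10.x:
    public class Kafka010ClientFactory implements KafkaClientFactory {
      @Override
      public org.apache.kafka.clients.consumer.Consumer<byte[], byte[]> newConsumer(
          String bootstrapServers) {
        java.util.Properties props = new java.util.Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        // A kafka-0.8 module would implement the same interface against the
        // older 0.8 consumer API - the code duplication "con" JB mentions.
        return new org.apache.kafka.clients.consumer.KafkaConsumer<>(props);
      }
    }

Since both modules would expose the same surface, pipelines could switch Kafka
versions by swapping a dependency, and both modules can ship with the main
Beam release.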
> On 11/22/2016 08:12 PM, Sourabh Bajaj wrote:
> > Hi,
> >
> > One tangential question I had around the proposal was how we currently
> > deal with versioning in IO sources/sinks.
> >
> > For example, Cassandra 1.2 vs 2.1 have some differences between them, so
> > the checked-in sources and sinks probably support a particular version
> > right now. If yes, follow-up questions would be around how we handle
> > updating, deprecating and documenting the supported versions.
> >
> > I can move this to a new thread if this seems like a different discussion.
> > Also, if this has already been answered, please feel free to direct me to
> > a doc or past thread.
> >
> > Thanks
> > Sourabh
> >
> > On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> @Stephen Thanks for your proposal, it is really interesting, and I would
> >> really like to help with this. I have never played with Kubernetes, but
> >> this seems like a really nice chance to do something useful with it.
> >>
> >> We (at Talend) are testing most of the IOs using simple container images,
> >> and in some particular cases 'clusters' of containers using docker-compose
> >> (a little bit like Amit's proposal (2)). It would be really nice to have
> >> this at the Beam level, in particular to try to test more complex
> >> semantics. I don't know whether Kubernetes is programmable enough to
> >> achieve this, for example:
> >>
> >> Let's say we have a cluster of Cassandra or Kafka nodes. I would like to
> >> have programmatic tests that simulate failure (e.g. kill a node) or
> >> simulate a really slow node, to ensure that the IO behaves as expected in
> >> the Beam pipeline for the given runner.
> >>
> >> Another related idea is to improve IO consistency: today the different
> >> IOs have small differences in their failure behavior, and I really would
> >> like to be able to predict with more precision what will happen in case
> >> of errors. E.g. what is the correct behavior if I am writing to a Kafka
> >> node and there is a network partition - does the Kafka sink retry or not?
> >> And what if it is the JdbcIO - will it work the same, e.g. assuming
> >> checkpointing? Or do we guarantee exactly-once writes somehow? Today I am
> >> not sure what happens (or whether the expected behavior depends on the
> >> runner), but maybe it is just that I don't know and we already have tests
> >> to ensure this.
> >>
> >> Of course both are really hard problems, but I think with your proposal
> >> we can try to tackle them, as well as the performance ones. And apart
> >> from the data stores, I think it would also be really nice to be able to
> >> test the runners in a distributed manner.
> >>
> >> So what is the next step? How do you imagine such integration tests? Who
> >> can provide the test machines so we can set up the cluster?
> >>
> >> Maybe my ideas are a bit too far ahead for an initial setup, but it would
> >> be really nice to start working on this.
> >>
> >> Ismael
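On the kill-a-node question: Kubernetes is quite programmable for this. A
minimal sketch using the fabric8 kubernetes-client, with a hypothetical
"io-testing" namespace and pod name (both illustrative):

    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;

    public class NodeFailureInjector {
      // Deletes one pod mid-test to simulate a node failure; the cluster's
      // controller restarts it, letting the test assert that the IO under
      // test retried/recovered as expected.
      public static void killOneNode() {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
          client.pods()
              .inNamespace("io-testing")   // hypothetical test namespace
              .withName("cassandra-0")     // hypothetical pod name
              .delete();
        }
      }
    }

A slow node could be simulated along the same lines, e.g. by putting tight
resource limits or a traffic-shaping sidecar on one pod, though that takes
more setup.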
> >> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <amitsel...@gmail.com> wrote:
> >>
> >>> Hi Stephen,
> >>>
> >>> I was wondering about how we plan to use the data stores across
> >>> executions.
> >>>
> >>> Clearly, it's best to set up a new instance (container) for every test,
> >>> running a "standalone" store (say HBase/Cassandra for example), and once
> >>> the test is done, tear down the instance. It should also be agnostic to
> >>> the runtime environment (e.g., Docker on Kubernetes).
> >>> I'm wondering, though, what the overhead of managing such a deployment
> >>> is; it could become heavy and complicated as more IOs are supported and
> >>> more test cases are introduced.
> >>>
> >>> Another way to go would be to have small clusters of different data
> >>> stores and run against new "namespaces" (while lazily evicting old
> >>> ones), but I think this is less likely, as maintaining a distributed
> >>> instance (even a small one) for each data store sounds even more complex.
> >>>
> >>> A third approach would be to simply have an "embedded" in-memory
> >>> instance of a data store as part of the test that runs against it (such
> >>> as an embedded Kafka, though that's not a data store).
> >>> This is probably the simplest solution in terms of orchestration, but it
> >>> depends on having a proper "embedded" implementation for an IO.
> >>>
> >>> Does this make sense to you? Have you considered it?
> >>>
> >>> Thanks,
> >>> Amit
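The third, embedded approach is already workable for some stores. A minimal
sketch using cassandra-unit's embedded server in a JUnit 4 test; the class
name is illustrative, and port 9142 assumes cassandra-unit's stock
configuration:

    import org.cassandraunit.utils.EmbeddedCassandraServerHelper;
    import org.junit.AfterClass;
    import org.junit.BeforeClass;
    import org.junit.Test;

    public class CassandraIOEmbeddedTest {

      @BeforeClass
      public static void startEmbeddedCassandra() throws Exception {
        // Starts an in-process Cassandra; no container orchestration needed.
        EmbeddedCassandraServerHelper.startEmbeddedCassandra();
      }

      @AfterClass
      public static void cleanUp() {
        EmbeddedCassandraServerHelper.cleanEmbeddedCassandra();
      }

      @Test
      public void readsFromEmbeddedInstance() {
        // The IO under test connects to 127.0.0.1:9142 exactly as it would
        // to a real cluster, so the same test can later target a
        // containerized instance by swapping host/port.
      }
    }

The trade-off Amit notes still stands: this only works where a faithful
embedded implementation exists, and it won't behave realistically enough for
performance tests.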
> >>> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> >>> wrote:
> >>>
> >>>> Hi Stephen,
> >>>>
> >>>> as already discussed a bit together, it sounds great! Especially, I
> >>>> like it as both an integration test platform and good coverage for IOs.
> >>>>
> >>>> I'm very late on this but, as said, I will share with you my Marathon
> >>>> JSON and Mesos docker images.
> >>>>
> >>>> By the way, I started to experiment a bit with Kubernetes and Swarm,
> >>>> but it's not yet complete. I will share what I have on the same GitHub
> >>>> repo.
> >>>>
> >>>> Thanks!
> >>>> Regards
> >>>> JB
> >>>>
> >>>> On 11/16/2016 11:36 PM, Stephen Sisk wrote:
> >>>>> Hi everyone!
> >>>>>
> >>>>> Currently we have a good set of unit tests for our IO Transforms -
> >>>>> those tend to run against in-memory versions of the data stores.
> >>>>> However, we'd like to further increase our test coverage to include
> >>>>> running them against real instances of the data stores that the IO
> >>>>> Transforms work against (e.g. cassandra, mongodb, kafka, etc…), which
> >>>>> means we'll need to have real instances of various data stores.
> >>>>>
> >>>>> Additionally, if we want to do performance regression detection, it's
> >>>>> important to have instances of the services that behave realistically,
> >>>>> which isn't true of in-memory or dev versions of the services.
> >>>>>
> >>>>>
> >>>>> Proposed solution
> >>>>> -------------------------
> >>>>> If we accept this proposal, we would create an infrastructure for
> >>>>> running real instances of data stores inside of containers, using
> >>>>> container management software like mesos/marathon, kubernetes, docker
> >>>>> swarm, etc… to manage the instances.
> >>>>>
> >>>>> This would enable us to build integration tests and performance tests
> >>>>> that run against those real instances (like those that Jason Kuster is
> >>>>> proposing elsewhere).
> >>>>>
> >>>>>
> >>>>> Why do we need one centralized set of instances vs. just having
> >>>>> various people host their own instances?
> >>>>> -------------------------
> >>>>> Reducing flakiness of tests is key. By not having dependencies from
> >>>>> the core project on external services/instances of data stores, we
> >>>>> have guaranteed access to the services, and the group can fix issues
> >>>>> that arise.
> >>>>>
> >>>>> An exception would be something that has an ops team supporting it
> >>>>> (e.g., AWS, Google Cloud or another professionally managed service) -
> >>>>> those we trust will be stable.
> >>>>>
> >>>>>
> >>>>> There may be a lot of different data stores needed - how will we
> >>>>> maintain them?
> >>>>> -------------------------
> >>>>> It will take work above and beyond that of a normal set of unit tests
> >>>>> to build and maintain integration/performance tests & their data store
> >>>>> instances.
> >>>>>
> >>>>> Setup & maintenance of the data store containers and the data store
> >>>>> instances on them must be automated. It also has to be as simple a
> >>>>> setup as possible, and we should avoid hand-tweaking the containers -
> >>>>> expecting checked-in scripts/dockerfiles is key.
> >>>>>
> >>>>> Aligned with the community ownership approach of Apache, as members of
> >>>>> the community are excited to contribute & maintain those tests and the
> >>>>> integration/performance tests, people will be able to step up and do
> >>>>> that. If there is no longer support for maintaining a particular set
> >>>>> of integration & performance tests and their data store instances,
> >>>>> then we can disable those tests. We may document on the website which
> >>>>> IO Transforms have current integration/performance tests, so users
> >>>>> know what level of testing the various IO Transforms have.
> >>>>>
> >>>>>
> >>>>> What about requirements for the container management software itself?
> >>>>> -------------------------
> >>>>> * We should have the data store instances themselves in Docker. Docker
> >>>>> allows new instances to be spun up in a quick, reproducible way and is
> >>>>> fairly platform-independent. It has wide support from a variety of
> >>>>> different container management services.
> >>>>> * As little admin work required as possible. Crashing instances should
> >>>>> be restarted, setup should be simple, and everything possible should
> >>>>> be scripted/scriptable.
> >>>>> * Logs and test output should be on a publicly available website,
> >>>>> without needing to log into the test execution machine. Centralized
> >>>>> capture of monitoring info/logs from instances running in the
> >>>>> containers would support this. Ideally, this would just be supported
> >>>>> by the container software out of the box.
> >>>>> * It'd be useful to have good persistent volume support in the
> >>>>> container management software so that databases don't have to reload
> >>>>> large data sets every time.
> >>>>> * The containers may be a place to execute runners themselves if we
> >>>>> need larger runner instances, so they should play well with Spark,
> >>>>> Flink, etc…
> >>>>>
> >>>>> As I discussed earlier on the mailing list, it looks like hosting
> >>>>> docker containers on kubernetes, docker swarm or mesos+marathon would
> >>>>> be a good solution.
> >>>>>
> >>>>> Thanks,
> >>>>> Stephen Sisk
> >>>>
> >>>> --
> >>>> Jean-Baptiste Onofré
> >>>> jbono...@apache.org
> >>>> http://blog.nanthrax.net
> >>>> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
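As a closing illustration of how tests could reach such centrally hosted
instances: rather than hard-coding endpoints, integration tests could take
them as pipeline options, so the same test runs against an embedded store
locally or a container-hosted one in CI. A minimal sketch using Beam's
PipelineOptions; the interface name, option names, and defaults are
hypothetical:

    import org.apache.beam.sdk.options.Default;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;

    // Hypothetical options interface: the test receives the address of the
    // centrally managed, container-hosted instance at runtime, e.g.
    // --dataStoreHost=cassandra.io-testing.example.com
    public interface IOTestOptions extends PipelineOptions {
      @Description("Hostname of the data store instance under test")
      @Default.String("localhost")
      String getDataStoreHost();
      void setDataStoreHost(String value);

      @Description("Port of the data store instance under test")
      @Default.Integer(9042)
      Integer getDataStorePort();
      void setDataStorePort(Integer value);
    }

A Jenkins (or similar) job could then point the same test at whichever
service endpoint the container management software exposes for that store.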