Hi,

One tangential question I had around the proposal: how do we currently deal with versioning in IO sources/sinks?
For example, Cassandra 1.2 and 2.1 have some differences between them, so the checked-in source and sink probably support a particular version right now. If so, follow-up questions would be how we handle updating, deprecating, and documenting the supported versions.

I can move this to a new thread if it seems like a different discussion. Also, if this has already been answered, please feel free to direct me to a doc or past thread.

Thanks,
Sourabh

On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <ieme...@gmail.com> wrote:

> Hello,
>
> @Stephen Thanks for your proposal, it is really interesting, I would really
> like to help with this. I have never played with Kubernetes but this seems
> a really nice chance to do something useful with it.
>
> We (at Talend) are testing most of the IOs using simple container images
> and in some particular cases ‘clusters’ of containers using docker-compose
> (a little bit like Amit’s (2) proposal). It would be really nice to have
> this at the Beam level, in particular to try to test more complex
> semantics. I don’t know how programmable Kubernetes is to achieve this, for
> example:
>
> Let’s say we have a cluster of Cassandra or Kafka nodes. I would like to
> have programmatic tests to simulate failure (e.g. kill a node), or simulate
> a really slow node, to ensure that the IO behaves as expected in the Beam
> pipeline for the given runner.
>
> Another related idea is to improve IO consistency: today the different IOs
> have small differences in their failure behavior. I really would like to be
> able to predict with more precision what will happen in case of errors,
> e.g. what is the correct behavior if I am writing to a Kafka node and there
> is a network partition? Does the Kafka sink retry or not? And what about
> JdbcIO, will it work the same, e.g. assuming checkpointing? Or do we
> guarantee exactly-once writes somehow? Today I am not sure what happens
> (or whether the expected behavior depends on the runner), but maybe it is
> just that I don’t know and we already have tests to ensure this.
>
> Of course both are really hard problems, but I think with your proposal we
> can try to tackle them, as well as the performance ones. And apart from the
> data stores, I think it will also be really nice to be able to test the
> runners in a distributed manner.
>
> So what is the next step? How do you imagine such integration tests? Who
> can provide the test machines so we can mount the cluster?
>
> Maybe my ideas are a bit too far away for an initial setup, but it will be
> really nice to start working on this.
>
> Ismaël
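To make the failure-simulation idea above a bit more concrete, here is a minimal sketch, illustrative only: it assumes the data store nodes run as Kubernetes pods, that kubectl is already configured against the test cluster, and the pod/namespace names are made up. An integration test could "kill a node" mid-pipeline like this:

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    /**
     * Illustrative sketch: simulate a node crash by deleting one pod of a
     * containerized data store cluster. Kubernetes reschedules the pod, so
     * the test can also observe recovery behavior afterwards.
     */
    public class NodeKiller {

      /** Deletes a pod via kubectl (assumed to be on the PATH and configured). */
      public static void killPod(String namespace, String podName)
          throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "kubectl", "delete", "pod", podName, "--namespace", namespace)
            .inheritIO()
            .start();
        if (!p.waitFor(60, TimeUnit.SECONDS) || p.exitValue() != 0) {
          throw new IOException("Failed to delete pod " + podName);
        }
      }

      public static void main(String[] args) throws Exception {
        // e.g. kill the second Cassandra node while a pipeline is writing to it
        killPod("io-testing", "cassandra-1");
      }
    }

Simulating the "really slow node" case would need something beyond a plain delete (e.g. resource limits or traffic shaping), so it is not shown here.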
> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <amitsel...@gmail.com> wrote:
>
> > Hi Stephen,
> >
> > I was wondering about how we plan to use the data stores across
> > executions.
> >
> > Clearly, it's best to set up a new instance (container) for every test,
> > running a "standalone" store (say HBase/Cassandra for example), and once
> > the test is done, tear down the instance. It should also be agnostic to
> > the runtime environment (e.g., Docker on Kubernetes).
> > I'm wondering though what the overhead of managing such a deployment
> > would be, as it could become heavy and complicated as more IOs are
> > supported and more test cases are introduced.
> >
> > Another way to go would be to have small clusters of different data
> > stores and run against new "namespaces" (while lazily evicting old ones),
> > but I think this is less likely, as maintaining a distributed instance
> > (even a small one) for each data store sounds even more complex.
> >
> > A third approach would be to simply have an "embedded" in-memory instance
> > of a data store as part of a test that runs against it (such as an
> > embedded Kafka, though not a data store).
> > This is probably the simplest solution in terms of orchestration, but it
> > depends on having a proper "embedded" implementation for an IO.
> >
> > Does this make sense to you? Have you considered it?
> >
> > Thanks,
> > Amit
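For what it's worth, the "embedded" option is roughly what the current unit tests already rely on. A minimal sketch of that flavor, assuming only the Apache Derby jar on the test classpath (the database and table names are just examples), of the kind of throwaway in-memory store a JdbcIO test can create and discard per test:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    /**
     * Illustrative sketch of an "embedded" in-memory data store that lives and
     * dies with the test process, so there is nothing to orchestrate or tear down.
     */
    public class EmbeddedStoreExample {
      public static void main(String[] args) throws Exception {
        // ';create=true' creates the throwaway in-memory database on first connect.
        try (Connection conn =
                DriverManager.getConnection("jdbc:derby:memory:iotest;create=true");
            Statement stmt = conn.createStatement()) {
          stmt.executeUpdate("CREATE TABLE words (word VARCHAR(64))");
          stmt.executeUpdate("INSERT INTO words VALUES ('beam')");
          try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM words")) {
            rs.next();
            System.out.println("rows written: " + rs.getInt(1));
          }
        }
        // The database vanishes when the JVM exits.
      }
    }

The obvious limitation is the one noted above: a proper embedded stand-in exists for some stores but not all, and it says little about realistic performance.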
> >
> > On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > Hi Stephen,
> > >
> > > as already discussed a bit together, it sounds great! Especially I like
> > > it as both an integration test platform and good coverage for IOs.
> > >
> > > I'm very late on this but, as said, I will share with you my Marathon
> > > JSON and Mesos Docker images.
> > >
> > > By the way, I started to experiment a bit with Kubernetes and Swarm, but
> > > it's not yet complete. I will share what I have on the same GitHub repo.
> > >
> > > Thanks!
> > > Regards
> > > JB
> > >
> > > On 11/16/2016 11:36 PM, Stephen Sisk wrote:
> > > > Hi everyone!
> > > >
> > > > Currently we have a good set of unit tests for our IO Transforms -
> > > > those tend to run against in-memory versions of the data stores.
> > > > However, we'd like to further increase our test coverage to include
> > > > running them against real instances of the data stores that the IO
> > > > Transforms work against (e.g. Cassandra, MongoDB, Kafka, etc.), which
> > > > means we'll need to have real instances of various data stores.
> > > >
> > > > Additionally, if we want to do performance regression detection, it's
> > > > important to have instances of the services that behave realistically,
> > > > which isn't true of in-memory or dev versions of the services.
> > > >
> > > > Proposed solution
> > > > -------------------------
> > > > If we accept this proposal, we would create an infrastructure for
> > > > running real instances of data stores inside of containers, using
> > > > container management software like Mesos/Marathon, Kubernetes, Docker
> > > > Swarm, etc. to manage the instances.
> > > >
> > > > This would enable us to build integration tests that run against those
> > > > real instances and performance tests that run against those real
> > > > instances (like those that Jason Kuster is proposing elsewhere).
> > > >
> > > > Why do we need one centralized set of instances vs just having various
> > > > people host their own instances?
> > > > -------------------------
> > > > Reducing flakiness of tests is key. By not having dependencies from
> > > > the core project on external services/instances of data stores, we
> > > > have guaranteed access to the services and the group can fix issues
> > > > that arise.
> > > >
> > > > An exception would be something that has an ops team supporting it
> > > > (e.g. AWS, Google Cloud or another professionally managed service) -
> > > > those we trust will be stable.
> > > >
> > > > There may be a lot of different data stores needed - how will we
> > > > maintain them?
> > > > -------------------------
> > > > It will take work above and beyond that of a normal set of unit tests
> > > > to build and maintain integration/performance tests & their data store
> > > > instances.
> > > >
> > > > Setup & maintenance of the data store containers and the data store
> > > > instances on them must be automated. It also has to be as simple a
> > > > setup as possible, and we should avoid hand-tweaking the containers -
> > > > expecting checked-in scripts/Dockerfiles is key.
> > > >
> > > > Aligned with the community ownership approach of Apache, as members of
> > > > the community are excited to contribute & maintain the
> > > > integration/performance tests and their data store instances, people
> > > > will be able to step up and do that. If there is no longer support for
> > > > maintaining a particular set of integration & performance tests and
> > > > their data store instances, then we can disable those tests. We may
> > > > document on the website which IO Transforms have current
> > > > integration/performance tests so users know what level of testing the
> > > > various IO Transforms have.
> > > >
> > > > What about requirements for the container management software itself?
> > > > -------------------------
> > > > * We should have the data store instances themselves in Docker. Docker
> > > > allows new instances to be spun up in a quick, reproducible way and is
> > > > fairly platform independent. It has wide support from a variety of
> > > > different container management services.
> > > > * As little admin work required as possible. Crashing instances should
> > > > be restarted, setup should be simple, and everything possible should
> > > > be scripted/scriptable.
> > > > * Logs and test output should be on a publicly available website,
> > > > without needing to log into the test execution machine. Centralized
> > > > capture of monitoring info/logs from instances running in the
> > > > containers would support this. Ideally, this would just be supported
> > > > by the container software out of the box.
> > > > * It'd be useful to have good persistent volume support in the
> > > > container management software so that databases don't have to reload
> > > > large data sets every time.
> > > > * The containers may be a place to execute runners themselves if we
> > > > need larger runner instances, so it should play well with Spark,
> > > > Flink, etc.
> > > >
> > > > As I discussed earlier on the mailing list, it looks like hosting
> > > > Docker containers on Kubernetes, Docker Swarm or Mesos+Marathon would
> > > > be a good solution.
> > > >
> > > > Thanks,
> > > > Stephen Sisk
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> >
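Finally, on the per-test containerized option (Amit's first approach, and the checked-in-scripts point in Stephen's proposal), here is a minimal sketch of what a test run against a real dockerized store could look like. It is illustrative only - the Testcontainers library, the cassandra:3.9 image and the port are assumptions on my side, not part of the proposal, and it presumes a local Docker daemon:

    import org.testcontainers.containers.GenericContainer;

    /**
     * Illustrative sketch: start a real, dockerized Cassandra for one test run
     * and tear it down afterwards, so no hand-managed instance is needed.
     * Assumes a local Docker daemon and the org.testcontainers:testcontainers
     * dependency; the image name and port are examples only.
     */
    public class DockerizedStoreExample {
      public static void main(String[] args) {
        GenericContainer cassandra =
            new GenericContainer("cassandra:3.9").withExposedPorts(9042);
        cassandra.start();
        try {
          String host = cassandra.getContainerIpAddress();
          int port = cassandra.getMappedPort(9042);
          // Point the IO transform under test at host:port here.
          System.out.println("Cassandra for this test run at " + host + ":" + port);
        } finally {
          cassandra.stop(); // The container is removed; nothing to clean up by hand.
        }
      }
    }

Whether this per-test approach or a shared, centrally managed cluster is the better fit is exactly the trade-off discussed above.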