Makes sense, thanks for answering.

On Tue, Nov 22, 2016 at 11:24 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi Sourabh,
>
> We raised the IO versioning point a couple of months ago on the mailing list.
>
> Basically, we have two options:
>
> 1. Same module (for example sdks/java/io/kafka) with one branch per
> version (kafka-0.8, kafka-0.10)
> 2. Several modules: sdks/java/io/kafka-0.8, sdks/java/io/kafka-0.10
>
> My preference is option 2:
> Pros:
> - the IO can still be part of the main Beam release
> - it's more visible for contribution
> Cons:
> - we might have code duplication
>
> Regards
> JB
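To make option 2 concrete, here's a minimal sketch of the shape it implies - a
small shared API that each version-specific module implements against its own
kafka-clients dependency. All names here (KafkaClientFactory,
Kafka010ClientFactory, the kafka-common module) are hypothetical, not actual
Beam code:

    // Hypothetical shared module, e.g. sdks/java/io/kafka-common:
    public interface KafkaClientFactory {
      org.apache.kafka.clients.consumer.Consumer<byte[], byte[]> newConsumer(
          String bootstrapServers);
    }

    // Hypothetical sdks/java/io/kafka-0.10 module, compiled against
    // kafka-clients 0.10.x:
    public class Kafka010ClientFactory implements KafkaClientFactory {
      @Override
      public org.apache.kafka.clients.consumer.Consumer<byte[], byte[]> newConsumer(
          String bootstrapServers) {
        java.util.Properties props = new java.util.Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        // A kafka-0.8 module would implement the same interface against the
        // older 0.8 consumer API - the code duplication "con" JB mentions.
        return new org.apache.kafka.clients.consumer.KafkaConsumer<>(props);
      }
    }

Since both modules would expose the same surface, pipelines could switch Kafka
versions by swapping a dependency, and both modules can ship with the main
Beam release.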
> On 11/22/2016 08:12 PM, Sourabh Bajaj wrote:
> > Hi,
> >
> > One tangential question I had around the proposal was how we currently
> > deal with versioning in IO sources/sinks.
> >
> > For example, Cassandra 1.2 vs 2.1 have some differences between them, so
> > the checked-in sources and sinks probably support a particular version
> > right now. If yes, follow-up questions would be around how we handle
> > updating, deprecating and documenting the supported versions.
> >
> > I can move this to a new thread if this seems like a different discussion.
> > Also, if this has already been answered, please feel free to direct me to
> > a doc or past thread.
> >
> > Thanks
> > Sourabh
> >
> > On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> @Stephen Thanks for your proposal, it is really interesting, and I would
> >> really like to help with this. I have never played with Kubernetes, but
> >> this seems like a really nice chance to do something useful with it.
> >>
> >> We (at Talend) are testing most of the IOs using simple container images,
> >> and in some particular cases 'clusters' of containers using docker-compose
> >> (a little bit like Amit's proposal (2)). It would be really nice to have
> >> this at the Beam level, in particular to try to test more complex
> >> semantics. I don't know whether Kubernetes is programmable enough to
> >> achieve this, for example:
> >>
> >> Let's say we have a cluster of Cassandra or Kafka nodes. I would like to
> >> have programmatic tests that simulate failure (e.g. kill a node) or
> >> simulate a really slow node, to ensure that the IO behaves as expected in
> >> the Beam pipeline for the given runner.
> >>
> >> Another related idea is to improve IO consistency: today the different
> >> IOs have small differences in their failure behavior, and I really would
> >> like to be able to predict with more precision what will happen in case
> >> of errors. E.g. what is the correct behavior if I am writing to a Kafka
> >> node and there is a network partition - does the Kafka sink retry or not?
> >> And what if it is the JdbcIO - will it work the same, e.g. assuming
> >> checkpointing? Or do we guarantee exactly-once writes somehow? Today I am
> >> not sure what happens (or whether the expected behavior depends on the
> >> runner), but maybe it is just that I don't know and we already have tests
> >> to ensure this.
> >>
> >> Of course both are really hard problems, but I think with your proposal
> >> we can try to tackle them, as well as the performance ones. And apart
> >> from the data stores, I think it would also be really nice to be able to
> >> test the runners in a distributed manner.
> >>
> >> So what is the next step? How do you imagine such integration tests? Who
> >> can provide the test machines so we can set up the cluster?
> >>
> >> Maybe my ideas are a bit too far ahead for an initial setup, but it would
> >> be really nice to start working on this.
> >>
> >> Ismael
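On the kill-a-node question: Kubernetes is quite programmable for this. A
minimal sketch using the fabric8 kubernetes-client, with a hypothetical
"io-testing" namespace and pod name (both illustrative):

    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;

    public class NodeFailureInjector {
      // Deletes one pod mid-test to simulate a node failure; the cluster's
      // controller restarts it, letting the test assert that the IO under
      // test retried/recovered as expected.
      public static void killOneNode() {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
          client.pods()
              .inNamespace("io-testing")   // hypothetical test namespace
              .withName("cassandra-0")     // hypothetical pod name
              .delete();
        }
      }
    }

A slow node could be simulated along the same lines, e.g. by putting tight
resource limits or a traffic-shaping sidecar on one pod, though that takes
more setup.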
> >> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <amitsel...@gmail.com> wrote:
> >>
> >>> Hi Stephen,
> >>>
> >>> I was wondering about how we plan to use the data stores across
> >>> executions.
> >>>
> >>> Clearly, it's best to set up a new instance (container) for every test,
> >>> running a "standalone" store (say HBase/Cassandra for example), and once
> >>> the test is done, tear down the instance. It should also be agnostic to
> >>> the runtime environment (e.g., Docker on Kubernetes).
> >>> I'm wondering, though, what the overhead of managing such a deployment
> >>> is; it could become heavy and complicated as more IOs are supported and
> >>> more test cases are introduced.
> >>>
> >>> Another way to go would be to have small clusters of different data
> >>> stores and run against new "namespaces" (while lazily evicting old
> >>> ones), but I think this is less likely, as maintaining a distributed
> >>> instance (even a small one) for each data store sounds even more complex.
> >>>
> >>> A third approach would be to simply have an "embedded" in-memory
> >>> instance of a data store as part of the test that runs against it (such
> >>> as an embedded Kafka, though that's not a data store).
> >>> This is probably the simplest solution in terms of orchestration, but it
> >>> depends on having a proper "embedded" implementation for an IO.
> >>>
> >>> Does this make sense to you? Have you considered it?
> >>>
> >>> Thanks,
> >>> Amit
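The third, embedded approach is already workable for some stores. A minimal
sketch using cassandra-unit's embedded server in a JUnit 4 test; the class
name is illustrative, and port 9142 assumes cassandra-unit's stock
configuration:

    import org.cassandraunit.utils.EmbeddedCassandraServerHelper;
    import org.junit.AfterClass;
    import org.junit.BeforeClass;
    import org.junit.Test;

    public class CassandraIOEmbeddedTest {

      @BeforeClass
      public static void startEmbeddedCassandra() throws Exception {
        // Starts an in-process Cassandra; no container orchestration needed.
        EmbeddedCassandraServerHelper.startEmbeddedCassandra();
      }

      @AfterClass
      public static void cleanUp() {
        EmbeddedCassandraServerHelper.cleanEmbeddedCassandra();
      }

      @Test
      public void readsFromEmbeddedInstance() {
        // The IO under test connects to 127.0.0.1:9142 exactly as it would
        // to a real cluster, so the same test can later target a
        // containerized instance by swapping host/port.
      }
    }

The trade-off Amit notes still stands: this only works where a faithful
embedded implementation exists, and it won't behave realistically enough for
performance tests.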
> >>> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> >>> wrote:
> >>>
> >>>> Hi Stephen,
> >>>>
> >>>> as already discussed a bit together, it sounds great! Especially, I
> >>>> like it as both an integration test platform and good coverage for IOs.
> >>>>
> >>>> I'm very late on this but, as said, I will share with you my Marathon
> >>>> JSON and Mesos docker images.
> >>>>
> >>>> By the way, I started to experiment a bit with Kubernetes and Swarm,
> >>>> but it's not yet complete. I will share what I have on the same GitHub
> >>>> repo.
> >>>>
> >>>> Thanks!
> >>>> Regards
> >>>> JB
> >>>>
> >>>> On 11/16/2016 11:36 PM, Stephen Sisk wrote:
> >>>>> Hi everyone!
> >>>>>
> >>>>> Currently we have a good set of unit tests for our IO Transforms -
> >>>>> those tend to run against in-memory versions of the data stores.
> >>>>> However, we'd like to further increase our test coverage to include
> >>>>> running them against real instances of the data stores that the IO
> >>>>> Transforms work against (e.g. cassandra, mongodb, kafka, etc…), which
> >>>>> means we'll need to have real instances of various data stores.
> >>>>>
> >>>>> Additionally, if we want to do performance regression detection, it's
> >>>>> important to have instances of the services that behave realistically,
> >>>>> which isn't true of in-memory or dev versions of the services.
> >>>>>
> >>>>>
> >>>>> Proposed solution
> >>>>> -------------------------
> >>>>> If we accept this proposal, we would create an infrastructure for
> >>>>> running real instances of data stores inside of containers, using
> >>>>> container management software like mesos/marathon, kubernetes, docker
> >>>>> swarm, etc… to manage the instances.
> >>>>>
> >>>>> This would enable us to build integration tests and performance tests
> >>>>> that run against those real instances (like those that Jason Kuster is
> >>>>> proposing elsewhere).
> >>>>>
> >>>>>
> >>>>> Why do we need one centralized set of instances vs. just having
> >>>>> various people host their own instances?
> >>>>> -------------------------
> >>>>> Reducing flakiness of tests is key. By not having dependencies from
> >>>>> the core project on external services/instances of data stores, we
> >>>>> have guaranteed access to the services, and the group can fix issues
> >>>>> that arise.
> >>>>>
> >>>>> An exception would be something that has an ops team supporting it
> >>>>> (e.g., AWS, Google Cloud or another professionally managed service) -
> >>>>> those we trust will be stable.
> >>>>>
> >>>>>
> >>>>> There may be a lot of different data stores needed - how will we
> >>>>> maintain them?
> >>>>> -------------------------
> >>>>> It will take work above and beyond that of a normal set of unit tests
> >>>>> to build and maintain integration/performance tests & their data store
> >>>>> instances.
> >>>>>
> >>>>> Setup & maintenance of the data store containers and the data store
> >>>>> instances on them must be automated. It also has to be as simple a
> >>>>> setup as possible, and we should avoid hand-tweaking the containers -
> >>>>> expecting checked-in scripts/dockerfiles is key.
> >>>>>
> >>>>> Aligned with the community ownership approach of Apache, as members of
> >>>>> the community are excited to contribute & maintain those tests and the
> >>>>> integration/performance tests, people will be able to step up and do
> >>>>> that. If there is no longer support for maintaining a particular set
> >>>>> of integration & performance tests and their data store instances,
> >>>>> then we can disable those tests. We may document on the website which
> >>>>> IO Transforms have current integration/performance tests, so users
> >>>>> know what level of testing the various IO Transforms have.
> >>>>>
> >>>>>
> >>>>> What about requirements for the container management software itself?
> >>>>> -------------------------
> >>>>> * We should have the data store instances themselves in Docker. Docker
> >>>>> allows new instances to be spun up in a quick, reproducible way and is
> >>>>> fairly platform-independent. It has wide support from a variety of
> >>>>> different container management services.
> >>>>> * As little admin work required as possible. Crashing instances should
> >>>>> be restarted, setup should be simple, and everything possible should
> >>>>> be scripted/scriptable.
> >>>>> * Logs and test output should be on a publicly available website,
> >>>>> without needing to log into the test execution machine. Centralized
> >>>>> capture of monitoring info/logs from instances running in the
> >>>>> containers would support this. Ideally, this would just be supported
> >>>>> by the container software out of the box.
> >>>>> * It'd be useful to have good persistent volume support in the
> >>>>> container management software so that databases don't have to reload
> >>>>> large data sets every time.
> >>>>> * The containers may be a place to execute runners themselves if we
> >>>>> need larger runner instances, so they should play well with Spark,
> >>>>> Flink, etc…
> >>>>>
> >>>>> As I discussed earlier on the mailing list, it looks like hosting
> >>>>> docker containers on kubernetes, docker swarm or mesos+marathon would
> >>>>> be a good solution.
> >>>>>
> >>>>> Thanks,
> >>>>> Stephen Sisk
> >>>>
> >>>> --
> >>>> Jean-Baptiste Onofré
> >>>> jbono...@apache.org
> >>>> http://blog.nanthrax.net
> >>>> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
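As a closing illustration of how tests could reach such centrally hosted
instances: rather than hard-coding endpoints, integration tests could take
them as pipeline options, so the same test runs against an embedded store
locally or a container-hosted one in CI. A minimal sketch using Beam's
PipelineOptions; the interface name, option names, and defaults are
hypothetical:

    import org.apache.beam.sdk.options.Default;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;

    // Hypothetical options interface: the test receives the address of the
    // centrally managed, container-hosted instance at runtime, e.g.
    // --dataStoreHost=cassandra.io-testing.example.com
    public interface IOTestOptions extends PipelineOptions {
      @Description("Hostname of the data store instance under test")
      @Default.String("localhost")
      String getDataStoreHost();
      void setDataStoreHost(String value);

      @Description("Port of the data store instance under test")
      @Default.Integer(9042)
      Integer getDataStorePort();
      void setDataStorePort(Integer value);
    }

A Jenkins (or similar) job could then point the same test at whichever
service endpoint the container management software exposes for that store.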