Thanks for your analysis Stephen, good arguments/references. One quick question: have you checked the APIs of both (Mesos/Kubernetes) to see if we can programmatically do more complex tests? (I suppose so, but you don't mention whether those are possible or how easy they are.) For example, simulating a slave with slow networking (to test stragglers), or arbitrarily killing one slave (e.g. if I want to test the correct behavior of a runner/IO that is reading from it).
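To make the question concrete, below is roughly what I would hope to be able to script (a minimal untested sketch with the Kubernetes Python client; the namespace, the app=cassandra label and the NET_ADMIN capability needed for tc are my assumptions, not something from your mail):

    # Hypothetical sketch: kill one data store pod and slow down another.
    # Assumes a cluster reachable via ~/.kube/config and pods labeled
    # app=cassandra; injecting latency with tc requires NET_ADMIN in the pod.
    from kubernetes import client, config
    from kubernetes.stream import stream

    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod("default", label_selector="app=cassandra").items

    # 1) Arbitrarily kill one slave while a runner/IO is reading from it.
    v1.delete_namespaced_pod(pods[0].metadata.name, "default",
                             body=client.V1DeleteOptions(grace_period_seconds=0))

    # 2) Simulate a straggler: add 500ms latency to another node's interface.
    stream(v1.connect_get_namespaced_pod_exec, pods[1].metadata.name, "default",
           command=["tc", "qdisc", "add", "dev", "eth0", "root",
                    "netem", "delay", "500ms"],
           stdout=True, stderr=True, stdin=False, tty=False)

If Mesos/Marathon offers an equally simple programmatic path for both operations, that would be good to know too.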
Another point missing from the review is the availability of ready-to-use packages; I think in this area mesos/dcos seems more advanced, no? I haven't looked recently, but at least 6 months ago there were not many helm packages ready, for example to test kafka or the hadoop ecosystem stuff (hdfs, hbase, etc.). Has this improved? Preparing these packages is also a considerable amount of work; on the other hand, this could also be a chance to contribute to kubernetes. Regards, Ismaël On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <[email protected]> wrote: > hi! > > I've been continuing this investigation, and have some more info to report, > and hopefully we can start making some decisions. > > To support performance testing, I've been investigating mesos+marathon and > kubernetes for running data stores in their high availability mode. I have > been examining features that kubernetes/mesos+marathon use to support this. > > Setting up a multi-node cluster in a high availability mode tends to be > more expensive time-wise than the single node instances I've played around > with in the past. Rather than do a full build out with both kubernetes and > mesos, I'd like to pick one of the two options to build the prototype > cluster with. If the prototype doesn't go well, we could still go back to > the other option, but I'd like to change us from a mode of "let's look at > all the options" to one of "here's the favorite, let's prove that works for > us". > > Below are the features that I've seen are important to multi-node instances > of data stores. I'm sure other folks on the list have done this before, so > feel free to pipe up if I'm missing a good solution to a problem. > > DNS/Discovery > > -------------------- > > Necessary for talking between nodes (eg, cassandra nodes all need to be > able to talk to a set of seed nodes.) > > * Kubernetes has built-in DNS/discovery between nodes. > > * Mesos supports this via mesos-dns, which isn't a part of core mesos, > but is in dcos, which is the mesos distribution I've been using and that I > would expect us to use. > > Instances properly distributed across nodes > > ------------------------------------------------------------ > > If multiple instances of a data source end up on the same underlying VM, we > may not get good performance out of those instances since the underlying VM > may be more taxed than other VMs. > > * Kubernetes has a beta feature StatefulSets[1] which allows containers to be > distributed so that there's one container per underlying machine (as well > as a lot of other useful features like easy stable dns names.) > > * Mesos can support this via the built-in UNIQUE constraint [2] > > Load balancing > > -------------------- > > Incoming requests from users need to be distributed to the various machines > - this is important for many data stores' high availability modes. > > * Kubernetes supports easily hooking up to an external load balancer when > on a cloud (and can be configured to work with a built-in load balancer if > not) > > * Mesos supports this via marathon-lb [3], which is an install-able package > in DC/OS > > Persistent Volumes tied to specific instances > > ------------------------------------------------------------ > > Databases often need persistent state (for example to store the data :), so > it's an important part of running our service.
> > * Kubernetes StatefulSets support this > > * Mesos+marathon apps with persistent volumes support this [4] [5] > > As I mentioned above, I'd like to focus on either kubernetes or mesos for > my investigation, and as I go further along, I'm seeing kubernetes as > better suited to our needs. > > (1) It supports more of the features we want out of the box, and with > StatefulSets, Kubernetes handles them all together neatly - eg. DC/OS > requires marathon-lb to be installed and mesos-dns to be configured. > > (2) I'm also finding that there seem to be more examples of using > kubernetes to solve the types of problems we're working on. This is > somewhat subjective, but in my experience as I've tried to learn both > kubernetes and mesos, I personally found it generally easier to get > kubernetes running than mesos due to the tutorials/examples available for > kubernetes. > > (3) Lower cost of initial setup - as I discussed in a previous mail[6], > kubernetes was far easier to get set up even when I knew the exact steps. > Mesos took me around 27 steps [7], which involved a lot of config that was > easy to get wrong (it took me about 5 tries to get all the steps correct in > one go.) Kubernetes took me around 8 steps and very little config. > > Given that, I'd like to focus my investigation/prototyping on Kubernetes. > To > be clear, it's fairly close and I think both Mesos and Kubernetes could > support what we need, so if we run into issues with kubernetes, Mesos still > seems like a viable option that we could fall back to. > > Thanks, > Stephen > > > [1] Kubernetes StatefulSets > https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/ > > [2] mesos unique constraint - > https://mesosphere.github.io/marathon/docs/constraints.html > > [3] > https://mesosphere.github.io/marathon/docs/service-discovery-load-balancing.html > and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/ > > [4] https://mesosphere.github.io/marathon/docs/persistent-volumes.html > > [5] https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/ > > [6] Container Orchestration software for hosting data stores > https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E > > [7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md > > > On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <[email protected]> wrote: > > > Just a quick drive-by comment: how tests are laid out has non-trivial > > tradeoffs on how/where continuous integration runs, and how results are > > integrated into the tooling. The current state is certainly not ideal > > (e.g., due to multiple test executions some links in Jenkins point where > > they shouldn't), but most other alternatives had even bigger drawbacks at > > the time. If someone has great ideas that don't explode the number of > > modules, please share ;-) > > > > On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot <[email protected]> > > wrote: > > > > > Hi Stephen, > > > > > > Thanks for taking the time to comment. > > > > > > My comments are below in the email: > > > > > > > > > On 24/12/2016 at 00:07, Stephen Sisk wrote: > > > > > >> hey Etienne - > > >> > > >> thanks for your thoughts and thanks for sharing your experiences. I > > >> generally agree with what you're saying.
Quick comments below: > > >> > > >> IT are stored alongside UT in the src/test directory of the IO but > they > > >>> > > >> might go to a dedicated module, waiting for a consensus > > >> I don't have a strong opinion or feel that I've worked enough with > maven > > >> to > > >> understand all the consequences - I'd love for someone with more maven > > >> experience to weigh in. If this becomes blocking, I'd say check it in, > > and > > >> we can refactor later if it proves problematic. > > >> > > > Sure, not a blocking point, it could be refactored afterwards. Just as > a > > > reminder, JB mentioned that storing IT in a separate module allows us to > have > > > more coherence between all IT (same behavior) and to do cross-IO > > > integration tests. JB, have you experienced some long-term drawbacks of > > > storing IT in a separate module, like, for example, more difficult > > > maintenance due to "distance" from production code? > > > > > > > > >> Also IMHO, it is better that tests load/clean data than making > > >>> > > >> assumptions about the running order of the tests. > > >> I definitely agree that we don't want to make assumptions about the > > >> running > > >> order of the tests - that way lies pain. :) It will be interesting to > > see > > >> how the performance tests work out since they will need more data (and > > >> thus > > >> loading data can take much longer.) > > >> > > > Yes, performance testing might push in the direction of data loading > from > > > outside the tests due to loading time. > > > > > >> This should also be an easier problem > > >> for read tests than for write tests - if we have long-running > instances, > > >> read tests don't really need cleanup. And if write tests only write a > > >> small > > >> amount of data, as long as we are sure we're writing to uniquely > > >> identifiable locations (ie, new table per test or something similar), > we > > >> can clean up the write test data on a slower schedule. > > >> > > > I agree > > > > > >> > > >> this will tend to go in the direction of long-running data store > > >>> > > >> instances rather than data store instances started (and optionally > > loaded) > > >> before tests. > > >> It may be easiest to start with a "data stores stay running" > > >> implementation, and then if we see issues with that, move towards tests > > >> that > > >> start/stop the data stores on each run. One thing I'd like to make > sure > > is > > >> that we're not manually tweaking the configurations for data stores. One > > >> way we could do that is to destroy/recreate the data stores on a > slower > > >> schedule - maybe once per week. That way if the script is changed or > the > > >> data store instances are changed, we'd be able to detect it relatively > > >> soon > > >> while still removing the need for the tests to manage the data stores. > > >> > > > I agree. In addition to manual configuration tweaking, there might be > > > cases in which a data store re-partitions data during a test or after > some > > > tests while the dataset changes. The IO must be tolerant of that, but > the > > > asserts (number of bundles, for example) in the test must not fail in that > > case. > > > I would also prefer, if possible, that the tests do not manage data > stores > > > (not set them up, not start them, not stop them) > > > > > > > > >> as a general note, I suspect many of the folks in the states will be > on > > >> holiday until Jan 2nd/3rd.
> > >> > > >> S > > >> > > >> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot <[email protected]> > > >> wrote: > > >> > > >> Hi, > > >>> > > >>> Recently we had a discussion about integration tests of IOs. I'm > > >>> preparing a PR for integration tests of the elasticSearch IO > > >>> ( > > >>> https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO > > >>> as a first shot) which are very important IMHO because they helped > > catch > > >>> some bugs that UT could not (volume, data store instance sharing, real > > >>> data store instance ...) > > >>> > > >>> I would like to have your thoughts/remarks about the points below. Some > of > > >>> these points are also discussed here > > >>> > > >>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a > > >>> : > > >>> > > >>> - UT and IT have a similar architecture, but while UT focus on > testing > > >>> the correct behavior of the code, including corner cases, and use an > > embedded > > >>> in-memory data store, IT assume that the behavior is correct (strong > > UT) > > >>> and focus on higher-volume testing and testing against real data > store > > >>> instance(s) > > >>> > > >>> - For now, IT are stored alongside UT in the src/test directory of > the > > >>> IO but they might go to a dedicated module, waiting for a consensus. > > Maven > > >>> is not configured to run them automatically because the data store is not > > >>> available on the jenkins server yet > > >>> > > >>> - For now, they only use DirectRunner, but they will be run against > > >>> each runner. > > >>> > > >>> - IT do not set up the data store instance (as stated in the above > > >>> document); they assume that one is already running (hardcoded > > >>> configuration in the test for now, waiting for a common solution to pass > > >>> configuration to IT). A docker container script is provided in the > > >>> contrib directory as a starting point for whatever orchestration > > software > > >>> will be chosen. > > >>> > > >>> - IT load and clean test data before and after each test if needed. It > > >>> is simpler to do so because some tests need an empty data store (write > > >>> tests) and because, as discussed in the document, tests might not be > the > > >>> only users of the data store. Also IMHO, it is better that tests > > >>> load/clean data than making assumptions about the running order > of > > >>> the tests. > > >>> > > >>> If we generalize this pattern to all IT tests, this will tend to go > in > > >>> the direction of long-running data store instances rather than data > > >>> store instances started (and optionally loaded) before tests. > > >>> > > >>> Besides, if we were to change our minds and load data from outside > the > > >>> tests, a logstash script is provided. > > >>> > > >>> If you have any thoughts or remarks I'm all ears :) > > >>> > > >>> Regards, > > >>> > > >>> Etienne > > >>> > > >>> On 14/12/2016 at 17:07, Jean-Baptiste Onofré wrote: > > >>> > > >>>> Hi Stephen, > > >>>> > > >>>> the purpose of having them in a specific module is to share resources and > > >>>> apply the same behavior from an IT perspective and be able to have IT > > >>>> "cross" IO (for instance, reading from JMS and sending to Kafka, I > > >>>> think that's the key idea for integration tests).
> > >>>> > > >>>> For instance, in Karaf, we have: > > >>>> - utest in each module > > >>>> - itest module containing itests for all modules all together > > >>>> > > >>>> Regards > > >>>> JB > > >>>> > > >>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote: > > >>>> > > >>>>> Hi Etienne, > > >>>>> > > >>>>> thanks for following up and answering my questions. > > >>>>> > > >>>>> re: where to store integration tests - having them all in a > separate > > >>>>> module > > >>>>> is an interesting idea. I couldn't find JB's comments about moving > > them > > >>>>> into a separate module in the PR - can you share the reasons for > > >>>>> doing so? > > >>>>> The IO integration/perf tests do seem like they'll need to > be > > >>>>> treated in a special manner, but given that there is already an IO > > >>>>> specific > > >>>>> module, it may just be that we need to treat all the ITs in the IO > > >>>>> module > > >>>>> the same way. I don't have strong opinions either way right now. > > >>>>> > > >>>>> S > > >>>>> > > >>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <[email protected]> > > >>>>> wrote: > > >>>>> > > >>>>> Hi guys, > > >>>>> > > >>>>> @Stephen: I addressed all your comments directly in the PR, thanks! > > >>>>> I just wanted to comment here about the docker image I used: the > only > > >>>>> official Elastic image contains only ElasticSearch. But for > testing I > > >>>>> needed logstash (for ingestion) and kibana (not for integration > > tests, > > >>>>> but to easily test REST requests to ES using sense). This is why I > > use > > >>>>> an ELK (Elasticsearch+Logstash+Kibana) image. This one is released > > >>>>> under > > >>>>> the Apache 2 license. > > >>>>> > > >>>>> > > >>>>> Besides, there is also a point about where to store integration > > tests: > > >>>>> JB proposed in the PR to store integration tests in a dedicated > module > > >>>>> rather than directly in the IO module (like I did). > > >>>>> > > >>>>> > > >>>>> > > >>>>> Etienne > > >>>>> > > >>>>> On 01/12/2016 at 20:14, Stephen Sisk wrote: > > >>>>> > > >>>>>> hey! > > >>>>>> > > >>>>>> thanks for sending this. I'm very excited to see this change. I > > >>>>>> added some > > >>>>>> detail-oriented code review comments in addition to what I've > > >>>>>> discussed > > >>>>>> here. > > >>>>>> > > >>>>>> The general goal is to allow for re-usable instantiation of > particular > > >>>>>> > > >>>>> data > > >>>>> > > >>>>>> store instances and this seems like a good start. Looks like you > > >>>>>> also have > > >>>>>> a script to generate test data for your tests - that's great. > > >>>>>> > > >>>>>> The next steps (definitely not blocking your work) will be to have > > >>>>>> ways to > > >>>>>> create instances from the docker images you have here, and use > them > > >>>>>> in the > > >>>>>> tests. We'll need support in the test framework for that since > it'll > > >>>>>> be > > >>>>>> different on developer machines and in the beam jenkins cluster, > but > > >>>>>> your > > >>>>>> scripts here allow someone running these tests locally to not have > to > > >>>>>> > > >>>>> worry > > >>>>> > > >>>>>> about getting the instance set up, and they can manually adjust, so this > > is > > >>>>>> a > > >>>>>> good incremental step.
> > >>>>>> > > >>>>>> I have some thoughts now that I'm reviewing your scripts (that I > > >>>>>> didn't > > >>>>>> have previously, so we are learning this together): > > >>>>>> * It may be useful to try and document why we chose a particular > > >>>>>> docker > > >>>>>> image as the base (ie, "this is the official supported elastic > > search > > >>>>>> docker image" or "this image has several data stores together that > > >>>>>> can be > > >>>>>> used for a couple different tests") - I'm curious as to whether > the > > >>>>>> community thinks that is important > > >>>>>> > > >>>>>> One thing that I called out in the comment that's worth mentioning > > >>>>>> on the > > >>>>>> larger list - if you want to specify which specific runners a test > > >>>>>> uses, > > >>>>>> that can be controlled in the pom for the module. I updated the > > >>>>>> testing > > >>>>>> > > >>>>> doc > > >>>>> > > >>>>>> mentioned previously in this thread with a TODO to talk about this > > >>>>>> more. I > > >>>>>> think we should also make it so that IO modules have that > > >>>>>> automatically, > > >>>>>> > > >>>>> so > > >>>>> > > >>>>>> developers don't have to worry about it. > > >>>>>> > > >>>>>> S > > >>>>>> > > >>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <[email protected]> > > >>>>>> > > >>>>> wrote: > > >>>>> > > >>>>>> Stephen, > > >>>>>> > > >>>>>> As discussed, I added an injection script, docker container scripts > and > > >>>>>> integration tests to the sdks/java/io/elasticsearch/contrib > > >>>>>> <https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9> > > >>>> directory in that PR: > > >>>>>> https://github.com/apache/incubator-beam/pull/1439. > > >>>>>> > > >>>>>> These work well, but they are a first shot. Do you have any comments > > >>>>>> about > > >>>>>> those? > > >>>>>> > > >>>>>> Besides, I am not very sure that these files should be in the IO > > itself > > >>>>>> (even in the contrib directory, out of the maven source directories). Any > > >>>>>> > > >>>>> thoughts? > > >>>>> > > >>>>>> Thanks, > > >>>>>> > > >>>>>> Etienne > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> On 23/11/2016 at 19:03, Stephen Sisk wrote: > > >>>>>>> It's great to hear more experiences. > > >>>>>>> > > >>>>>>> I'm also glad to hear that people see real value in the high > > >>>>>>> volume/performance benchmark tests. I tried to capture that in > the > > >>>>>>> > > >>>>>> Testing > > >>>>> > > >>>>>> doc I shared, under "Reasons for Beam Test Strategy". [1] > > >>>>>>> > > >>>>>>> It does generally sound like we're in agreement here. Areas of > > >>>>>>> discussion > > >>>>>>> > > >>>>>> I > > >>>>>> > > >>>>>>> see: > > >>>>>>> 1. People like the idea of bringing up fresh instances for each > > test > > >>>>>>> rather than keeping instances running all the time, since that > > >>>>>>> ensures no > > >>>>>>> contamination between tests. That seems reasonable to me. If we > see > > >>>>>>> flakiness in the tests or we note that setting up/tearing down > > >>>>>>> instances > > >>>>>>> > > >>>>>> is > > >>>>>> > > >>>>>>> taking a lot of time, we can revisit this. > > >>>>>>> 2. Deciding on cluster management software/orchestration software > - I > > >>>>>>> > > >>>>>> want > > >>>>> > > >>>>>> to make sure we land on the right tool here since choosing the > > >>>>>>> wrong tool > > >>>>>>> could result in administration of the instances taking more > work.
I > > >>>>>>> > > >>>>>> suspect > > >>>>>> > > >>>>>>> that's a good place for a follow-up discussion, so I'll start a > > >>>>>>> separate > > >>>>>>> thread on that. I'm happy with whatever tool we choose, but I > want > > to > > >>>>>>> > > >>>>>> make > > >>>>> > > >>>>>> sure we take a moment to consider different options and have a > > >>>>>>> reason for > > >>>>>>> choosing one. > > >>>>>>> > > >>>>>>> Etienne - thanks for being willing to port your creation/other > > >>>>>>> scripts > > >>>>>>> over. You might be a good early tester of whether this system > works > > >>>>>>> well > > >>>>>>> for everyone. > > >>>>>>> > > >>>>>>> Stephen > > >>>>>>> > > >>>>>>> [1] Reasons for Beam Test Strategy - > > >>>>>>> > > >>>>>>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec# > > >>>> > > >>>>>>> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré > > >>>>>>> <[email protected]> > > >>>>>>> wrote: > > >>>>>>> > > >>>>>>> I second Etienne there. > > >>>>>>>> > > >>>>>>>> We worked together on the ElasticsearchIO and definitely, the most > > >>>>>>>> valuable tests we did were integration tests with ES on docker > and > > >>>>>>>> high > > >>>>>>>> volume. > > >>>>>>>> > > >>>>>>>> I think we have to distinguish the two kinds of tests: > > >>>>>>>> 1. utests are located in the IO itself and basically they should > > >>>>>>>> cover > > >>>>>>>> the core behaviors of the IO > > >>>>>>>> 2. itests are located as contrib in the IO (they could be part > of > > >>>>>>>> the IO > > >>>>>>>> but executed by the integration-test plugin or a specific > profile) > > >>>>>>>> and > > >>>>>>>> deal with a "real" backend and high volumes. The resources > required > > >>>>>>>> by > > >>>>>>>> the itests can be bootstrapped by Jenkins (for instance using > > >>>>>>>> Mesos/Marathon and docker images as already discussed, and it's > > >>>>>>>> what I'm > > >>>>>>>> doing on my own "server"). > > >>>>>>>> > > >>>>>>>> It's basically what Stephen described. > > >>>>>>>> > > >>>>>>>> We must not rely only on itests: utests are very important > and > > >>>>>>>> they > > >>>>>>>> validate the core behavior. > > >>>>>>>> > > >>>>>>>> My $0.01 ;) > > >>>>>>>> > > >>>>>>>> Regards > > >>>>>>>> JB > > >>>>>>>> > > >>>>>>>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote: > > >>>>>>>> > > >>>>>>>>> Hi Stephen, > > >>>>>>>>> > > >>>>>>>>> I like your proposition very much and I also agree that docker + > > >>>>>>>>> some > > >>>>>>>>> orchestration software would be great ! > > >>>>>>>>> > > >>>>>>>>> On the elasticsearchIO (PR to be created this week) there are > docker > > >>>>>>>>> container creation scripts and a logstash data ingestion script > for the > > >>>>>>>>> IT > > >>>>>>>>> environment available in the contrib directory, alongside the > > >>>>>>>>> integration > > >>>>>>>>> tests themselves. I'll be happy to make them compliant with the new > IT > > >>>>>>>>> environment. > > >>>>>>>>> > > >>>>>>>>> What you say below about the need for an external IT environment > is > > >>>>>>>>> particularly true. As an example, with ES what came out in the first > > >>>>>>>>> implementation was that there were problems starting at some > high > > >>>>>>>>> > > >>>>>>>> volume > > >>>>> > > >>>>>> of data (timeouts, ES windowing overflow...) that could not have > been > > >>>>>>>>> > > >>>>>>>> seen > > >>>>> > > >>>>>> on the embedded ES version.
Also, there were some particularities of the > > >>>>>>>>> external instance, like secondary (replica) shards, that were > not > > >>>>>>>>> > > >>>>>>>> visible > > >>>>> > > >>>>>> on the embedded instance. > > >>>>>>>>> > > >>>>>>>>> Besides, I also favor bringing up instances before tests because > it > > >>>>>>>>> allows us (amongst other things) to be sure to start on a fresh > > >>>>>>>>> dataset > > >>>>>>>>> > > >>>>>>>> so > > >>>>> > > >>>>>> the test is deterministic. > > >>>>>>>>> > > >>>>>>>>> Etienne > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> On 23/11/2016 at 02:00, Stephen Sisk wrote: > > >>>>>>>>> > > >>>>>>>>>> Hi, > > >>>>>>>>>> > > >>>>>>>>>> I'm excited we're getting lots of discussion going. There are > many > > >>>>>>>>>> threads > > >>>>>>>>>> of conversation here; we may choose to split some of them off > > >>>>>>>>>> into a > > >>>>>>>>>> different email thread. I'm also betting I missed some of the > > >>>>>>>>>> questions in > > >>>>>>>>>> this thread, so apologies ahead of time for that. Also > apologies > > >>>>>>>>>> for > > >>>>>>>>>> > > >>>>>>>>> the > > >>>>>> > > >>>>>>> amount of text; I provided some quick summaries at the top of > each > > >>>>>>>>>> section. > > >>>>>>>>>> > > >>>>>>>>>> Amit - thanks for your thoughts. I've responded in detail > below. > > >>>>>>>>>> Ismael - thanks for offering to help. There's plenty of work > > >>>>>>>>>> here to > > >>>>>>>>>> > > >>>>>>>>> go > > >>>>> > > >>>>>> around. I'll try and think about how we can divide up some next > > >>>>>>>>>> steps > > >>>>>>>>>> (probably in a separate thread.) The main next step I see is > > >>>>>>>>>> deciding > > >>>>>>>>>> between kubernetes/mesos+marathon/docker swarm - I'm working > on > > >>>>>>>>>> that, > > >>>>>>>>>> > > >>>>>>>>> but > > >>>>>>>> > > >>>>>>>>> having lots of different thoughts on what the > > >>>>>>>>>> advantages/disadvantages > > >>>>>>>>>> > > >>>>>>>>> of > > >>>>>>>> > > >>>>>>>>> those are would be helpful (I'm not entirely sure of the > > >>>>>>>>>> protocol for > > >>>>>>>>>> collaborating on sub-projects like this.) > > >>>>>>>>>> > > >>>>>>>>>> These issues are all related to what kind of tests we want to > > >>>>>>>>>> write. I > > >>>>>>>>>> think a kubernetes/mesos/swarm cluster could support all the > use > > >>>>>>>>>> cases > > >>>>>>>>>> we've discussed here (and thus should not block moving forward > > >>>>>>>>>> with > > >>>>>>>>>> this), > > >>>>>>>>>> but understanding what we want to test will help us understand > > >>>>>>>>>> how the > > >>>>>>>>>> cluster will be used. I'm working on a proposed user guide for > > >>>>>>>>>> testing > > >>>>>>>>>> > > >>>>>>>>> IO > > >>>>>>>> > > >>>>>>>>> Transforms, and I'm going to send out a link to that + a short > > >>>>>>>>>> summary > > >>>>>>>>>> > > >>>>>>>>> to > > >>>>>>>> > > >>>>>>>>> the list shortly so folks can get a better sense of where I'm > > >>>>>>>>>> coming > > >>>>>>>>>> from. > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> Here's my thinking on the questions we've raised here - > > >>>>>>>>>> > > >>>>>>>>>> Embedded versions of data stores for testing > > >>>>>>>>>> -------------------- > > >>>>>>>>>> Summary: yes! But we still need real data stores to test > against. > > >>>>>>>>>> > > >>>>>>>>>> I am a gigantic fan of using embedded versions of the various > data > > >>>>>>>>>> stores.
> > >>>>>>>>>> I think we should test everything we possibly can using them, > > >>>>>>>>>> and do > > >>>>>>>>>> > > >>>>>>>>> the > > >>>>>> > > >>>>>>> majority of our correctness testing using embedded versions + the > > >>>>>>>>>> > > >>>>>>>>> direct > > >>>>>> > > >>>>>>> runner. However, it's also important to have at least one test > that > > >>>>>>>>>> actually connects to an actual instance, so we can get > coverage > > >>>>>>>>>> for > > >>>>>>>>>> things > > >>>>>>>>>> like credentials, real connection strings, etc... > > >>>>>>>>>> > > >>>>>>>>>> The key point is that embedded versions definitely can't cover > > the > > >>>>>>>>>> performance tests, so we need to host instances if we want to > > test > > >>>>>>>>>> > > >>>>>>>>> that. > > >>>>>> > > >>>>>>> I consider the integration tests/performance benchmarks to be > > >>>>>>>>>> costly > > >>>>>>>>>> things > > >>>>>>>>>> that we do only for the IO transforms with large amounts of > > >>>>>>>>>> community > > >>>>>>>>>> support/usage. A random IO transform used by a few users > doesn't > > >>>>>>>>>> necessarily need integration & perf tests, but for heavily > used > > IO > > >>>>>>>>>> transforms, there's a lot of community value in these tests. > The > > >>>>>>>>>> maintenance proposal below scales with the amount of community > > >>>>>>>>>> support > > >>>>>>>>>> for > > >>>>>>>>>> a particular IO transform. > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> Reusing data stores ("use the data stores across executions.") > > >>>>>>>>>> ------------------ > > >>>>>>>>>> Summary: I favor a hybrid approach: some frequently used, very > > >>>>>>>>>> small > > >>>>>>>>>> instances that we keep up all the time + larger > multi-container > > >>>>>>>>>> data > > >>>>>>>>>> store > > >>>>>>>>>> instances that we spin up for perf tests. > > >>>>>>>>>> > > >>>>>>>>>> I don't think we need to have a strong answer to this > question, > > >>>>>>>>>> but I > > >>>>>>>>>> think > > >>>>>>>>>> we do need to know what range of capabilities we need, and use > > >>>>>>>>>> that to > > >>>>>>>>>> inform our requirements on the hosting infrastructure. I think > > >>>>>>>>>> kubernetes/mesos + docker can support all the scenarios I > > discuss > > >>>>>>>>>> > > >>>>>>>>> below. > > >>>>>> > > >>>>>>> I had been thinking of a hybrid approach - reuse some instances > and > > >>>>>>>>>> > > >>>>>>>>> don't > > >>>>>>>> > > >>>>>>>>> reuse others. Some tests require isolation from other tests > (eg. > > >>>>>>>>>> performance benchmarking), while others can easily re-use the > > same > > >>>>>>>>>> database/data store instance over time, provided they are > > >>>>>>>>>> written in > > >>>>>>>>>> > > >>>>>>>>> the > > >>>>>> > > >>>>>>> correct manner (eg. a simple read or write correctness > integration > > >>>>>>>>>> > > >>>>>>>>> tests) > > >>>>>>>> > > >>>>>>>>> To me, the question of whether to use one instance over time > for > > a > > >>>>>>>>>> test vs > > >>>>>>>>>> spin up an instance for each test comes down to a trade off > > >>>>>>>>>> between > > >>>>>>>>>> > > >>>>>>>>> these > > >>>>>>>> > > >>>>>>>>> factors: > > >>>>>>>>>> 1. Flakiness of spin-up of an instance - if it's super flaky, > > >>>>>>>>>> we'll > > >>>>>>>>>> want to > > >>>>>>>>>> keep more instances up and running rather than bring them > > up/down. > > >>>>>>>>>> > > >>>>>>>>> (this > > >>>>>> > > >>>>>>> may also vary by the data store in question) > > >>>>>>>>>> 2. 
Frequency of testing - if we are running tests every 5 > > >>>>>>>>>> minutes, it > > >>>>>>>>>> > > >>>>>>>>> may > > >>>>>>>> > > >>>>>>>>> be wasteful to bring machines up/down every time. If we run > > >>>>>>>>>> tests once > > >>>>>>>>>> > > >>>>>>>>> a > > >>>>>> > > >>>>>>> day or week, it seems wasteful to keep the machines up the whole > > >>>>>>>>>> time. > > >>>>>>>>>> 3. Isolation requirements - If tests must be isolated, it > means > > we > > >>>>>>>>>> > > >>>>>>>>> either > > >>>>>>>> > > >>>>>>>>> have to bring up the instances for each test, or we have to > have > > >>>>>>>>>> some > > >>>>>>>>>> sort > > >>>>>>>>>> of signaling mechanism to indicate that a given instance is in > > >>>>>>>>>> use. I > > >>>>>>>>>> strongly favor bringing up an instance per test. > > >>>>>>>>>> 4. Number/size of containers - if we need a large number of > > >>>>>>>>>> machines > > >>>>>>>>>> for a > > >>>>>>>>>> particular test, keeping them running all the time will use > more > > >>>>>>>>>> resources. > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> The major unknown to me is how flaky it'll be to spin these > up. > > >>>>>>>>>> I'm > > >>>>>>>>>> hopeful/assuming they'll be pretty stable to bring up, but I > > >>>>>>>>>> think the > > >>>>>>>>>> best > > >>>>>>>>>> way to test that is to start doing it. > > >>>>>>>>>> > > >>>>>>>>>> I suspect the sweet spot is the following: have a set of very > > >>>>>>>>>> small > > >>>>>>>>>> > > >>>>>>>>> data > > >>>>>> > > >>>>>>> store instances that stay up to support small-data-size > post-commit > > >>>>>>>>>> end to > > >>>>>>>>>> end tests (post-commits run frequently and the data size means > > the > > >>>>>>>>>> instances would not use many resources), combined with the > > >>>>>>>>>> ability to > > >>>>>>>>>> spin > > >>>>>>>>>> up larger instances for once a day/week performance benchmarks > > >>>>>>>>>> (these > > >>>>>>>>>> > > >>>>>>>>> use > > >>>>>>>> > > >>>>>>>>> up more resources and are used less frequently.) That's the mix > > >>>>>>>>>> I'll > > >>>>>>>>>> propose in my docs on testing IO transforms. If spinning up > new > > >>>>>>>>>> instances > > >>>>>>>>>> is cheap/non-flaky, I'd be fine with the idea of spinning up > > >>>>>>>>>> instances > > >>>>>>>>>> for > > >>>>>>>>>> each test. > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> Management ("what's the overhead of managing such a > deployment") > > >>>>>>>>>> -------------------- > > >>>>>>>>>> Summary: I propose that anyone can contribute scripts for > > >>>>>>>>>> setting up > > >>>>>>>>>> > > >>>>>>>>> data > > >>>>>>>> > > >>>>>>>>> store instances + integration/perf tests, but if the community > > >>>>>>>>>> doesn't > > >>>>>>>>>> maintain a particular data store's tests, we disable the tests > > and > > >>>>>>>>>> turn off > > >>>>>>>>>> the data store instances. > > >>>>>>>>>> > > >>>>>>>>>> Management of these instances is a crucial question. First, > > let's > > >>>>>>>>>> > > >>>>>>>>> break > > >>>>> > > >>>>>> down what tasks we'll need to do on a recurring basis: > > >>>>>>>>>> 1. Ongoing maintenance (update to new versions, both instance > & > > >>>>>>>>>> dependencies) - we don't want to have a lot of old versions > > >>>>>>>>>> vulnerable > > >>>>>>>>>> > > >>>>>>>>> to > > >>>>>>>> > > >>>>>>>>> attacks/buggy > > >>>>>>>>>> 2. 
Investigate breakages/regressions > > >>>>>>>>>> (I'm betting there will be more things we'll discover - let me > > >>>>>>>>>> know if > > >>>>>>>>>> you > > >>>>>>>>>> have suggestions) > > >>>>>>>>>> > > >>>>>>>>>> There's a couple goals I see: > > >>>>>>>>>> 1. We should only do sys admin work for things that give us a > > >>>>>>>>>> lot of > > >>>>>>>>>> benefit. (ie, don't build IT/perf/data store set up scripts > for > > >>>>>>>>>> data > > >>>>>>>>>> stores > > >>>>>>>>>> without a large community) > > >>>>>>>>>> 2. We should do as much as possible of testing via > > >>>>>>>>>> in-memory/embedded > > >>>>>>>>>> testing (as you brought up). > > >>>>>>>>>> 3. Reduce the amount of manual administration overhead > > >>>>>>>>>> > > >>>>>>>>>> As I discussed above, I think that integration > tests/performance > > >>>>>>>>>> benchmarks > > >>>>>>>>>> are costly things that we should do only for the IO transforms > > >>>>>>>>>> with > > >>>>>>>>>> > > >>>>>>>>> large > > >>>>>>>> > > >>>>>>>>> amounts of community support/usage. Thus, I propose that we > > >>>>>>>>>> limit the > > >>>>>>>>>> > > >>>>>>>>> IO > > >>>>>> > > >>>>>>> transforms that get integration tests & performance benchmarks to > > >>>>>>>>>> > > >>>>>>>>> those > > >>>>> > > >>>>>> that have community support for maintaining the data store > > >>>>>>>>>> instances. > > >>>>>>>>>> > > >>>>>>>>>> We can enforce this organically using some simple rules: > > >>>>>>>>>> 1. Investigating breakages/regressions: if a given > > >>>>>>>>>> integration/perf > > >>>>>>>>>> > > >>>>>>>>> test > > >>>>>> > > >>>>>>> starts failing and no one investigates it within a set period of > > >>>>>>>>>> time > > >>>>>>>>>> > > >>>>>>>>> (a > > >>>>>> > > >>>>>>> week?), we disable the tests and shut off the data store > > >>>>>>>>>> instances if > > >>>>>>>>>> > > >>>>>>>>> we > > >>>>>> > > >>>>>>> have instances running. When someone wants to step up and > > >>>>>>>>>> support it > > >>>>>>>>>> again, > > >>>>>>>>>> they can fix the test, check it in, and re-enable the test. > > >>>>>>>>>> 2. Ongoing maintenance: every N months, file a jira issue that > > >>>>>>>>>> is just > > >>>>>>>>>> "is > > >>>>>>>>>> the IO Transform X data store up to date?" - if the jira is > not > > >>>>>>>>>> resolved in > > >>>>>>>>>> a set period of time (1 month?), the perf/integration tests > are > > >>>>>>>>>> > > >>>>>>>>> disabled, > > >>>>>>>> > > >>>>>>>>> and the data store instances shut off. > > >>>>>>>>>> > > >>>>>>>>>> This is pretty flexible - > > >>>>>>>>>> * If a particular person or organization wants to support an > IO > > >>>>>>>>>> transform, > > >>>>>>>>>> they can. If a group of people all organically organize to > keep > > >>>>>>>>>> the > > >>>>>>>>>> > > >>>>>>>>> tests > > >>>>>>>> > > >>>>>>>>> running, they can. > > >>>>>>>>>> * It can be mostly automated - there's not a lot of central > > >>>>>>>>>> organizing > > >>>>>>>>>> work > > >>>>>>>>>> that needs to be done. > > >>>>>>>>>> > > >>>>>>>>>> Exposing the information about what IO transforms currently > have > > >>>>>>>>>> > > >>>>>>>>> running > > >>>>>> > > >>>>>>> IT/perf benchmarks on the website will let users know what IO > > >>>>>>>>>> > > >>>>>>>>> transforms > > >>>>>> > > >>>>>>> are well supported. > > >>>>>>>>>> > > >>>>>>>>>> I like this solution, but I also recognize this is a tricky > > >>>>>>>>>> problem. 
> > >>>>>>>>>> > > >>>>>>>>> This > > >>>>>>>> > > >>>>>>>>> is something the community needs to be supportive of, so I'm > > >>>>>>>>>> open to > > >>>>>>>>>> other > > >>>>>>>>>> thoughts. > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> Simulating failures in real nodes ("programmatic tests to > simulate > > >>>>>>>>>> failure") > > >>>>>>>>>> ----------------- > > >>>>>>>>>> Summary: 1) Focus our testing on the code in Beam 2) We should > > >>>>>>>>>> encourage a > > >>>>>>>>>> design pattern separating out network/retry logic from the main > IO > > >>>>>>>>>> transform logic > > >>>>>>>>>> > > >>>>>>>>>> We *could* create instance failure in any container management > > >>>>>>>>>> > > >>>>>>>>> software > > >>>>> > > >>>>>> - > > >>>>>>>> > > >>>>>>>>> we can use their programmatic APIs to determine which > containers > > >>>>>>>>>> are > > >>>>>>>>>> running the instances, and ask them to kill the container in > > >>>>>>>>>> question. > > >>>>>>>>>> > > >>>>>>>>> A > > >>>>>> > > >>>>>>> slow node would be trickier, but I'm sure we could figure it out > > >>>>>>>>>> - for > > >>>>>>>>>> example, add a network proxy that would delay responses. > > >>>>>>>>>> > > >>>>>>>>>> However, I would argue that this type of testing doesn't gain > us a > > >>>>>>>>>> lot, and > > >>>>>>>>>> is complicated to set up. I think it will be easier to test > > >>>>>>>>>> network > > >>>>>>>>>> errors > > >>>>>>>>>> and retry behavior in unit tests for the IO transforms. > > >>>>>>>>>> > > >>>>>>>>>> Part of the way to handle this is to separate out the read > code > > >>>>>>>>>> from > > >>>>>>>>>> > > >>>>>>>>> the > > >>>>>> > > >>>>>>> network code (eg. bigtable has BigtableService). If you put the > > >>>>>>>>>> > > >>>>>>>>> "handle > > >>>>> > > >>>>>> errors/retry logic" code in a separate MySourceService class, > > >>>>>>>>>> you can > > >>>>>>>>>> test > > >>>>>>>>>> MySourceService on the wide variety of network errors/data > store > > >>>>>>>>>> problems, > > >>>>>>>>>> and then your main IO transform tests focus on the read > behavior > > >>>>>>>>>> and > > >>>>>>>>>> handling the small set of errors the MySourceService class > will > > >>>>>>>>>> > > >>>>>>>>> return. > > >>>>> > > >>>>>> I also think we should focus on testing the IO Transform, not > > >>>>>>>>>> the data > > >>>>>>>>>> store - if we kill a node in a data store, it's that data > store's > > >>>>>>>>>> problem, > > >>>>>>>>>> not beam's problem. As you were pointing out, there are a > *large* > > >>>>>>>>>> number of > > >>>>>>>>>> possible ways that a particular data store can fail, and we > > >>>>>>>>>> would like > > >>>>>>>>>> > > >>>>>>>>> to > > >>>>>>>> > > >>>>>>>>> support many different data stores. Rather than try to test > that > > >>>>>>>>>> each > > >>>>>>>>>> data > > >>>>>>>>>> store behaves well, we should ensure that we handle > > >>>>>>>>>> generic/expected > > >>>>>>>>>> errors > > >>>>>>>>>> in a graceful manner. > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> Ismael had a couple other quick comments/questions, I'll > answer > > >>>>>>>>>> here > > >>>>>>>>>> > > >>>>>>>>> - > > >>>>> > > >>>>>> We can use this to test other runners running on multiple > > >>>>>>>>>> machines - I > > >>>>>>>>>> agree. This is also necessary for a good performance benchmark > > >>>>>>>>>> test.
> > >>>>>>>>>> > > >>>>>>>>>> "providing the test machines to mount the cluster" - we can > > >>>>>>>>>> discuss > > >>>>>>>>>> > > >>>>>>>>> this > > >>>>>> > > >>>>>>> further, but one possible option is that google may be willing to > > >>>>>>>>>> > > >>>>>>>>> donate > > >>>>>> > > >>>>>>> something to support this. > > >>>>>>>>>> > > >>>>>>>>>> "IO Consistency" - let's follow up on those questions in > another > > >>>>>>>>>> > > >>>>>>>>> thread. > > >>>>>> > > >>>>>>> That's as much about the public interface we provide to users as > > >>>>>>>>>> > > >>>>>>>>> anything > > >>>>>>>> > > >>>>>>>>> else. I agree with your sentiment that a user should be able to > > >>>>>>>>>> expect > > >>>>>>>>>> predictable behavior from the different IO transforms. > > >>>>>>>>>> > > >>>>>>>>>> Thanks for everyone's questions/comments - I really am excited > > >>>>>>>>>> to see > > >>>>>>>>>> that > > >>>>>>>>>> people care about this :) > > >>>>>>>>>> > > >>>>>>>>>> Stephen > > >>>>>>>>>> > > >>>>>>>>>> On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <[email protected]> > > >>>>>>>>>> > > >>>>>>>>> wrote: > > >>>>> > > >>>>>> Hello, > > >>>>>>>>>>> > > >>>>>>>>>>> @Stephen Thanks for your proposal, it is really interesting, > I > > >>>>>>>>>>> would > > >>>>>>>>>>> really > > >>>>>>>>>>> like to help with this. I have never played with Kubernetes > but > > >>>>>>>>>>> this > > >>>>>>>>>>> seems > > >>>>>>>>>>> a really nice chance to do something useful with it. > > >>>>>>>>>>> > > >>>>>>>>>>> We (at Talend) are testing most of the IOs using simple > container > > >>>>>>>>>>> > > >>>>>>>>>> images > > >>>>>>>> > > >>>>>>>>> and in some particular cases ‘clusters’ of containers using > > >>>>>>>>>>> docker-compose > > >>>>>>>>>>> (a little bit like Amit’s (2) proposal). It would be really > > >>>>>>>>>>> nice to > > >>>>>>>>>>> > > >>>>>>>>>> have > > >>>>>>>> > > >>>>>>>>> this at the Beam level, in particular to try to test more > complex > > >>>>>>>>>>> semantics, I don’t know how programmable kubernetes is to > achieve > > >>>>>>>>>>> this for > > >>>>>>>>>>> example: > > >>>>>>>>>>> > > >>>>>>>>>>> Say we have a cluster of Cassandra or Kafka nodes, I > > >>>>>>>>>>> would > > >>>>>>>>>>> like to > > >>>>>>>>>>> have programmatic tests to simulate failure (e.g. kill a > node), > > >>>>>>>>>>> or > > >>>>>>>>>>> simulate > > >>>>>>>>>>> a really slow node, to ensure that the IO behaves as expected > > >>>>>>>>>>> in the > > >>>>>>>>>>> Beam > > >>>>>>>>>>> pipeline for the given runner. > > >>>>>>>>>>> > > >>>>>>>>>>> Another related idea is to improve IO consistency: Today the > > >>>>>>>>>>> different IOs > > >>>>>>>>>>> have small differences in their failure behavior; I really > > >>>>>>>>>>> would like > > >>>>>>>>>>> to be > > >>>>>>>>>>> able to predict with more precision what will happen in case > of > > >>>>>>>>>>> > > >>>>>>>>>> errors, > > >>>>>> > > >>>>>>> e.g. what is the correct behavior if I am writing to a Kafka > > >>>>>>>>>>> node and > > >>>>>>>>>>> there > > >>>>>>>>>>> is a network partition, does the Kafka sink retry or not? And > > >>>>>>>>>>> what > > >>>>>>>>>>> if it > > >>>>>>>>>>> is the JdbcIO? Will it work the same, e.g. assuming > > >>>>>>>>>>> checkpointing?
> > >>>>>>>>>>> Or do > > >>>>>>>>>>> we guarantee exactly-once writes somehow? Today I am not > sure > > >>>>>>>>>>> about > > >>>>>>>>>>> what > > >>>>>>>>>>> happens (or if the expected behavior depends on the runner), > > >>>>>>>>>>> but well > > >>>>>>>>>>> maybe > > >>>>>>>>>>> it is just that I don’t know and we have tests to ensure > this. > > >>>>>>>>>>> > > >>>>>>>>>>> Of course both are really hard problems, but I think with > your > > >>>>>>>>>>> proposal we > > >>>>>>>>>>> can try to tackle them, as well as the performance ones. And > > >>>>>>>>>>> apart from > > >>>>>>>>>>> the > > >>>>>>>>>>> data stores, I think it will also be really nice to be able > to > > >>>>>>>>>>> test > > >>>>>>>>>>> > > >>>>>>>>>> the > > >>>>>> > > >>>>>>> runners in a distributed manner. > > >>>>>>>>>>> > > >>>>>>>>>>> So what is the next step? How do you imagine such integration > > >>>>>>>>>>> tests? > > >>>>>>>>>>> Who > > >>>>>>>>>>> can provide the test machines so we can mount the cluster? > > >>>>>>>>>>> > > >>>>>>>>>>> Maybe my ideas are a bit too far away for an initial setup, > but > > >>>>>>>>>>> it > > >>>>>>>>>>> will be > > >>>>>>>>>>> really nice to start working on this. > > >>>>>>>>>>> > > >>>>>>>>>>> Ismael > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <[email protected]> > > >>>>>>>>>>> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>> Hi Stephen, > > >>>>>>>>>>>> > > >>>>>>>>>>>> I was wondering about how we plan to use the data stores > > across > > >>>>>>>>>>>> > > >>>>>>>>>>> executions. > > >>>>>>>>>>> > > >>>>>>>>>>>> Clearly, it's best to set up a new instance (container) for > > every > > >>>>>>>>>>>> > > >>>>>>>>>>> test, > > >>>>>> > > >>>>>>> running a "standalone" store (say HBase/Cassandra for > > >>>>>>>>>>>> example), and > > >>>>>>>>>>>> once > > >>>>>>>>>>>> the test is done, tear down the instance. It should also be > > >>>>>>>>>>>> agnostic > > >>>>>>>>>>>> > > >>>>>>>>>>> to > > >>>>>> > > >>>>>>> the > > >>>>>>>>>>> > > >>>>>>>>>>>> runtime environment (e.g., Docker on Kubernetes). > > >>>>>>>>>>>> I'm wondering though what's the overhead of managing such a > > >>>>>>>>>>>> > > >>>>>>>>>>> deployment > > >>>>>> > > >>>>>>> which could become heavy and complicated as more IOs are > > >>>>>>>>>>>> supported > > >>>>>>>>>>>> > > >>>>>>>>>>> and > > >>>>>> > > >>>>>>> more > > >>>>>>>>>>> > > >>>>>>>>>>>> test cases introduced. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Another way to go would be to have small clusters of > different > > >>>>>>>>>>>> data > > >>>>>>>>>>>> > > >>>>>>>>>>> stores > > >>>>>>>>>>> > > >>>>>>>>>>>> and run against new "namespaces" (while lazily evicting old > > >>>>>>>>>>>> ones), > > >>>>>>>>>>>> but I > > >>>>>>>>>>>> think this is less likely as maintaining a distributed > instance > > >>>>>>>>>>>> > > >>>>>>>>>>> (even > > >>>>> > > >>>>>> a > > >>>>>>>> > > >>>>>>>>> small one) for each data store sounds even more complex. > > >>>>>>>>>>>> > > >>>>>>>>>>>> A third approach would be to simply have an "embedded" > > >>>>>>>>>>>> in-memory > > >>>>>>>>>>>> instance of a data store as part of a test that runs against > > it > > >>>>>>>>>>>> (such as > > >>>>>>>>>>>> > > >>>>>>>>>>> an > > >>>>>>>>>>> > > >>>>>>>>>>>> embedded Kafka, though not a data store). > > >>>>>>>>>>>> This is probably the simplest solution in terms of > > >>>>>>>>>>>> orchestration, > > >>>>>>>>>>>> but it > > >>>>>>>>>>>> depends on having a proper "embedded" implementation for an > > IO.
> > >>>>>>>>>>>> > > >>>>>>>>>>>> Does this make sense to you? Have you considered it? > > >>>>>>>>>>>> > > >>>>>>>>>>>> Thanks, > > >>>>>>>>>>>> Amit > > >>>>>>>>>>>> > > >>>>>>>>>>>> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <[email protected]> > > >>>>>> wrote: > > >>>>>>>>>>>> > > >>>>>>>>>>>> Hi Stephen, > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> as already discussed a bit together, it sounds great! > > >>>>>>>>>>>>> Especially, I > > >>>>>>>>>>>>> > > >>>>>>>>>>>> like > > >>>>>>>>>>> > > >>>>>>>>>>>> it as both an integration test platform and good coverage for > > >>>>>>>>>>>>> IOs. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> I'm very late on this but, as said, I will share with you > my > > >>>>>>>>>>>>> > > >>>>>>>>>>>> Marathon > > >>>>>> > > >>>>>>> JSON and Mesos docker images. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> By the way, I started to experiment a bit with kubernetes and > > >>>>>>>>>>>>> swarm, but it's > > >>>>>>>>>>>>> not yet complete. I will share what I have on the same > github > > >>>>>>>>>>>>> repo. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Thanks! > > >>>>>>>>>>>>> Regards > > >>>>>>>>>>>>> JB > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> On 11/16/2016 11:36 PM, Stephen Sisk wrote: > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> Hi everyone! > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Currently we have a good set of unit tests for our IO > > >>>>>>>>>>>>>> Transforms - > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> those > > >>>>>>>>>>>> > > >>>>>>>>>>>>> tend to run against in-memory versions of the data stores. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> However, > > >>>>> > > >>>>>> we'd > > >>>>>>>>>>>> > > >>>>>>>>>>>>> like to further increase our test coverage to include > > >>>>>>>>>>>>>> running them > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> against > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> real instances of the data stores that the IO Transforms > > work > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> against > > >>>>>>>> > > >>>>>>>>> (e.g. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> cassandra, mongodb, kafka, etc…), which means we'll need > to > > >>>>>>>>>>>>>> have > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> real > > >>>>>>>> > > >>>>>>>>> instances of various data stores. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Additionally, if we want to do performance regression > > >>>>>>>>>>>>>> detection, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> it's > > >>>>>>>> > > >>>>>>>>> important to have instances of the services that behave > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> realistically, > > >>>>>>>>>>> > > >>>>>>>>>>>> which isn't true of in-memory or dev versions of the > services. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Proposed solution > > >>>>>>>>>>>>>> ------------------------- > > >>>>>>>>>>>>>> If we accept this proposal, we would create an > > >>>>>>>>>>>>>> infrastructure for > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> running > > >>>>>>>>>>>> > > >>>>>>>>>>>>> real instances of data stores inside of containers, using > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> container > > >>>>> > > >>>>>> management software like mesos/marathon, kubernetes, docker > > >>>>>>>>>>>>>> swarm, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> etc… > > >>>>>>>>>>> > > >>>>>>>>>>>> to > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> manage the instances.
> > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> This would enable us to build integration tests that run > > >>>>>>>>>>>>>> against > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> those > > >>>>>>>>>>> > > >>>>>>>>>>>> real > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> instances and performance tests that run against those > real > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> instances > > >>>>>>>> > > >>>>>>>>> (like > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> those that Jason Kuster is proposing elsewhere.) > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Why do we need one centralized set of instances vs just > > having > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> various > > >>>>>>>>>>> > > >>>>>>>>>>>> people host their own instances? > > >>>>>>>>>>>>>> ------------------------- > > >>>>>>>>>>>>>> Reducing flakiness of tests is key. By not having > > dependencies > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> from > > >>>>> > > >>>>>> the > > >>>>>>>>>>> > > >>>>>>>>>>>> core project on external services/instances of data stores > > >>>>>>>>>>>>>> we have > > >>>>>>>>>>>>>> guaranteed access to the services and the group can fix > > issues > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> that > > >>>>> > > >>>>>> arise. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> An exception would be something that has an ops team > > >>>>>>>>>>>>>> supporting it > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> (eg, > > >>>>>>>>>>> > > >>>>>>>>>>>> AWS, Google Cloud or other professionally managed service) - > > >>>>>>>>>>>>>> those > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> we > > >>>>>>>> > > >>>>>>>>> trust > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> will be stable. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> There may be a lot of different data stores needed - how > > >>>>>>>>>>>>>> will we > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> maintain > > >>>>>>>>>>>> > > >>>>>>>>>>>>> them? > > >>>>>>>>>>>>>> ------------------------- > > >>>>>>>>>>>>>> It will take work above and beyond that of a normal set of > > >>>>>>>>>>>>>> unit > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> tests > > >>>>>>>> > > >>>>>>>>> to > > >>>>>>>>>>>> > > >>>>>>>>>>>>> build and maintain integration/performance tests & their > data > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> store > > >>>>> > > >>>>>> instances. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Setup & maintenance of the data store containers and data > > >>>>>>>>>>>>>> store > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> instances > > >>>>>>>>>>>> > > >>>>>>>>>>>>> on it must be automated. It also has to be as simple of a > > >>>>>>>>>>>>>> setup as > > >>>>>>>>>>>>>> possible, and we should avoid hand tweaking the > containers - > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> expecting > > >>>>>>>>>>> > > >>>>>>>>>>>> checked in scripts/dockerfiles is key. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Aligned with the community ownership approach of Apache, > as > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> members > > >>>>> > > >>>>>> of > > >>>>>>>>>>> > > >>>>>>>>>>>> the > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> community are excited to contribute & maintain those tests > > >>>>>>>>>>>>>> and the > > >>>>>>>>>>>>>> integration/performance tests, people will be able to step > > >>>>>>>>>>>>>> up and > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> do > > >>>>>> > > >>>>>>> that. 
> > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> If there is no longer support for maintaining a particular > > >>>>>>>>>>>>>> set of > > >>>>>>>>>>>>>> integration & performance tests and their data store > > >>>>>>>>>>>>>> instances, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> then > > >>>>>> > > >>>>>>> we > > >>>>>>>>>>> > > >>>>>>>>>>>> can > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> disable those tests. We may document on the website what > IO > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> Transforms > > >>>>>>>>>>> > > >>>>>>>>>>>> have > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> current integration/performance tests so users know what > > >>>>>>>>>>>>>> level of > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> testing > > >>>>>>>>>>>> > > >>>>>>>>>>>>> the various IO Transforms have. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> What about requirements for the container management > > software > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> itself? > > >>>>>>>> > > >>>>>>>>> ------------------------- > > >>>>>>>>>>>>>> * We should have the data store instances themselves in > > >>>>>>>>>>>>>> Docker. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> Docker > > >>>>>>>>>>> > > >>>>>>>>>>>> allows new instances to be spun up in a quick, reproducible > > way > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> and > > >>>>> > > >>>>>> is > > >>>>>>>>>>> > > >>>>>>>>>>>> fairly platform independent. It has wide support from a > > >>>>>>>>>>>>>> variety of > > >>>>>>>>>>>>>> different container management services. > > >>>>>>>>>>>>>> * As little admin work required as possible. Crashing > > >>>>>>>>>>>>>> instances > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> should > > >>>>>>>>>>> > > >>>>>>>>>>>> be > > >>>>>>>>>>>> > > >>>>>>>>>>>>> restarted, setup should be simple, everything possible > > >>>>>>>>>>>>>> should be > > >>>>>>>>>>>>>> scripted/scriptable. > > >>>>>>>>>>>>>> * Logs and test output should be on a publicly available > > >>>>>>>>>>>>>> website, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> without > > >>>>>>>>>>>> > > >>>>>>>>>>>>> needing to log into test execution machine. Centralized > > >>>>>>>>>>>>>> capture of > > >>>>>>>>>>>>>> monitoring info/logs from instances running in the > > containers > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> would > > >>>>> > > >>>>>> support > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> this. Ideally, this would just be supported by the > container > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> software > > >>>>>>>> > > >>>>>>>>> out > > >>>>>>>>>>>> > > >>>>>>>>>>>>> of the box. > > >>>>>>>>>>>>>> * It'd be useful to have good persistent volume in the > > >>>>>>>>>>>>>> container > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> management > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> software so that databases don't have to reload large data > > >>>>>>>>>>>>>> sets > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> every > > >>>>>>>> > > >>>>>>>>> time. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> * The containers may be a place to execute runners > > >>>>>>>>>>>>>> themselves if > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> we > > >>>>> > > >>>>>> need > > >>>>>>>>>>>> > > >>>>>>>>>>>>> larger runner instances, so it should play well with Spark, > > >>>>>>>>>>>>>> Flink, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> etc… > > >>>>>>>>>>> > > >>>>>>>>>>>> As I discussed earlier on the mailing list, it looks like > > >>>>>>>>>>>>>> hosting > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> docker > > >>>>>>>>>>>> > > >>>>>>>>>>>>> containers on kubernetes, docker swarm or mesos+marathon > > >>>>>>>>>>>>>> would be > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> a > > >>>>> > > >>>>>> good > > >>>>>>>>>>>> > > >>>>>>>>>>>>> solution. 
> > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Thanks, > > >>>>>>>>>>>>>> Stephen Sisk > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> -- > > >>>>>>>>>>>>> Jean-Baptiste Onofré > > >>>>>>>>>>>>> [email protected] > > >>>>>>>>>>>>> http://blog.nanthrax.net > > >>>>>>>>>>>>> Talend - http://www.talend.com > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> -- > > >>>>>>>> Jean-Baptiste Onofré > > >>>>>>>> [email protected] > > >>>>>>>> http://blog.nanthrax.net > > >>>>>>>> Talend - http://www.talend.com
