Hi Ismaël,

Stephen will reply with details, but I know he did a comparison and evaluated different options.
He tested with the JDBC IO itests.

Regards
JB

On Jan 18, 2017, 08:26, "Ismaël Mejía" <[email protected]> wrote:

> Thanks for your analysis Stephen, good arguments / references.
>
> One quick question: have you checked the APIs of both (Mesos/Kubernetes) to see if we can do more complex tests programmatically (I suppose so, but you don't mention how easy or whether those are possible), for example to simulate a slow networking slave (to test stragglers), or to arbitrarily kill one slave (e.g. if I want to test the correct behavior of a runner/IO that is reading from it)?
>
> Another missing point in the review is the availability of ready-to-play packages. I think in this area mesos/dcos seems more advanced, no? I haven't looked recently, but at least 6 months ago there were not many helm packages ready, for example to test kafka or the hadoop ecosystem stuff (hdfs, hbase, etc.). Has this improved? Preparing this is also a considerable amount of work; on the other hand, it could also be a chance to contribute to kubernetes.
>
> Regards,
> Ismaël
>
> On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <[email protected]> wrote:
>
>> hi!
>>
>> I've been continuing this investigation, and have some more info to report; hopefully we can start making some decisions.
>>
>> To support performance testing, I've been investigating mesos+marathon and kubernetes for running data stores in their high availability mode. I have been examining the features that kubernetes/mesos+marathon use to support this.
>>
>> Setting up a multi-node cluster in high availability mode tends to be more expensive time-wise than the single-node instances I've played around with in the past. Rather than do a full build-out with both kubernetes and mesos, I'd like to pick one of the two options to build the prototype cluster with. If the prototype doesn't go well, we could still go back to the other option, but I'd like to change us from a mode of "let's look at all the options" to one of "here's the favorite, let's prove that works for us".
>>
>> Below are the features that I've seen are important to multi-node instances of data stores. I'm sure other folks on the list have done this before, so feel free to pipe up if I'm missing a good solution to a problem.
>>
>> DNS/Discovery
>> --------------------
>> Necessary for talking between nodes (e.g., cassandra nodes all need to be able to talk to a set of seed nodes.)
>>
>> * Kubernetes has built-in DNS/discovery between nodes.
>> * Mesos supports this via mesos-dns, which isn't part of core mesos, but is in DC/OS, which is the mesos distribution I've been using and the one I would expect us to use.
>>
>> Instances properly distributed across nodes
>> ------------------------------------------------------------
>> If multiple instances of a data source end up on the same underlying VM, we may not get good performance out of those instances, since the underlying VM may be more taxed than other VMs.
>>
>> * Kubernetes has a beta feature, StatefulSets [1], which allows containers to be distributed so that there's one container per underlying machine (as well as a lot of other useful features like easy, stable DNS names.)
>> * Mesos can support this via the built-in UNIQUE constraint [2].
>>
>> Load balancing
>> --------------------
>> Incoming requests from users need to be distributed to the various machines; this is important for many data stores' high availability modes.
>>
>> * Kubernetes supports easily hooking up to an external load balancer when on a cloud (and can be configured to work with a built-in load balancer if not).
>> * Mesos supports this via marathon-lb [3], which is an installable package in DC/OS.
>>
>> Persistent volumes tied to specific instances
>> ------------------------------------------------------------
>> Databases often need persistent state (for example, to store the data :), so it's an important part of running our service.
>>
>> * Kubernetes StatefulSets support this.
>> * Mesos+marathon apps with persistent volumes support this [4] [5].
>>
>> As I mentioned above, I'd like to focus on either kubernetes or mesos for my investigation, and as I go further along, I'm seeing kubernetes as better suited to our needs:
>>
>> (1) It supports more of the features we want out of the box, and with StatefulSets, Kubernetes handles them all together neatly - e.g. DC/OS requires marathon-lb to be installed and mesos-dns to be configured.
>>
>> (2) I'm also finding that there seem to be more examples of using kubernetes to solve the types of problems we're working on. This is somewhat subjective, but in my experience as I've tried to learn both kubernetes and mesos, I personally found it generally easier to get kubernetes running than mesos, due to the tutorials/examples available for kubernetes.
>>
>> (3) Lower cost of initial setup - as I discussed in a previous mail [6], kubernetes was far easier to get set up even when I knew the exact steps. Mesos took me around 27 steps [7], which involved a lot of config that was easy to get wrong (it took me about 5 tries to get all the steps correct in one go.) Kubernetes took me around 8 steps and very little config.
>>
>> Given that, I'd like to focus my investigation/prototyping on Kubernetes. To be clear, it's fairly close, and I think both Mesos and Kubernetes could support what we need, so if we run into issues with kubernetes, Mesos still seems like a viable option that we could fall back to.
>>
>> Thanks,
>> Stephen
>>
>> [1] Kubernetes StatefulSets https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/
>> [2] mesos unique constraint - https://mesosphere.github.io/marathon/docs/constraints.html
>> [3] https://mesosphere.github.io/marathon/docs/service-discovery-load-balancing.html and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/
>> [4] https://mesosphere.github.io/marathon/docs/persistent-volumes.html
>> [5] https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/
>> [6] Container Orchestration software for hosting data stores https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>> [7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md
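On Ismaël's "kill one slave" question above: both schedulers expose this programmatically (Marathon via its REST API, Kubernetes via the API server or kubectl). A rough sketch of the shape this could take on Kubernetes - assuming kubectl is installed and already configured against the test cluster, and with a hypothetical pod name - a test harness could shell out like so, and the orchestrator's restart policy then exercises the IO's reconnect/retry path:

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    /** Sketch only: injects a node failure by deleting one data store pod. */
    public class NodeFailureInjector {

      /** Kills one pod; the orchestrator is expected to restart it. */
      public static void killPod(String podName) throws IOException, InterruptedException {
        // Equivalent to running: kubectl delete pod <podName>
        Process p = new ProcessBuilder("kubectl", "delete", "pod", podName)
            .inheritIO()
            .start();
        if (!p.waitFor(60, TimeUnit.SECONDS) || p.exitValue() != 0) {
          throw new IOException("kubectl delete pod " + podName + " failed");
        }
      }
    }

A slow node would need something extra (e.g. a delaying network proxy in front of the pod), as discussed further down-thread.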
>>
>> On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <[email protected]> wrote:
>>
>>> Just a quick drive-by comment: how tests are laid out has non-trivial tradeoffs on how/where continuous integration runs, and how results are integrated into the tooling. The current state is certainly not ideal (e.g., due to multiple test executions, some links in Jenkins point where they shouldn't), but most other alternatives had even bigger drawbacks at the time. If someone has great ideas that don't explode the number of modules, please share ;-)
>>>
>>> On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot <[email protected]> wrote:
>>>
>>>> Hi Stephen,
>>>>
>>>> Thanks for taking the time to comment.
>>>>
>>>> My comments are below in the email:
>>>>
>>>> Le 24/12/2016 à 00:07, Stephen Sisk a écrit :
>>>>
>>>>> hey Etienne -
>>>>>
>>>>> thanks for your thoughts and thanks for sharing your experiences. I generally agree with what you're saying. Quick comments below:
>>>>>
>>>>>> IT are stored alongside the UT in the src/test directory of the IO, but they might go to a dedicated module, waiting for a consensus
>>>>>
>>>>> I don't have a strong opinion, nor do I feel that I've worked enough with maven to understand all the consequences - I'd love for someone with more maven experience to weigh in. If this becomes blocking, I'd say check it in, and we can refactor later if it proves problematic.
>>>>
>>>> Sure, not a blocking point, it could be refactored afterwards. Just as a reminder, JB mentioned that storing IT in a separate module allows more coherence between all IT (same behavior) and enables cross-IO integration tests. JB, have you experienced any long-term drawbacks of storing IT in a separate module, like, for example, more difficult maintenance due to "distance" from the production code?
>>>>
>>>>>> Also IMHO, it is better that tests load/clean data than make assumptions about the running order of the tests.
>>>>>
>>>>> I definitely agree that we don't want to make assumptions about the running order of the tests - that way lies pain. :) It will be interesting to see how the performance tests work out, since they will need more data (and thus loading data can take much longer.)
>>>>
>>>> Yes, performance testing might push in the direction of loading data from outside the tests, due to loading time.
>>>>
>>>>> This should also be an easier problem for read tests than for write tests - if we have long-running instances, read tests don't really need cleanup. And if write tests only write a small amount of data, as long as we are sure we're writing to uniquely identifiable locations (i.e., a new table per test or something similar), we can clean up the write test data on a slower schedule.
>>>>
>>>> I agree.
>>>>
>>>>>> this will tend to go in the direction of long-running data store instances rather than data store instances started (and optionally loaded) before tests.
>>>>>
>>>>> It may be easiest to start with a "data stores stay running" implementation, and then if we see issues with that, move towards tests that start/stop the data stores on each run. One thing I'd like to make sure of is that we're not manually tweaking the configurations of data stores. One way we could do that is to destroy/recreate the data stores on a slower schedule - maybe once per week.
>>>>> That way if the script is changed or the data store instances are changed, we'd be able to detect it relatively soon, while still removing the need for the tests to manage the data stores.
>>>>
>>>> I agree. In addition to manual configuration tweaking, there might be cases in which a data store re-partitions data during a test or after some tests as the dataset changes. The IO must be tolerant to that, but the asserts in tests (number of bundles, for example) must not fail in that case. I would also prefer, if possible, that the tests do not manage the data stores (not set them up, not start them, not stop them).
>>>>
>>>>> as a general note, I suspect many of the folks in the states will be on holiday until Jan 2nd/3rd.
>>>>>
>>>>> S
>>>>>
>>>>> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Recently we had a discussion about integration tests of IOs. I'm preparing a PR for integration tests of the elasticsearch IO (https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO as a first shot), which are very important IMHO because they helped catch some bugs that UT could not (volume, data store instance sharing, real data store instance...)
>>>>>>
>>>>>> I would like to have your thoughts/remarks about the points below. Some of these points are also discussed here https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a :
>>>>>>
>>>>>> - UT and IT have a similar architecture, but while UT focus on testing the correct behavior of the code, including corner cases, and use embedded in-memory data stores, IT assume that the behavior is correct (strong UT) and focus on higher-volume testing and testing against real data store instance(s).
>>>>>>
>>>>>> - For now, IT are stored alongside the UT in the src/test directory of the IO, but they might go to a dedicated module, waiting for a consensus. Maven is not configured to run them automatically, because the data store is not available on the jenkins server yet.
>>>>>>
>>>>>> - For now, they only use DirectRunner, but they will be run against each runner.
>>>>>>
>>>>>> - IT do not set up the data store instance (as stated in the above document); they assume that one is already running (hardcoded configuration in the test for now, waiting for a common solution to pass configuration to IT). A docker container script is provided in the contrib directory as a starting point for whatever orchestration software will be chosen.
>>>>>>
>>>>>> - IT load and clean test data before and after each test if needed. It is simpler to do so because some tests need an empty data store (write test) and because, as discussed in the document, tests might not be the only users of the data store. Also IMHO, it is better that tests load/clean data than make assumptions about the running order of the tests (a sketch of this pattern follows below).
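As a concrete illustration of that last point, here is a minimal JUnit 4 sketch of the load/clean-per-test pattern. It uses plain JDBC for concreteness (the thread mentions the JDBC IO itests); the connection string, system property, and schema are hypothetical stand-ins for whatever common IT configuration mechanism gets chosen:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    public class JdbcIOIT {
      // Hypothetical: would come from shared IT configuration, not be hardcoded.
      private static final String JDBC_URL =
          System.getProperty("jdbcIT.url", "jdbc:postgresql://localhost:5432/beam_it");
      // Unique per run, so concurrent runs on a shared instance don't collide.
      private static final String TABLE = "beam_it_" + System.nanoTime();

      private Connection connection;

      @Before
      public void loadTestData() throws Exception {
        connection = DriverManager.getConnection(JDBC_URL);
        try (Statement stmt = connection.createStatement()) {
          stmt.execute("CREATE TABLE " + TABLE + " (id INT, name VARCHAR(64))");
        }
        try (PreparedStatement insert =
            connection.prepareStatement("INSERT INTO " + TABLE + " VALUES (?, ?)")) {
          for (int i = 0; i < 1000; i++) {
            insert.setInt(1, i);
            insert.setString(2, "row-" + i);
            insert.addBatch();
          }
          insert.executeBatch();
        }
      }

      @After
      public void cleanTestData() throws Exception {
        // Clean up our own data; no assumptions about other tests or test order.
        try (Statement stmt = connection.createStatement()) {
          stmt.execute("DROP TABLE " + TABLE);
        }
        connection.close();
      }

      @Test
      public void testRead() throws Exception {
        // Run the pipeline against the already-running instance and assert on results.
      }
    }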
>>>>>>
>>>>>> If we generalize this pattern to all IT, this will tend to go in the direction of long-running data store instances rather than data store instances started (and optionally loaded) before the tests.
>>>>>>
>>>>>> Besides, if we were to change our minds and load data from outside the tests, a logstash script is provided.
>>>>>>
>>>>>> If you have any thoughts or remarks, I'm all ears :)
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Etienne
>>>>>>
>>>>>> Le 14/12/2016 à 17:07, Jean-Baptiste Onofré a écrit :
>>>>>>
>>>>>>> Hi Stephen,
>>>>>>>
>>>>>>> the purpose of having a specific module is to share resources, apply the same behavior from the IT perspective, and be able to have "cross-IO" IT (for instance, reading from JMS and sending to Kafka; I think that's the key idea for integration tests).
>>>>>>>
>>>>>>> For instance, in Karaf, we have:
>>>>>>> - utest in each module
>>>>>>> - an itest module containing itests for all modules together
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
>>>>>>>
>>>>>>>> Hi Etienne,
>>>>>>>>
>>>>>>>> thanks for following up and answering my questions.
>>>>>>>>
>>>>>>>> re: where to store integration tests - having them all in a separate module is an interesting idea. I couldn't find JB's comments about moving them into a separate module in the PR - can you share the reasons for doing so? The IO integration/perf tests do seem like they'll need to be treated in a special manner, but given that there is already an IO-specific module, it may just be that we need to treat all the ITs in the IO module the same way. I don't have strong opinions either way right now.
>>>>>>>>
>>>>>>>> S
>>>>>>>>
>>>>>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> @Stephen: I addressed all your comments directly in the PR, thanks! I just wanted to comment here about the docker image I used: the only official Elastic image contains only Elasticsearch. But for testing I needed logstash (for ingestion) and kibana (not for integration tests, but to easily test REST requests to ES using sense). This is why I use an ELK (Elasticsearch+Logstash+Kibana) image. This one is released under the apache 2 license.
>>>>>>>>>
>>>>>>>>> Besides, there is also a point about where to store integration tests: JB proposed in the PR to store integration tests in a dedicated module rather than directly in the IO module (like I did).
>>>>>>>>>
>>>>>>>>> Etienne
>>>>>>>>>
>>>>>>>>> Le 01/12/2016 à 20:14, Stephen Sisk a écrit :
>>>>>>>>>
>>>>>>>>>> hey!
>>>>>>>>>>
>>>>>>>>>> thanks for sending this. I'm very excited to see this change. I added some detail-oriented code review comments in addition to what I've discussed here.
>>>>>>>>>>
>>>>>>>>>> The general goal is to allow for re-usable instantiation of particular data store instances, and this seems like a good start.
>>>>>>>>>> Looks like you also have a script to generate test data for your tests - that's great.
>>>>>>>>>>
>>>>>>>>>> The next steps (definitely not blocking your work) will be to have ways to create instances from the docker images you have here, and use them in the tests. We'll need support in the test framework for that, since it'll be different on developer machines and in the beam jenkins cluster, but your scripts here allow someone running these tests locally to not have to worry about getting the instance set up, and they can manually adjust, so this is a good incremental step.
>>>>>>>>>>
>>>>>>>>>> I have some thoughts now that I'm reviewing your scripts (that I didn't have previously, so we are learning this together):
>>>>>>>>>> * It may be useful to try to document why we chose a particular docker image as the base (i.e., "this is the official supported elasticsearch docker image" or "this image has several data stores together that can be used for a couple of different tests") - I'm curious whether the community thinks that is important.
>>>>>>>>>>
>>>>>>>>>> One thing that I called out in a comment that's worth mentioning on the larger list - if you want to specify which specific runners a test uses, that can be controlled in the pom for the module. I updated the testing doc mentioned previously in this thread with a TODO to talk about this more. I think we should also make it so that IO modules have that automatically, so developers don't have to worry about it.
>>>>>>>>>>
>>>>>>>>>> S
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Stephen,
>>>>>>>>>>>
>>>>>>>>>>> As discussed, I added the injection script, docker container scripts and integration tests to the sdks/java/io/elasticsearch/contrib directory in that PR: https://github.com/apache/incubator-beam/pull/1439 (https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9).
>>>>>>>>>>>
>>>>>>>>>>> These work well, but they are a first shot. Do you have any comments about them?
>>>>>>>>>>>
>>>>>>>>>>> Besides, I am not very sure that these files should be in the IO itself (even in a contrib directory, out of the maven source directories). Any thoughts?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Etienne
>>>>>>>>>>>
>>>>>>>>>>> Le 23/11/2016 à 19:03, Stephen Sisk a écrit :
>>>>>>>>>>>
>>>>>>>>>>>> It's great to hear more experiences.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm also glad to hear that people see real value in the high volume/performance benchmark tests. I tried to capture that in the Testing doc I shared, under "Reasons for Beam Test Strategy" [1].
>>>>>>>>>>>>
>>>>>>>>>>>> It does generally sound like we're in agreement here. Areas of discussion I see:
>>>>>>>>>>>> 1. People like the idea of bringing up fresh instances for each test rather than keeping instances running all the time, since that ensures no contamination between tests. That seems reasonable to me. If we see flakiness in the tests, or we note that setting up/tearing down instances is taking a lot of time, we can revisit that.
>>>>>>>>>>>> 2. Deciding on cluster management/orchestration software - I want to make sure we land on the right tool here, since choosing the wrong tool could result in administration of the instances taking more work. I suspect that's a good place for a follow-up discussion, so I'll start a separate thread on that. I'm happy with whatever tool we choose, but I want to make sure we take a moment to consider different options and have a reason for choosing one.
>>>>>>>>>>>>
>>>>>>>>>>>> Etienne - thanks for being willing to port your creation/other scripts over. You might be a good early tester of whether this system works well for everyone.
>>>>>>>>>>>>
>>>>>>>>>>>> Stephen
>>>>>>>>>>>>
>>>>>>>>>>>> [1] Reasons for Beam Test Strategy - https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I second Etienne there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We worked together on the ElasticsearchIO and definitely, the most valuable tests we did were the integration tests with ES on docker and high volume.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think we have to distinguish two kinds of tests:
>>>>>>>>>>>>> 1. utests are located in the IO itself, and basically they should cover the core behaviors of the IO
>>>>>>>>>>>>> 2. itests are located as contrib in the IO (they could be part of the IO but executed by the integration-test plugin or a specific profile) and deal with a "real" backend and high volumes. The resources required by the itests can be bootstrapped by Jenkins (for instance using Mesos/Marathon and docker images as already discussed; it's what I'm doing on my own "server").
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's basically what Stephen described.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We must not rely only on itests: utests are very important and they validate the core behavior.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My $0.01 ;)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> JB
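JB's utest/itest split can be kept mechanical on the test side as well. A sketch of the gating half, under the assumption (hypothetical here) that ITs read the real backend's address from a system property and skip themselves when it is absent - e.g. when the integration-test plugin or profile isn't active:

    import static org.junit.Assume.assumeNotNull;

    import org.junit.Before;
    import org.junit.Test;

    /**
     * Integration test: runs only when a real backend is available.
     * The "es.it.host" property name is an illustrative convention.
     */
    public class ElasticsearchIOIT {
      private String host;

      @Before
      public void checkBackendConfigured() {
        host = System.getProperty("es.it.host");
        // Skip (rather than fail) when no real instance was provisioned,
        // e.g. in a plain `mvn test` run without the itest profile.
        assumeNotNull(host);
      }

      @Test
      public void testReadHighVolume() {
        // Connect to the real instance at `host` and run the pipeline against a
        // large dataset; utests cover correctness with an embedded instance instead.
      }
    }

Something like `mvn verify -Des.it.host=10.0.0.5` under the itest profile would then enable them.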
>> > >>>>>>>> >> > >>>>>>>> My $0.01 ;) >> > >>>>>>>> >> > >>>>>>>> Regards >> > >>>>>>>> JB >> > >>>>>>>> >> > >>>>>>>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote: >> > >>>>>>>> >> > >>>>>>>>> Hi Stephen, >> > >>>>>>>>> >> > >>>>>>>>> I like your proposition very much and I also agree that >docker >> + >> > >>>>>>>>> some >> > >>>>>>>>> orchestration software would be great ! >> > >>>>>>>>> >> > >>>>>>>>> On the elasticsearchIO (PR to be created this week) there >is >> > docker >> > >>>>>>>>> container creation scripts and logstash data ingestion >script >> for >> > >>>>>>>>> IT >> > >>>>>>>>> environment available in contrib directory alongside with >> > >>>>>>>>> integration >> > >>>>>>>>> tests themselves. I'll be happy to make them compliant to >new >> IT >> > >>>>>>>>> environment. >> > >>>>>>>>> >> > >>>>>>>>> What you say bellow about the need for external IT >environment >> is >> > >>>>>>>>> particularly true. As an example with ES what came out in >first >> > >>>>>>>>> implementation was that there were problems starting at >some >> high >> > >>>>>>>>> >> > >>>>>>>> volume >> > >>>>> >> > >>>>>> of data (timeouts, ES windowing overflow...) that could not >have >> be >> > >>>>>>>>> >> > >>>>>>>> seen >> > >>>>> >> > >>>>>> on embedded ES version. Also there where some >particularities to >> > >>>>>>>>> external instance like secondary (replica) shards that >where >> not >> > >>>>>>>>> >> > >>>>>>>> visible >> > >>>>> >> > >>>>>> on embedded instance. >> > >>>>>>>>> >> > >>>>>>>>> Besides, I also favor bringing up instances before test >because >> > it >> > >>>>>>>>> allows (amongst other things) to be sure to start on a >fresh >> > >>>>>>>>> dataset >> > >>>>>>>>> >> > >>>>>>>> for >> > >>>>> >> > >>>>>> the test to be deterministic. >> > >>>>>>>>> >> > >>>>>>>>> Etienne >> > >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> Le 23/11/2016 à 02:00, Stephen Sisk a écrit : >> > >>>>>>>>> >> > >>>>>>>>>> Hi, >> > >>>>>>>>>> >> > >>>>>>>>>> I'm excited we're getting lots of discussion going. >There are >> > many >> > >>>>>>>>>> threads >> > >>>>>>>>>> of conversation here, we may choose to split some of >them off >> > >>>>>>>>>> into a >> > >>>>>>>>>> different email thread. I'm also betting I missed some >of the >> > >>>>>>>>>> questions in >> > >>>>>>>>>> this thread, so apologies ahead of time for that. Also >> apologies >> > >>>>>>>>>> for >> > >>>>>>>>>> >> > >>>>>>>>> the >> > >>>>>> >> > >>>>>>> amount of text, I provided some quick summaries at the top >of >> each >> > >>>>>>>>>> section. >> > >>>>>>>>>> >> > >>>>>>>>>> Amit - thanks for your thoughts. I've responded in >detail >> below. >> > >>>>>>>>>> Ismael - thanks for offering to help. There's plenty of >work >> > >>>>>>>>>> here to >> > >>>>>>>>>> >> > >>>>>>>>> go >> > >>>>> >> > >>>>>> around. I'll try and think about how we can divide up some >next >> > >>>>>>>>>> steps >> > >>>>>>>>>> (probably in a separate thread.) The main next step I >see is >> > >>>>>>>>>> deciding >> > >>>>>>>>>> between kubernetes/mesos+marathon/docker swarm - I'm >working >> on >> > >>>>>>>>>> that, >> > >>>>>>>>>> >> > >>>>>>>>> but >> > >>>>>>>> >> > >>>>>>>>> having lots of different thoughts on what the >> > >>>>>>>>>> advantages/disadvantages >> > >>>>>>>>>> >> > >>>>>>>>> of >> > >>>>>>>> >> > >>>>>>>>> those are would be helpful (I'm not entirely sure of the >> > >>>>>>>>>> protocol for >> > >>>>>>>>>> collaborating on sub-projects like this.) 
>> > >>>>>>>>>> >> > >>>>>>>>>> These issues are all related to what kind of tests we >want to >> > >>>>>>>>>> write. I >> > >>>>>>>>>> think a kubernetes/mesos/swarm cluster could support all >the >> use >> > >>>>>>>>>> cases >> > >>>>>>>>>> we've discussed here (and thus should not block moving >forward >> > >>>>>>>>>> with >> > >>>>>>>>>> this), >> > >>>>>>>>>> but understanding what we want to test will help us >understand >> > >>>>>>>>>> how the >> > >>>>>>>>>> cluster will be used. I'm working on a proposed user >guide for >> > >>>>>>>>>> testing >> > >>>>>>>>>> >> > >>>>>>>>> IO >> > >>>>>>>> >> > >>>>>>>>> Transforms, and I'm going to send out a link to that + a >short >> > >>>>>>>>>> summary >> > >>>>>>>>>> >> > >>>>>>>>> to >> > >>>>>>>> >> > >>>>>>>>> the list shortly so folks can get a better sense of where >I'm >> > >>>>>>>>>> coming >> > >>>>>>>>>> from. >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> Here's my thinking on the questions we've raised here - >> > >>>>>>>>>> >> > >>>>>>>>>> Embedded versions of data stores for testing >> > >>>>>>>>>> -------------------- >> > >>>>>>>>>> Summary: yes! But we still need real data stores to test >> > against. >> > >>>>>>>>>> >> > >>>>>>>>>> I am a gigantic fan of using embedded versions of the >various >> > data >> > >>>>>>>>>> stores. >> > >>>>>>>>>> I think we should test everything we possibly can using >them, >> > >>>>>>>>>> and do >> > >>>>>>>>>> >> > >>>>>>>>> the >> > >>>>>> >> > >>>>>>> majority of our correctness testing using embedded versions >+ the >> > >>>>>>>>>> >> > >>>>>>>>> direct >> > >>>>>> >> > >>>>>>> runner. However, it's also important to have at least one >test >> that >> > >>>>>>>>>> actually connects to an actual instance, so we can get >> coverage >> > >>>>>>>>>> for >> > >>>>>>>>>> things >> > >>>>>>>>>> like credentials, real connection strings, etc... >> > >>>>>>>>>> >> > >>>>>>>>>> The key point is that embedded versions definitely can't >cover >> > the >> > >>>>>>>>>> performance tests, so we need to host instances if we >want to >> > test >> > >>>>>>>>>> >> > >>>>>>>>> that. >> > >>>>>> >> > >>>>>>> I consider the integration tests/performance benchmarks to >be >> > >>>>>>>>>> costly >> > >>>>>>>>>> things >> > >>>>>>>>>> that we do only for the IO transforms with large amounts >of >> > >>>>>>>>>> community >> > >>>>>>>>>> support/usage. A random IO transform used by a few users >> doesn't >> > >>>>>>>>>> necessarily need integration & perf tests, but for >heavily >> used >> > IO >> > >>>>>>>>>> transforms, there's a lot of community value in these >tests. >> The >> > >>>>>>>>>> maintenance proposal below scales with the amount of >community >> > >>>>>>>>>> support >> > >>>>>>>>>> for >> > >>>>>>>>>> a particular IO transform. >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> Reusing data stores ("use the data stores across >executions.") >> > >>>>>>>>>> ------------------ >> > >>>>>>>>>> Summary: I favor a hybrid approach: some frequently >used, very >> > >>>>>>>>>> small >> > >>>>>>>>>> instances that we keep up all the time + larger >> multi-container >> > >>>>>>>>>> data >> > >>>>>>>>>> store >> > >>>>>>>>>> instances that we spin up for perf tests. >> > >>>>>>>>>> >> > >>>>>>>>>> I don't think we need to have a strong answer to this >> question, >> > >>>>>>>>>> but I >> > >>>>>>>>>> think >> > >>>>>>>>>> we do need to know what range of capabilities we need, >and use >> > >>>>>>>>>> that to >> > >>>>>>>>>> inform our requirements on the hosting infrastructure. 
>>>>>>>>>>>>>>> I think kubernetes/mesos + docker can support all the scenarios I discuss below.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I had been thinking of a hybrid approach - reuse some instances and don't reuse others. Some tests require isolation from other tests (e.g. performance benchmarking), while others can easily re-use the same database/data store instance over time, provided they are written in the correct manner (e.g. simple read or write correctness integration tests).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To me, the question of whether to use one instance over time for a test vs. spin up an instance for each test comes down to a trade-off between these factors:
>>>>>>>>>>>>>>> 1. Flakiness of spin-up of an instance - if it's super flaky, we'll want to keep more instances up and running rather than bring them up/down. (This may also vary by the data store in question.)
>>>>>>>>>>>>>>> 2. Frequency of testing - if we are running tests every 5 minutes, it may be wasteful to bring machines up/down every time. If we run tests once a day or week, it seems wasteful to keep the machines up the whole time.
>>>>>>>>>>>>>>> 3. Isolation requirements - if tests must be isolated, it means we either have to bring up the instances for each test, or we have to have some sort of signaling mechanism to indicate that a given instance is in use. I strongly favor bringing up an instance per test.
>>>>>>>>>>>>>>> 4. Number/size of containers - if we need a large number of machines for a particular test, keeping them running all the time will use more resources.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The major unknown to me is how flaky it'll be to spin these up. I'm hopeful/assuming they'll be pretty stable to bring up, but I think the best way to test that is to start doing it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I suspect the sweet spot is the following: have a set of very small data store instances that stay up to support small-data-size post-commit end-to-end tests (post-commits run frequently, and the data size means the instances would not use many resources), combined with the ability to spin up larger instances for once-a-day/week performance benchmarks (these use more resources and are used less frequently.) That's the mix I'll propose in my docs on testing IO transforms.
>>>>>>>>>>>>>>> If spinning up new instances is cheap/non-flaky, I'd be fine with the idea of spinning up instances for each test.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Management ("what's the overhead of managing such a deployment")
>>>>>>>>>>>>>>> --------------------
>>>>>>>>>>>>>>> Summary: I propose that anyone can contribute scripts for setting up data store instances + integration/perf tests, but if the community doesn't maintain a particular data store's tests, we disable the tests and turn off the data store instances.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Management of these instances is a crucial question. First, let's break down what tasks we'll need to do on a recurring basis:
>>>>>>>>>>>>>>> 1. Ongoing maintenance (update to new versions, both instance & dependencies) - we don't want to have a lot of old versions vulnerable to attacks/buggy
>>>>>>>>>>>>>>> 2. Investigate breakages/regressions
>>>>>>>>>>>>>>> (I'm betting there will be more things we'll discover - let me know if you have suggestions.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are a couple of goals I see:
>>>>>>>>>>>>>>> 1. We should only do sysadmin work for things that give us a lot of benefit (i.e., don't build IT/perf/data store setup scripts for data stores without a large community).
>>>>>>>>>>>>>>> 2. We should do as much as possible of the testing via in-memory/embedded testing (as you brought up).
>>>>>>>>>>>>>>> 3. Reduce the amount of manual administration overhead.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As I discussed above, I think that integration tests/performance benchmarks are costly things that we should do only for the IO transforms with large amounts of community support/usage. Thus, I propose that we limit the IO transforms that get integration tests & performance benchmarks to those that have community support for maintaining the data store instances.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We can enforce this organically using some simple rules:
>>>>>>>>>>>>>>> 1. Investigating breakages/regressions: if a given integration/perf test starts failing and no one investigates it within a set period of time (a week?), we disable the tests and shut off the data store instances if we have instances running. When someone wants to step up and support it again, they can fix the test, check it in, and re-enable the test.
>>>>>>>>>>>>>>> 2. Ongoing maintenance: every N months, file a jira issue that is just "is the IO Transform X data store up to date?" - if the jira is not resolved in a set period of time (1 month?), the perf/integration tests are disabled and the data store instances shut off.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is pretty flexible -
>>>>>>>>>>>>>>> * If a particular person or organization wants to support an IO transform, they can. If a group of people all organically organize to keep the tests running, they can.
>>>>>>>>>>>>>>> * It can be mostly automated - there's not a lot of central organizing work that needs to be done.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Exposing the information about which IO transforms currently have running IT/perf benchmarks on the website will let users know which IO transforms are well supported.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I like this solution, but I also recognize this is a tricky problem. This is something the community needs to be supportive of, so I'm open to other thoughts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Simulating failures in real nodes ("programmatic tests to simulate failure")
>>>>>>>>>>>>>>> -----------------
>>>>>>>>>>>>>>> Summary: 1) Focus our testing on the code in Beam. 2) We should encourage a design pattern separating out network/retry logic from the main IO transform logic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We *could* create instance failure in any container management software - we can use their programmatic APIs to determine which containers are running the instances, and ask them to kill the container in question. A slow node would be trickier, but I'm sure we could figure it out - for example, add a network proxy that would delay responses.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, I would argue that this type of testing doesn't gain us a lot, and is complicated to set up. I think it will be easier to test network errors and retry behavior in unit tests for the IO transforms.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Part of the way to handle this is to separate the read code from the network code (e.g. bigtable has BigtableService). If you put the "handle errors/retry logic" code in a separate MySourceService class, you can test MySourceService against a wide variety of network errors/data store problems, and then your main IO transform tests focus on the read behavior and handling the small set of errors the MySourceService class will return.
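A minimal sketch of that separation, keeping the hypothetical MySourceService name from the message (BigtableService is the real precedent; the rest is illustrative):

    import java.io.IOException;
    import java.util.Iterator;

    /**
     * Hypothetical service interface: all connection, timeout and retry
     * logic lives behind it, mirroring the BigtableService pattern.
     */
    interface MySourceService {
      Iterator<String> read(String query) throws IOException;
    }

    /** Production implementation: real client, retries, backoff, etc. */
    class RealSourceService implements MySourceService {
      @Override
      public Iterator<String> read(String query) throws IOException {
        // Real network handling goes here, and is unit-tested separately
        // against simulated network failures.
        throw new UnsupportedOperationException("sketch only");
      }
    }

    /** Test fake: injects one of the few errors the service can surface. */
    class FailingSourceService implements MySourceService {
      @Override
      public Iterator<String> read(String query) throws IOException {
        throw new IOException("simulated: data store unreachable after retries");
      }
    }

The IO transform would take a MySourceService (or a factory for one), so its tests can swap in FailingSourceService without touching the network.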
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also think we should focus on testing the IO Transform, not the data store - if we kill a node in a data store, it's that data store's problem, not beam's problem. As you were pointing out, there is a *large* number of possible ways that a particular data store can fail, and we would like to support many different data stores. Rather than trying to test that each data store behaves well, we should ensure that we handle generic/expected errors in a graceful manner.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ismaël had a couple of other quick comments/questions; I'll answer here -
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We can use this to test other runners running on multiple machines - I agree. This is also necessary for a good performance benchmark test.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "providing the test machines to mount the cluster" - we can discuss this further, but one possible option is that google may be willing to donate something to support this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "IO Consistency" - let's follow up on those questions in another thread. That's as much about the public interface we provide to users as anything else. I agree with your sentiment that a user should be able to expect predictable behavior from the different IO transforms.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for everyone's questions/comments - I really am excited to see that people care about this :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Stephen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Stephen Thanks for your proposal, it is really interesting; I would really like to help with this. I have never played with Kubernetes, but this seems a really nice chance to do something useful with it.
>> > >>>>>>>>>>> >> > >>>>>>>>>>> We (at Talend) are testing most of the IOs using simple >> > container >> > >>>>>>>>>>> >> > >>>>>>>>>> images >> > >>>>>>>> >> > >>>>>>>>> and in some particular cases ‘clusters’ of containers >using >> > >>>>>>>>>>> docker-compose >> > >>>>>>>>>>> (a little bit like Amit’s (2) proposal). It would be >really >> > >>>>>>>>>>> nice to >> > >>>>>>>>>>> >> > >>>>>>>>>> have >> > >>>>>>>> >> > >>>>>>>>> this at the Beam level, in particular to try to test more >> complex >> > >>>>>>>>>>> semantics, I don’t know how programmable kubernetes is >to >> > achieve >> > >>>>>>>>>>> this for >> > >>>>>>>>>>> example: >> > >>>>>>>>>>> >> > >>>>>>>>>>> Let’s think we have a cluster of Cassandra or Kafka >nodes, I >> > >>>>>>>>>>> would >> > >>>>>>>>>>> like to >> > >>>>>>>>>>> have programmatic tests to simulate failure (e.g. kill >a >> node), >> > >>>>>>>>>>> or >> > >>>>>>>>>>> simulate >> > >>>>>>>>>>> a really slow node, to ensure that the IO behaves as >expected >> > >>>>>>>>>>> in the >> > >>>>>>>>>>> Beam >> > >>>>>>>>>>> pipeline for the given runner. >> > >>>>>>>>>>> >> > >>>>>>>>>>> Another related idea is to improve IO consistency: >Today the >> > >>>>>>>>>>> different IOs >> > >>>>>>>>>>> have small differences in their failure behavior, I >really >> > >>>>>>>>>>> would like >> > >>>>>>>>>>> to be >> > >>>>>>>>>>> able to predict with more precision what will happen in >case >> of >> > >>>>>>>>>>> >> > >>>>>>>>>> errors, >> > >>>>>> >> > >>>>>>> e.g. what is the correct behavior if I am writing to a >Kafka >> > >>>>>>>>>>> node and >> > >>>>>>>>>>> there >> > >>>>>>>>>>> is a network partition, does the Kafka sink retries or >no ? >> and >> > >>>>>>>>>>> what >> > >>>>>>>>>>> if it >> > >>>>>>>>>>> is the JdbcIO ?, will it work the same e.g. assuming >> > >>>>>>>>>>> checkpointing? >> > >>>>>>>>>>> Or do >> > >>>>>>>>>>> we guarantee exactly once writes somehow?, today I am >not >> sure >> > >>>>>>>>>>> about >> > >>>>>>>>>>> what >> > >>>>>>>>>>> happens (or if the expected behavior depends on the >runner), >> > >>>>>>>>>>> but well >> > >>>>>>>>>>> maybe >> > >>>>>>>>>>> it is just that I don’t know and we have tests to >ensure >> this. >> > >>>>>>>>>>> >> > >>>>>>>>>>> Of course both are really hard problems, but I think >with >> your >> > >>>>>>>>>>> proposal we >> > >>>>>>>>>>> can try to tackle them, as well as the performance >ones. And >> > >>>>>>>>>>> apart of >> > >>>>>>>>>>> the >> > >>>>>>>>>>> data stores, I think it will be also really nice to be >able >> to >> > >>>>>>>>>>> test >> > >>>>>>>>>>> >> > >>>>>>>>>> the >> > >>>>>> >> > >>>>>>> runners in a distributed manner. >> > >>>>>>>>>>> >> > >>>>>>>>>>> So what is the next step? How do you imagine such >integration >> > >>>>>>>>>>> tests? >> > >>>>>>>>>>> ? Who >> > >>>>>>>>>>> can provide the test machines so we can mount the >cluster? >> > >>>>>>>>>>> >> > >>>>>>>>>>> Maybe my ideas are a bit too far away for an initial >setup, >> but >> > >>>>>>>>>>> it >> > >>>>>>>>>>> will be >> > >>>>>>>>>>> really nice to start working on this. >> > >>>>>>>>>>> >> > >>>>>>>>>>> Ismael >> > >>>>>>>>>>> >> > >>>>>>>>>>> >> > >>>>>>>>>>> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela < >> > >>>>>>>>>>> [email protected] >> > >>>>>>>>>>> wrote: >> > >>>>>>>>>>> >> > >>>>>>>>>>> Hi Stephen, >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> I was wondering about how we plan to use the data >stores >> > across >> > >>>>>>>>>>>> >> > >>>>>>>>>>> executions. 
>> > >>>>>>>>>>> >> > >>>>>>>>>>>> Clearly, it's best to setup a new instance (container) >for >> > every >> > >>>>>>>>>>>> >> > >>>>>>>>>>> test, >> > >>>>>> >> > >>>>>>> running a "standalone" store (say HBase/Cassandra for >> > >>>>>>>>>>>> example), and >> > >>>>>>>>>>>> once >> > >>>>>>>>>>>> the test is done, teardown the instance. It should >also be >> > >>>>>>>>>>>> agnostic >> > >>>>>>>>>>>> >> > >>>>>>>>>>> to >> > >>>>>> >> > >>>>>>> the >> > >>>>>>>>>>> >> > >>>>>>>>>>>> runtime environment (e.g., Docker on Kubernetes). >> > >>>>>>>>>>>> I'm wondering though what's the overhead of managing >such a >> > >>>>>>>>>>>> >> > >>>>>>>>>>> deployment >> > >>>>>> >> > >>>>>>> which could become heavy and complicated as more IOs are >> > >>>>>>>>>>>> supported >> > >>>>>>>>>>>> >> > >>>>>>>>>>> and >> > >>>>>> >> > >>>>>>> more >> > >>>>>>>>>>> >> > >>>>>>>>>>>> test cases introduced. >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> Another way to go would be to have small clusters of >> different >> > >>>>>>>>>>>> data >> > >>>>>>>>>>>> >> > >>>>>>>>>>> stores >> > >>>>>>>>>>> >> > >>>>>>>>>>>> and run against new "namespaces" (while lazily >evicting old >> > >>>>>>>>>>>> ones), >> > >>>>>>>>>>>> but I >> > >>>>>>>>>>>> think this is less likely as maintaining a distributed >> > instance >> > >>>>>>>>>>>> >> > >>>>>>>>>>> (even >> > >>>>> >> > >>>>>> a >> > >>>>>>>> >> > >>>>>>>>> small one) for each data store sounds even more complex. >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> A third approach would be to to simply have an >"embedded" >> > >>>>>>>>>>>> in-memory >> > >>>>>>>>>>>> instance of a data store as part of a test that runs >against >> > it >> > >>>>>>>>>>>> (such as >> > >>>>>>>>>>>> >> > >>>>>>>>>>> an >> > >>>>>>>>>>> >> > >>>>>>>>>>>> embedded Kafka, though not a data store). >> > >>>>>>>>>>>> This is probably the simplest solution in terms of >> > >>>>>>>>>>>> orchestration, >> > >>>>>>>>>>>> but it >> > >>>>>>>>>>>> depends on having a proper "embedded" implementation >for an >> > IO. >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> Does this make sense to you ? have you considered it ? >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> Thanks, >> > >>>>>>>>>>>> Amit >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré < >> > >>>>>>>>>>>> >> > >>>>>>>>>>> [email protected] >> > >>>>> >> > >>>>>> wrote: >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> Hi Stephen, >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> as already discussed a bit together, it sounds great >! >> > >>>>>>>>>>>>> Especially I >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>> like >> > >>>>>>>>>>> >> > >>>>>>>>>>>> it as a both integration test platform and good >coverage for >> > >>>>>>>>>>>>> IOs. >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> I'm very late on this but, as said, I will share with >you >> my >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>> Marathon >> > >>>>>> >> > >>>>>>> JSON and Mesos docker images. >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> By the way, I started to experiment a bit kubernetes >and >> > >>>>>>>>>>>>> swamp but >> > >>>>>>>>>>>>> it's >> > >>>>>>>>>>>>> not yet complete. I will share what I have on the >same >> github >> > >>>>>>>>>>>>> repo. >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> Thanks ! >> > >>>>>>>>>>>>> Regards >> > >>>>>>>>>>>>> JB >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> On 11/16/2016 11:36 PM, Stephen Sisk wrote: >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Hi everyone! 
>> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Currently we have a good set of unit tests for our >IO >> > >>>>>>>>>>>>>> Transforms - >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> those >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> tend to run against in-memory versions of the data >stores. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> However, >> > >>>>> >> > >>>>>> we'd >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> like to further increase our test coverage to include >> > >>>>>>>>>>>>>> running them >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> against >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> real instances of the data stores that the IO >Transforms >> > work >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> against >> > >>>>>>>> >> > >>>>>>>>> (e.g. >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> cassandra, mongodb, kafka, etc…), which means we'll >need >> to >> > >>>>>>>>>>>>>> have >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> real >> > >>>>>>>> >> > >>>>>>>>> instances of various data stores. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Additionally, if we want to do performance >regression >> > >>>>>>>>>>>>>> detection, >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> it's >> > >>>>>>>> >> > >>>>>>>>> important to have instances of the services that behave >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> realistically, >> > >>>>>>>>>>> >> > >>>>>>>>>>>> which isn't true of in-memory or dev versions of the >> services. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Proposed solution >> > >>>>>>>>>>>>>> ------------------------- >> > >>>>>>>>>>>>>> If we accept this proposal, we would create an >> > >>>>>>>>>>>>>> infrastructure for >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> running >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> real instances of data stores inside of containers, >using >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> container >> > >>>>> >> > >>>>>> management software like mesos/marathon, kubernetes, docker >> > >>>>>>>>>>>>>> swarm, >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> etc… >> > >>>>>>>>>>> >> > >>>>>>>>>>>> to >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> manage the instances. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> This would enable us to build integration tests that >run >> > >>>>>>>>>>>>>> against >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> those >> > >>>>>>>>>>> >> > >>>>>>>>>>>> real >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> instances and performance tests that run against >those >> real >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> instances >> > >>>>>>>> >> > >>>>>>>>> (like >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> those that Jason Kuster is proposing elsewhere.) >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Why do we need one centralized set of instances vs >just >> > having >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> various >> > >>>>>>>>>>> >> > >>>>>>>>>>>> people host their own instances? >> > >>>>>>>>>>>>>> ------------------------- >> > >>>>>>>>>>>>>> Reducing flakiness of tests is key. By not having >> > dependencies >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> from >> > >>>>> >> > >>>>>> the >> > >>>>>>>>>>> >> > >>>>>>>>>>>> core project on external services/instances of data >stores >> > >>>>>>>>>>>>>> we have >> > >>>>>>>>>>>>>> guaranteed access to the services and the group can >fix >> > issues >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> that >> > >>>>> >> > >>>>>> arise. 
>> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> An exception would be something that has an ops team >> > >>>>>>>>>>>>>> supporting it >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> (eg, >> > >>>>>>>>>>> >> > >>>>>>>>>>>> AWS, Google Cloud or other professionally managed >service) - >> > >>>>>>>>>>>>>> those >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> we >> > >>>>>>>> >> > >>>>>>>>> trust >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> will be stable. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> There may be a lot of different data stores needed - >how >> > >>>>>>>>>>>>>> will we >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> maintain >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> them? >> > >>>>>>>>>>>>>> ------------------------- >> > >>>>>>>>>>>>>> It will take work above and beyond that of a normal >set of >> > >>>>>>>>>>>>>> unit >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> tests >> > >>>>>>>> >> > >>>>>>>>> to >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> build and maintain integration/performance tests & >their >> data >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> store >> > >>>>> >> > >>>>>> instances. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Setup & maintenance of the data store containers and >data >> > >>>>>>>>>>>>>> store >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> instances >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> on it must be automated. It also has to be as simple >of a >> > >>>>>>>>>>>>>> setup as >> > >>>>>>>>>>>>>> possible, and we should avoid hand tweaking the >> containers - >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> expecting >> > >>>>>>>>>>> >> > >>>>>>>>>>>> checked in scripts/dockerfiles is key. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Aligned with the community ownership approach of >Apache, >> as >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> members >> > >>>>> >> > >>>>>> of >> > >>>>>>>>>>> >> > >>>>>>>>>>>> the >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> community are excited to contribute & maintain those >tests >> > >>>>>>>>>>>>>> and the >> > >>>>>>>>>>>>>> integration/performance tests, people will be able >to step >> > >>>>>>>>>>>>>> up and >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> do >> > >>>>>> >> > >>>>>>> that. >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> If there is no longer support for maintaining a >particular >> > >>>>>>>>>>>>>> set of >> > >>>>>>>>>>>>>> integration & performance tests and their data store >> > >>>>>>>>>>>>>> instances, >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> then >> > >>>>>> >> > >>>>>>> we >> > >>>>>>>>>>> >> > >>>>>>>>>>>> can >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> disable those tests. We may document on the website >what >> IO >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> Transforms >> > >>>>>>>>>>> >> > >>>>>>>>>>>> have >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> current integration/performance tests so users know >what >> > >>>>>>>>>>>>>> level of >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> testing >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> the various IO Transforms have. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> What about requirements for the container management >> > software >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> itself? >> > >>>>>>>> >> > >>>>>>>>> ------------------------- >> > >>>>>>>>>>>>>> * We should have the data store instances themselves >in >> > >>>>>>>>>>>>>> Docker. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> Docker >> > >>>>>>>>>>> >> > >>>>>>>>>>>> allows new instances to be spun up in a quick, >reproducible >> > way >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> and >> > >>>>> >> > >>>>>> is >> > >>>>>>>>>>> >> > >>>>>>>>>>>> fairly platform independent. 
>>>>>>>>>>>>>>>>>>> It has wide support from a variety of different container management services.
>>>>>>>>>>>>>>>>>>> * As little admin work required as possible. Crashed instances should be restarted, setup should be simple, and everything possible should be scripted/scriptable.
>>>>>>>>>>>>>>>>>>> * Logs and test output should be on a publicly available website, without needing to log into the test execution machine. Centralized capture of monitoring info/logs from instances running in the containers would support this. Ideally, this would just be supported by the container software out of the box.
>>>>>>>>>>>>>>>>>>> * It'd be useful to have good persistent volumes in the container management software, so that databases don't have to reload large data sets every time.
>>>>>>>>>>>>>>>>>>> * The containers may be a place to execute runners themselves if we need larger runner instances, so it should play well with Spark, Flink, etc…
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As I discussed earlier on the mailing list, it looks like hosting docker containers on kubernetes, docker swarm or mesos+marathon would be a good solution.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Stephen Sisk
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>>>>>>>>>> Talend - http://www.talend.com
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>>>>> Talend - http://www.talend.com
