Hi Ismaël,

Stephen will reply with details, but I know he did a comparison and evaluated different options.
He tested with the JDBC IO itests.

Regards
JB

On Jan 18, 2017, 08:26, "Ismaël Mejía" <[email protected]> wrote:

> Thanks for your analysis Stephen, good arguments / references.
>
> One quick question: have you checked the APIs of both (Mesos/Kubernetes) to see if we can do more complex tests programmatically (I suppose so, but you don't mention how easy or whether those are possible), for example to simulate a slow networking slave (to test stragglers), or to arbitrarily kill one slave (e.g. if I want to test the correct behavior of a runner/IO that is reading from it)?
>
> Another missing point in the review is the availability of ready-to-play packages. I think in this area mesos/dcos seems more advanced, no? I haven't looked recently, but at least 6 months ago there were not many helm packages ready, for example to test kafka or the hadoop ecosystem stuff (hdfs, hbase, etc.). Has this improved? Preparing this is also a considerable amount of work; on the other hand, it could also be a chance to contribute to kubernetes.
>
> Regards,
> Ismaël
>
> On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <[email protected]> wrote:
>
>> hi!
>>
>> I've been continuing this investigation, and have some more info to report; hopefully we can start making some decisions.
>>
>> To support performance testing, I've been investigating mesos+marathon and kubernetes for running data stores in their high availability mode. I have been examining the features that kubernetes/mesos+marathon use to support this.
>>
>> Setting up a multi-node cluster in high availability mode tends to be more expensive time-wise than the single-node instances I've played around with in the past. Rather than do a full build-out with both kubernetes and mesos, I'd like to pick one of the two options to build the prototype cluster with. If the prototype doesn't go well, we could still go back to the other option, but I'd like to change us from a mode of "let's look at all the options" to one of "here's the favorite, let's prove that works for us".
>>
>> Below are the features that I've seen are important to multi-node instances of data stores. I'm sure other folks on the list have done this before, so feel free to pipe up if I'm missing a good solution to a problem.
>>
>> DNS/Discovery
>> --------------------
>> Necessary for talking between nodes (e.g., cassandra nodes all need to be able to talk to a set of seed nodes.)
>>
>> * Kubernetes has built-in DNS/discovery between nodes.
>> * Mesos supports this via mesos-dns, which isn't part of core mesos, but is in DC/OS, which is the mesos distribution I've been using and the one I would expect us to use.
>>
>> Instances properly distributed across nodes
>> ------------------------------------------------------------
>> If multiple instances of a data source end up on the same underlying VM, we may not get good performance out of those instances, since the underlying VM may be more taxed than other VMs.
>>
>> * Kubernetes has a beta feature, StatefulSets [1], which allows containers to be distributed so that there's one container per underlying machine (as well as a lot of other useful features like easy, stable DNS names.)
>> * Mesos can support this via the built-in UNIQUE constraint [2].
>>
>> Load balancing
>> --------------------
>> Incoming requests from users need to be distributed to the various machines; this is important for many data stores' high availability modes.
>>
>> * Kubernetes supports easily hooking up to an external load balancer when on a cloud (and can be configured to work with a built-in load balancer if not).
>> * Mesos supports this via marathon-lb [3], which is an installable package in DC/OS.
>>
>> Persistent volumes tied to specific instances
>> ------------------------------------------------------------
>> Databases often need persistent state (for example, to store the data :), so it's an important part of running our service.
>>
>> * Kubernetes StatefulSets support this.
>> * Mesos+marathon apps with persistent volumes support this [4] [5].
>>
>> As I mentioned above, I'd like to focus on either kubernetes or mesos for my investigation, and as I go further along, I'm seeing kubernetes as better suited to our needs:
>>
>> (1) It supports more of the features we want out of the box, and with StatefulSets, Kubernetes handles them all together neatly - e.g. DC/OS requires marathon-lb to be installed and mesos-dns to be configured.
>>
>> (2) I'm also finding that there seem to be more examples of using kubernetes to solve the types of problems we're working on. This is somewhat subjective, but in my experience as I've tried to learn both kubernetes and mesos, I personally found it generally easier to get kubernetes running than mesos, due to the tutorials/examples available for kubernetes.
>>
>> (3) Lower cost of initial setup - as I discussed in a previous mail [6], kubernetes was far easier to get set up even when I knew the exact steps. Mesos took me around 27 steps [7], which involved a lot of config that was easy to get wrong (it took me about 5 tries to get all the steps correct in one go.) Kubernetes took me around 8 steps and very little config.
>>
>> Given that, I'd like to focus my investigation/prototyping on Kubernetes. To be clear, it's fairly close, and I think both Mesos and Kubernetes could support what we need, so if we run into issues with kubernetes, Mesos still seems like a viable option that we could fall back to.
>>
>> Thanks,
>> Stephen
>>
>> [1] Kubernetes StatefulSets https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/
>> [2] mesos unique constraint - https://mesosphere.github.io/marathon/docs/constraints.html
>> [3] https://mesosphere.github.io/marathon/docs/service-discovery-load-balancing.html and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/
>> [4] https://mesosphere.github.io/marathon/docs/persistent-volumes.html
>> [5] https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/
>> [6] Container Orchestration software for hosting data stores https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>> [7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md
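On Ismaël's "kill one slave" question above: both schedulers expose this programmatically (Marathon via its REST API, Kubernetes via the API server or kubectl). A rough sketch of the shape this could take on Kubernetes - assuming kubectl is installed and already configured against the test cluster, and with a hypothetical pod name - a test harness could shell out like so, and the orchestrator's restart policy then exercises the IO's reconnect/retry path:

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    /** Sketch only: injects a node failure by deleting one data store pod. */
    public class NodeFailureInjector {

      /** Kills one pod; the orchestrator is expected to restart it. */
      public static void killPod(String podName) throws IOException, InterruptedException {
        // Equivalent to running: kubectl delete pod <podName>
        Process p = new ProcessBuilder("kubectl", "delete", "pod", podName)
            .inheritIO()
            .start();
        if (!p.waitFor(60, TimeUnit.SECONDS) || p.exitValue() != 0) {
          throw new IOException("kubectl delete pod " + podName + " failed");
        }
      }
    }

A slow node would need something extra (e.g. a delaying network proxy in front of the pod), as discussed further down-thread.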
>>
>> On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <[email protected]> wrote:
>>
>>> Just a quick drive-by comment: how tests are laid out has non-trivial tradeoffs on how/where continuous integration runs, and how results are integrated into the tooling. The current state is certainly not ideal (e.g., due to multiple test executions, some links in Jenkins point where they shouldn't), but most other alternatives had even bigger drawbacks at the time. If someone has great ideas that don't explode the number of modules, please share ;-)
>>>
>>> On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot <[email protected]> wrote:
>>>
>>>> Hi Stephen,
>>>>
>>>> Thanks for taking the time to comment.
>>>>
>>>> My comments are below in the email:
>>>>
>>>> Le 24/12/2016 à 00:07, Stephen Sisk a écrit :
>>>>
>>>>> hey Etienne -
>>>>>
>>>>> thanks for your thoughts and thanks for sharing your experiences. I generally agree with what you're saying. Quick comments below:
>>>>>
>>>>>> IT are stored alongside the UT in the src/test directory of the IO, but they might go to a dedicated module, waiting for a consensus
>>>>>
>>>>> I don't have a strong opinion, nor do I feel that I've worked enough with maven to understand all the consequences - I'd love for someone with more maven experience to weigh in. If this becomes blocking, I'd say check it in, and we can refactor later if it proves problematic.
>>>>
>>>> Sure, not a blocking point, it could be refactored afterwards. Just as a reminder, JB mentioned that storing IT in a separate module allows more coherence between all IT (same behavior) and enables cross-IO integration tests. JB, have you experienced any long-term drawbacks of storing IT in a separate module, like, for example, more difficult maintenance due to "distance" from the production code?
>>>>
>>>>>> Also IMHO, it is better that tests load/clean data than make assumptions about the running order of the tests.
>>>>>
>>>>> I definitely agree that we don't want to make assumptions about the running order of the tests - that way lies pain. :) It will be interesting to see how the performance tests work out, since they will need more data (and thus loading data can take much longer.)
>>>>
>>>> Yes, performance testing might push in the direction of loading data from outside the tests, due to loading time.
>>>>
>>>>> This should also be an easier problem for read tests than for write tests - if we have long-running instances, read tests don't really need cleanup. And if write tests only write a small amount of data, as long as we are sure we're writing to uniquely identifiable locations (i.e., a new table per test or something similar), we can clean up the write test data on a slower schedule.
>>>>
>>>> I agree.
>>>>
>>>>>> this will tend to go in the direction of long-running data store instances rather than data store instances started (and optionally loaded) before tests.
>>>>>
>>>>> It may be easiest to start with a "data stores stay running" implementation, and then if we see issues with that, move towards tests that start/stop the data stores on each run. One thing I'd like to make sure of is that we're not manually tweaking the configurations of data stores. One way we could do that is to destroy/recreate the data stores on a slower schedule - maybe once per week.
>>>>> That way if the script is changed or the data store instances are changed, we'd be able to detect it relatively soon, while still removing the need for the tests to manage the data stores.
>>>>
>>>> I agree. In addition to manual configuration tweaking, there might be cases in which a data store re-partitions data during a test or after some tests as the dataset changes. The IO must be tolerant to that, but the asserts in tests (number of bundles, for example) must not fail in that case. I would also prefer, if possible, that the tests do not manage the data stores (not set them up, not start them, not stop them).
>>>>
>>>>> as a general note, I suspect many of the folks in the states will be on holiday until Jan 2nd/3rd.
>>>>>
>>>>> S
>>>>>
>>>>> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Recently we had a discussion about integration tests of IOs. I'm preparing a PR for integration tests of the elasticsearch IO (https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO as a first shot), which are very important IMHO because they helped catch some bugs that UT could not (volume, data store instance sharing, real data store instance...)
>>>>>>
>>>>>> I would like to have your thoughts/remarks about the points below. Some of these points are also discussed here https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a :
>>>>>>
>>>>>> - UT and IT have a similar architecture, but while UT focus on testing the correct behavior of the code, including corner cases, and use embedded in-memory data stores, IT assume that the behavior is correct (strong UT) and focus on higher-volume testing and testing against real data store instance(s).
>>>>>>
>>>>>> - For now, IT are stored alongside the UT in the src/test directory of the IO, but they might go to a dedicated module, waiting for a consensus. Maven is not configured to run them automatically, because the data store is not available on the jenkins server yet.
>>>>>>
>>>>>> - For now, they only use DirectRunner, but they will be run against each runner.
>>>>>>
>>>>>> - IT do not set up the data store instance (as stated in the above document); they assume that one is already running (hardcoded configuration in the test for now, waiting for a common solution to pass configuration to IT). A docker container script is provided in the contrib directory as a starting point for whatever orchestration software will be chosen.
>>>>>>
>>>>>> - IT load and clean test data before and after each test if needed. It is simpler to do so because some tests need an empty data store (write test) and because, as discussed in the document, tests might not be the only users of the data store. Also IMHO, it is better that tests load/clean data than make assumptions about the running order of the tests (a sketch of this pattern follows below).
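As a concrete illustration of that last point, here is a minimal JUnit 4 sketch of the load/clean-per-test pattern. It uses plain JDBC for concreteness (the thread mentions the JDBC IO itests); the connection string, system property, and schema are hypothetical stand-ins for whatever common IT configuration mechanism gets chosen:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    public class JdbcIOIT {
      // Hypothetical: would come from shared IT configuration, not be hardcoded.
      private static final String JDBC_URL =
          System.getProperty("jdbcIT.url", "jdbc:postgresql://localhost:5432/beam_it");
      // Unique per run, so concurrent runs on a shared instance don't collide.
      private static final String TABLE = "beam_it_" + System.nanoTime();

      private Connection connection;

      @Before
      public void loadTestData() throws Exception {
        connection = DriverManager.getConnection(JDBC_URL);
        try (Statement stmt = connection.createStatement()) {
          stmt.execute("CREATE TABLE " + TABLE + " (id INT, name VARCHAR(64))");
        }
        try (PreparedStatement insert =
            connection.prepareStatement("INSERT INTO " + TABLE + " VALUES (?, ?)")) {
          for (int i = 0; i < 1000; i++) {
            insert.setInt(1, i);
            insert.setString(2, "row-" + i);
            insert.addBatch();
          }
          insert.executeBatch();
        }
      }

      @After
      public void cleanTestData() throws Exception {
        // Clean up our own data; no assumptions about other tests or test order.
        try (Statement stmt = connection.createStatement()) {
          stmt.execute("DROP TABLE " + TABLE);
        }
        connection.close();
      }

      @Test
      public void testRead() throws Exception {
        // Run the pipeline against the already-running instance and assert on results.
      }
    }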
>>>>>>
>>>>>> If we generalize this pattern to all IT, this will tend to go in the direction of long-running data store instances rather than data store instances started (and optionally loaded) before the tests.
>>>>>>
>>>>>> Besides, if we were to change our minds and load data from outside the tests, a logstash script is provided.
>>>>>>
>>>>>> If you have any thoughts or remarks, I'm all ears :)
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Etienne
>>>>>>
>>>>>> Le 14/12/2016 à 17:07, Jean-Baptiste Onofré a écrit :
>>>>>>
>>>>>>> Hi Stephen,
>>>>>>>
>>>>>>> the purpose of having a specific module is to share resources, apply the same behavior from the IT perspective, and be able to have "cross-IO" IT (for instance, reading from JMS and sending to Kafka; I think that's the key idea for integration tests).
>>>>>>>
>>>>>>> For instance, in Karaf, we have:
>>>>>>> - utest in each module
>>>>>>> - an itest module containing itests for all modules together
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
>>>>>>>
>>>>>>>> Hi Etienne,
>>>>>>>>
>>>>>>>> thanks for following up and answering my questions.
>>>>>>>>
>>>>>>>> re: where to store integration tests - having them all in a separate module is an interesting idea. I couldn't find JB's comments about moving them into a separate module in the PR - can you share the reasons for doing so? The IO integration/perf tests do seem like they'll need to be treated in a special manner, but given that there is already an IO-specific module, it may just be that we need to treat all the ITs in the IO module the same way. I don't have strong opinions either way right now.
>>>>>>>>
>>>>>>>> S
>>>>>>>>
>>>>>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> @Stephen: I addressed all your comments directly in the PR, thanks! I just wanted to comment here about the docker image I used: the only official Elastic image contains only Elasticsearch. But for testing I needed logstash (for ingestion) and kibana (not for integration tests, but to easily test REST requests to ES using sense). This is why I use an ELK (Elasticsearch+Logstash+Kibana) image. This one is released under the apache 2 license.
>>>>>>>>>
>>>>>>>>> Besides, there is also a point about where to store integration tests: JB proposed in the PR to store integration tests in a dedicated module rather than directly in the IO module (like I did).
>>>>>>>>>
>>>>>>>>> Etienne
>>>>>>>>>
>>>>>>>>> Le 01/12/2016 à 20:14, Stephen Sisk a écrit :
>>>>>>>>>
>>>>>>>>>> hey!
>>>>>>>>>>
>>>>>>>>>> thanks for sending this. I'm very excited to see this change. I added some detail-oriented code review comments in addition to what I've discussed here.
>>>>>>>>>>
>>>>>>>>>> The general goal is to allow for re-usable instantiation of particular data store instances, and this seems like a good start.
>>>>>>>>>> Looks like you also have a script to generate test data for your tests - that's great.
>>>>>>>>>>
>>>>>>>>>> The next steps (definitely not blocking your work) will be to have ways to create instances from the docker images you have here, and use them in the tests. We'll need support in the test framework for that, since it'll be different on developer machines and in the beam jenkins cluster, but your scripts here allow someone running these tests locally to not have to worry about getting the instance set up, and they can manually adjust, so this is a good incremental step.
>>>>>>>>>>
>>>>>>>>>> I have some thoughts now that I'm reviewing your scripts (that I didn't have previously, so we are learning this together):
>>>>>>>>>> * It may be useful to try to document why we chose a particular docker image as the base (i.e., "this is the official supported elasticsearch docker image" or "this image has several data stores together that can be used for a couple of different tests") - I'm curious whether the community thinks that is important.
>>>>>>>>>>
>>>>>>>>>> One thing that I called out in a comment that's worth mentioning on the larger list - if you want to specify which specific runners a test uses, that can be controlled in the pom for the module. I updated the testing doc mentioned previously in this thread with a TODO to talk about this more. I think we should also make it so that IO modules have that automatically, so developers don't have to worry about it.
>>>>>>>>>>
>>>>>>>>>> S
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Stephen,
>>>>>>>>>>>
>>>>>>>>>>> As discussed, I added the injection script, docker container scripts and integration tests to the sdks/java/io/elasticsearch/contrib directory in that PR: https://github.com/apache/incubator-beam/pull/1439 (https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9).
>>>>>>>>>>>
>>>>>>>>>>> These work well, but they are a first shot. Do you have any comments about them?
>>>>>>>>>>>
>>>>>>>>>>> Besides, I am not very sure that these files should be in the IO itself (even in a contrib directory, out of the maven source directories). Any thoughts?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Etienne
>>>>>>>>>>>
>>>>>>>>>>> Le 23/11/2016 à 19:03, Stephen Sisk a écrit :
>>>>>>>>>>>
>>>>>>>>>>>> It's great to hear more experiences.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm also glad to hear that people see real value in the high volume/performance benchmark tests. I tried to capture that in the Testing doc I shared, under "Reasons for Beam Test Strategy" [1].
>>>>>>>>>>>>
>>>>>>>>>>>> It does generally sound like we're in agreement here. Areas of discussion I see:
>>>>>>>>>>>> 1. People like the idea of bringing up fresh instances for each test rather than keeping instances running all the time, since that ensures no contamination between tests. That seems reasonable to me. If we see flakiness in the tests, or we note that setting up/tearing down instances is taking a lot of time, we can revisit that.
>>>>>>>>>>>> 2. Deciding on cluster management/orchestration software - I want to make sure we land on the right tool here, since choosing the wrong tool could result in administration of the instances taking more work. I suspect that's a good place for a follow-up discussion, so I'll start a separate thread on that. I'm happy with whatever tool we choose, but I want to make sure we take a moment to consider different options and have a reason for choosing one.
>>>>>>>>>>>>
>>>>>>>>>>>> Etienne - thanks for being willing to port your creation/other scripts over. You might be a good early tester of whether this system works well for everyone.
>>>>>>>>>>>>
>>>>>>>>>>>> Stephen
>>>>>>>>>>>>
>>>>>>>>>>>> [1] Reasons for Beam Test Strategy - https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I second Etienne there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We worked together on the ElasticsearchIO and definitely, the most valuable tests we did were the integration tests with ES on docker and high volume.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think we have to distinguish two kinds of tests:
>>>>>>>>>>>>> 1. utests are located in the IO itself, and basically they should cover the core behaviors of the IO
>>>>>>>>>>>>> 2. itests are located as contrib in the IO (they could be part of the IO but executed by the integration-test plugin or a specific profile) and deal with a "real" backend and high volumes. The resources required by the itests can be bootstrapped by Jenkins (for instance using Mesos/Marathon and docker images as already discussed; it's what I'm doing on my own "server").
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's basically what Stephen described.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We must not rely only on itests: utests are very important and they validate the core behavior.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My $0.01 ;)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> JB
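JB's utest/itest split can be kept mechanical on the test side as well. A sketch of the gating half, under the assumption (hypothetical here) that ITs read the real backend's address from a system property and skip themselves when it is absent - e.g. when the integration-test plugin or profile isn't active:

    import static org.junit.Assume.assumeNotNull;

    import org.junit.Before;
    import org.junit.Test;

    /**
     * Integration test: runs only when a real backend is available.
     * The "es.it.host" property name is an illustrative convention.
     */
    public class ElasticsearchIOIT {
      private String host;

      @Before
      public void checkBackendConfigured() {
        host = System.getProperty("es.it.host");
        // Skip (rather than fail) when no real instance was provisioned,
        // e.g. in a plain `mvn test` run without the itest profile.
        assumeNotNull(host);
      }

      @Test
      public void testReadHighVolume() {
        // Connect to the real instance at `host` and run the pipeline against a
        // large dataset; utests cover correctness with an embedded instance instead.
      }
    }

Something like `mvn verify -Des.it.host=10.0.0.5` under the itest profile would then enable them.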
>> > >>>>>>>> >> > >>>>>>>> My $0.01 ;) >> > >>>>>>>> >> > >>>>>>>> Regards >> > >>>>>>>> JB >> > >>>>>>>> >> > >>>>>>>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote: >> > >>>>>>>> >> > >>>>>>>>> Hi Stephen, >> > >>>>>>>>> >> > >>>>>>>>> I like your proposition very much and I also agree that >docker >> + >> > >>>>>>>>> some >> > >>>>>>>>> orchestration software would be great ! >> > >>>>>>>>> >> > >>>>>>>>> On the elasticsearchIO (PR to be created this week) there >is >> > docker >> > >>>>>>>>> container creation scripts and logstash data ingestion >script >> for >> > >>>>>>>>> IT >> > >>>>>>>>> environment available in contrib directory alongside with >> > >>>>>>>>> integration >> > >>>>>>>>> tests themselves. I'll be happy to make them compliant to >new >> IT >> > >>>>>>>>> environment. >> > >>>>>>>>> >> > >>>>>>>>> What you say bellow about the need for external IT >environment >> is >> > >>>>>>>>> particularly true. As an example with ES what came out in >first >> > >>>>>>>>> implementation was that there were problems starting at >some >> high >> > >>>>>>>>> >> > >>>>>>>> volume >> > >>>>> >> > >>>>>> of data (timeouts, ES windowing overflow...) that could not >have >> be >> > >>>>>>>>> >> > >>>>>>>> seen >> > >>>>> >> > >>>>>> on embedded ES version. Also there where some >particularities to >> > >>>>>>>>> external instance like secondary (replica) shards that >where >> not >> > >>>>>>>>> >> > >>>>>>>> visible >> > >>>>> >> > >>>>>> on embedded instance. >> > >>>>>>>>> >> > >>>>>>>>> Besides, I also favor bringing up instances before test >because >> > it >> > >>>>>>>>> allows (amongst other things) to be sure to start on a >fresh >> > >>>>>>>>> dataset >> > >>>>>>>>> >> > >>>>>>>> for >> > >>>>> >> > >>>>>> the test to be deterministic. >> > >>>>>>>>> >> > >>>>>>>>> Etienne >> > >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> Le 23/11/2016 à 02:00, Stephen Sisk a écrit : >> > >>>>>>>>> >> > >>>>>>>>>> Hi, >> > >>>>>>>>>> >> > >>>>>>>>>> I'm excited we're getting lots of discussion going. >There are >> > many >> > >>>>>>>>>> threads >> > >>>>>>>>>> of conversation here, we may choose to split some of >them off >> > >>>>>>>>>> into a >> > >>>>>>>>>> different email thread. I'm also betting I missed some >of the >> > >>>>>>>>>> questions in >> > >>>>>>>>>> this thread, so apologies ahead of time for that. Also >> apologies >> > >>>>>>>>>> for >> > >>>>>>>>>> >> > >>>>>>>>> the >> > >>>>>> >> > >>>>>>> amount of text, I provided some quick summaries at the top >of >> each >> > >>>>>>>>>> section. >> > >>>>>>>>>> >> > >>>>>>>>>> Amit - thanks for your thoughts. I've responded in >detail >> below. >> > >>>>>>>>>> Ismael - thanks for offering to help. There's plenty of >work >> > >>>>>>>>>> here to >> > >>>>>>>>>> >> > >>>>>>>>> go >> > >>>>> >> > >>>>>> around. I'll try and think about how we can divide up some >next >> > >>>>>>>>>> steps >> > >>>>>>>>>> (probably in a separate thread.) The main next step I >see is >> > >>>>>>>>>> deciding >> > >>>>>>>>>> between kubernetes/mesos+marathon/docker swarm - I'm >working >> on >> > >>>>>>>>>> that, >> > >>>>>>>>>> >> > >>>>>>>>> but >> > >>>>>>>> >> > >>>>>>>>> having lots of different thoughts on what the >> > >>>>>>>>>> advantages/disadvantages >> > >>>>>>>>>> >> > >>>>>>>>> of >> > >>>>>>>> >> > >>>>>>>>> those are would be helpful (I'm not entirely sure of the >> > >>>>>>>>>> protocol for >> > >>>>>>>>>> collaborating on sub-projects like this.) 
>> > >>>>>>>>>> >> > >>>>>>>>>> These issues are all related to what kind of tests we >want to >> > >>>>>>>>>> write. I >> > >>>>>>>>>> think a kubernetes/mesos/swarm cluster could support all >the >> use >> > >>>>>>>>>> cases >> > >>>>>>>>>> we've discussed here (and thus should not block moving >forward >> > >>>>>>>>>> with >> > >>>>>>>>>> this), >> > >>>>>>>>>> but understanding what we want to test will help us >understand >> > >>>>>>>>>> how the >> > >>>>>>>>>> cluster will be used. I'm working on a proposed user >guide for >> > >>>>>>>>>> testing >> > >>>>>>>>>> >> > >>>>>>>>> IO >> > >>>>>>>> >> > >>>>>>>>> Transforms, and I'm going to send out a link to that + a >short >> > >>>>>>>>>> summary >> > >>>>>>>>>> >> > >>>>>>>>> to >> > >>>>>>>> >> > >>>>>>>>> the list shortly so folks can get a better sense of where >I'm >> > >>>>>>>>>> coming >> > >>>>>>>>>> from. >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> Here's my thinking on the questions we've raised here - >> > >>>>>>>>>> >> > >>>>>>>>>> Embedded versions of data stores for testing >> > >>>>>>>>>> -------------------- >> > >>>>>>>>>> Summary: yes! But we still need real data stores to test >> > against. >> > >>>>>>>>>> >> > >>>>>>>>>> I am a gigantic fan of using embedded versions of the >various >> > data >> > >>>>>>>>>> stores. >> > >>>>>>>>>> I think we should test everything we possibly can using >them, >> > >>>>>>>>>> and do >> > >>>>>>>>>> >> > >>>>>>>>> the >> > >>>>>> >> > >>>>>>> majority of our correctness testing using embedded versions >+ the >> > >>>>>>>>>> >> > >>>>>>>>> direct >> > >>>>>> >> > >>>>>>> runner. However, it's also important to have at least one >test >> that >> > >>>>>>>>>> actually connects to an actual instance, so we can get >> coverage >> > >>>>>>>>>> for >> > >>>>>>>>>> things >> > >>>>>>>>>> like credentials, real connection strings, etc... >> > >>>>>>>>>> >> > >>>>>>>>>> The key point is that embedded versions definitely can't >cover >> > the >> > >>>>>>>>>> performance tests, so we need to host instances if we >want to >> > test >> > >>>>>>>>>> >> > >>>>>>>>> that. >> > >>>>>> >> > >>>>>>> I consider the integration tests/performance benchmarks to >be >> > >>>>>>>>>> costly >> > >>>>>>>>>> things >> > >>>>>>>>>> that we do only for the IO transforms with large amounts >of >> > >>>>>>>>>> community >> > >>>>>>>>>> support/usage. A random IO transform used by a few users >> doesn't >> > >>>>>>>>>> necessarily need integration & perf tests, but for >heavily >> used >> > IO >> > >>>>>>>>>> transforms, there's a lot of community value in these >tests. >> The >> > >>>>>>>>>> maintenance proposal below scales with the amount of >community >> > >>>>>>>>>> support >> > >>>>>>>>>> for >> > >>>>>>>>>> a particular IO transform. >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> Reusing data stores ("use the data stores across >executions.") >> > >>>>>>>>>> ------------------ >> > >>>>>>>>>> Summary: I favor a hybrid approach: some frequently >used, very >> > >>>>>>>>>> small >> > >>>>>>>>>> instances that we keep up all the time + larger >> multi-container >> > >>>>>>>>>> data >> > >>>>>>>>>> store >> > >>>>>>>>>> instances that we spin up for perf tests. >> > >>>>>>>>>> >> > >>>>>>>>>> I don't think we need to have a strong answer to this >> question, >> > >>>>>>>>>> but I >> > >>>>>>>>>> think >> > >>>>>>>>>> we do need to know what range of capabilities we need, >and use >> > >>>>>>>>>> that to >> > >>>>>>>>>> inform our requirements on the hosting infrastructure. 
>>>>>>>>>>>>>>> I think kubernetes/mesos + docker can support all the scenarios I discuss below.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I had been thinking of a hybrid approach - reuse some instances and don't reuse others. Some tests require isolation from other tests (e.g. performance benchmarking), while others can easily re-use the same database/data store instance over time, provided they are written in the correct manner (e.g. simple read or write correctness integration tests).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To me, the question of whether to use one instance over time for a test vs. spin up an instance for each test comes down to a trade-off between these factors:
>>>>>>>>>>>>>>> 1. Flakiness of spin-up of an instance - if it's super flaky, we'll want to keep more instances up and running rather than bring them up/down. (This may also vary by the data store in question.)
>>>>>>>>>>>>>>> 2. Frequency of testing - if we are running tests every 5 minutes, it may be wasteful to bring machines up/down every time. If we run tests once a day or week, it seems wasteful to keep the machines up the whole time.
>>>>>>>>>>>>>>> 3. Isolation requirements - if tests must be isolated, it means we either have to bring up the instances for each test, or we have to have some sort of signaling mechanism to indicate that a given instance is in use. I strongly favor bringing up an instance per test.
>>>>>>>>>>>>>>> 4. Number/size of containers - if we need a large number of machines for a particular test, keeping them running all the time will use more resources.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The major unknown to me is how flaky it'll be to spin these up. I'm hopeful/assuming they'll be pretty stable to bring up, but I think the best way to test that is to start doing it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I suspect the sweet spot is the following: have a set of very small data store instances that stay up to support small-data-size post-commit end-to-end tests (post-commits run frequently, and the data size means the instances would not use many resources), combined with the ability to spin up larger instances for once-a-day/week performance benchmarks (these use more resources and are used less frequently.) That's the mix I'll propose in my docs on testing IO transforms.
>>>>>>>>>>>>>>> If spinning up new instances is cheap/non-flaky, I'd be fine with the idea of spinning up instances for each test.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Management ("what's the overhead of managing such a deployment")
>>>>>>>>>>>>>>> --------------------
>>>>>>>>>>>>>>> Summary: I propose that anyone can contribute scripts for setting up data store instances + integration/perf tests, but if the community doesn't maintain a particular data store's tests, we disable the tests and turn off the data store instances.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Management of these instances is a crucial question. First, let's break down what tasks we'll need to do on a recurring basis:
>>>>>>>>>>>>>>> 1. Ongoing maintenance (update to new versions, both instance & dependencies) - we don't want to have a lot of old versions vulnerable to attacks/buggy
>>>>>>>>>>>>>>> 2. Investigate breakages/regressions
>>>>>>>>>>>>>>> (I'm betting there will be more things we'll discover - let me know if you have suggestions.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are a couple of goals I see:
>>>>>>>>>>>>>>> 1. We should only do sysadmin work for things that give us a lot of benefit (i.e., don't build IT/perf/data store setup scripts for data stores without a large community).
>>>>>>>>>>>>>>> 2. We should do as much as possible of the testing via in-memory/embedded testing (as you brought up).
>>>>>>>>>>>>>>> 3. Reduce the amount of manual administration overhead.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As I discussed above, I think that integration tests/performance benchmarks are costly things that we should do only for the IO transforms with large amounts of community support/usage. Thus, I propose that we limit the IO transforms that get integration tests & performance benchmarks to those that have community support for maintaining the data store instances.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We can enforce this organically using some simple rules:
>>>>>>>>>>>>>>> 1. Investigating breakages/regressions: if a given integration/perf test starts failing and no one investigates it within a set period of time (a week?), we disable the tests and shut off the data store instances if we have instances running. When someone wants to step up and support it again, they can fix the test, check it in, and re-enable the test.
>>>>>>>>>>>>>>> 2. Ongoing maintenance: every N months, file a jira issue that is just "is the IO Transform X data store up to date?" - if the jira is not resolved in a set period of time (1 month?), the perf/integration tests are disabled and the data store instances shut off.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is pretty flexible -
>>>>>>>>>>>>>>> * If a particular person or organization wants to support an IO transform, they can. If a group of people all organically organize to keep the tests running, they can.
>>>>>>>>>>>>>>> * It can be mostly automated - there's not a lot of central organizing work that needs to be done.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Exposing the information about which IO transforms currently have running IT/perf benchmarks on the website will let users know which IO transforms are well supported.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I like this solution, but I also recognize this is a tricky problem. This is something the community needs to be supportive of, so I'm open to other thoughts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Simulating failures in real nodes ("programmatic tests to simulate failure")
>>>>>>>>>>>>>>> -----------------
>>>>>>>>>>>>>>> Summary: 1) Focus our testing on the code in Beam. 2) We should encourage a design pattern separating out network/retry logic from the main IO transform logic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We *could* create instance failure in any container management software - we can use their programmatic APIs to determine which containers are running the instances, and ask them to kill the container in question. A slow node would be trickier, but I'm sure we could figure it out - for example, add a network proxy that would delay responses.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, I would argue that this type of testing doesn't gain us a lot, and is complicated to set up. I think it will be easier to test network errors and retry behavior in unit tests for the IO transforms.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Part of the way to handle this is to separate the read code from the network code (e.g. bigtable has BigtableService). If you put the "handle errors/retry logic" code in a separate MySourceService class, you can test MySourceService against a wide variety of network errors/data store problems, and then your main IO transform tests focus on the read behavior and handling the small set of errors the MySourceService class will return.
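A minimal sketch of that separation, keeping the hypothetical MySourceService name from the message (BigtableService is the real precedent; the rest is illustrative):

    import java.io.IOException;
    import java.util.Iterator;

    /**
     * Hypothetical service interface: all connection, timeout and retry
     * logic lives behind it, mirroring the BigtableService pattern.
     */
    interface MySourceService {
      Iterator<String> read(String query) throws IOException;
    }

    /** Production implementation: real client, retries, backoff, etc. */
    class RealSourceService implements MySourceService {
      @Override
      public Iterator<String> read(String query) throws IOException {
        // Real network handling goes here, and is unit-tested separately
        // against simulated network failures.
        throw new UnsupportedOperationException("sketch only");
      }
    }

    /** Test fake: injects one of the few errors the service can surface. */
    class FailingSourceService implements MySourceService {
      @Override
      public Iterator<String> read(String query) throws IOException {
        throw new IOException("simulated: data store unreachable after retries");
      }
    }

The IO transform would take a MySourceService (or a factory for one), so its tests can swap in FailingSourceService without touching the network.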
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also think we should focus on testing the IO Transform, not the data store - if we kill a node in a data store, it's that data store's problem, not beam's problem. As you were pointing out, there is a *large* number of possible ways that a particular data store can fail, and we would like to support many different data stores. Rather than trying to test that each data store behaves well, we should ensure that we handle generic/expected errors in a graceful manner.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ismaël had a couple of other quick comments/questions; I'll answer here -
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We can use this to test other runners running on multiple machines - I agree. This is also necessary for a good performance benchmark test.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "providing the test machines to mount the cluster" - we can discuss this further, but one possible option is that google may be willing to donate something to support this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "IO Consistency" - let's follow up on those questions in another thread. That's as much about the public interface we provide to users as anything else. I agree with your sentiment that a user should be able to expect predictable behavior from the different IO transforms.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for everyone's questions/comments - I really am excited to see that people care about this :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Stephen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Stephen Thanks for your proposal, it is really interesting; I would really like to help with this. I have never played with Kubernetes, but this seems a really nice chance to do something useful with it.
>> > >>>>>>>>>>> >> > >>>>>>>>>>> We (at Talend) are testing most of the IOs using simple >> > container >> > >>>>>>>>>>> >> > >>>>>>>>>> images >> > >>>>>>>> >> > >>>>>>>>> and in some particular cases ‘clusters’ of containers >using >> > >>>>>>>>>>> docker-compose >> > >>>>>>>>>>> (a little bit like Amit’s (2) proposal). It would be >really >> > >>>>>>>>>>> nice to >> > >>>>>>>>>>> >> > >>>>>>>>>> have >> > >>>>>>>> >> > >>>>>>>>> this at the Beam level, in particular to try to test more >> complex >> > >>>>>>>>>>> semantics, I don’t know how programmable kubernetes is >to >> > achieve >> > >>>>>>>>>>> this for >> > >>>>>>>>>>> example: >> > >>>>>>>>>>> >> > >>>>>>>>>>> Let’s think we have a cluster of Cassandra or Kafka >nodes, I >> > >>>>>>>>>>> would >> > >>>>>>>>>>> like to >> > >>>>>>>>>>> have programmatic tests to simulate failure (e.g. kill >a >> node), >> > >>>>>>>>>>> or >> > >>>>>>>>>>> simulate >> > >>>>>>>>>>> a really slow node, to ensure that the IO behaves as >expected >> > >>>>>>>>>>> in the >> > >>>>>>>>>>> Beam >> > >>>>>>>>>>> pipeline for the given runner. >> > >>>>>>>>>>> >> > >>>>>>>>>>> Another related idea is to improve IO consistency: >Today the >> > >>>>>>>>>>> different IOs >> > >>>>>>>>>>> have small differences in their failure behavior, I >really >> > >>>>>>>>>>> would like >> > >>>>>>>>>>> to be >> > >>>>>>>>>>> able to predict with more precision what will happen in >case >> of >> > >>>>>>>>>>> >> > >>>>>>>>>> errors, >> > >>>>>> >> > >>>>>>> e.g. what is the correct behavior if I am writing to a >Kafka >> > >>>>>>>>>>> node and >> > >>>>>>>>>>> there >> > >>>>>>>>>>> is a network partition, does the Kafka sink retries or >no ? >> and >> > >>>>>>>>>>> what >> > >>>>>>>>>>> if it >> > >>>>>>>>>>> is the JdbcIO ?, will it work the same e.g. assuming >> > >>>>>>>>>>> checkpointing? >> > >>>>>>>>>>> Or do >> > >>>>>>>>>>> we guarantee exactly once writes somehow?, today I am >not >> sure >> > >>>>>>>>>>> about >> > >>>>>>>>>>> what >> > >>>>>>>>>>> happens (or if the expected behavior depends on the >runner), >> > >>>>>>>>>>> but well >> > >>>>>>>>>>> maybe >> > >>>>>>>>>>> it is just that I don’t know and we have tests to >ensure >> this. >> > >>>>>>>>>>> >> > >>>>>>>>>>> Of course both are really hard problems, but I think >with >> your >> > >>>>>>>>>>> proposal we >> > >>>>>>>>>>> can try to tackle them, as well as the performance >ones. And >> > >>>>>>>>>>> apart of >> > >>>>>>>>>>> the >> > >>>>>>>>>>> data stores, I think it will be also really nice to be >able >> to >> > >>>>>>>>>>> test >> > >>>>>>>>>>> >> > >>>>>>>>>> the >> > >>>>>> >> > >>>>>>> runners in a distributed manner. >> > >>>>>>>>>>> >> > >>>>>>>>>>> So what is the next step? How do you imagine such >integration >> > >>>>>>>>>>> tests? >> > >>>>>>>>>>> ? Who >> > >>>>>>>>>>> can provide the test machines so we can mount the >cluster? >> > >>>>>>>>>>> >> > >>>>>>>>>>> Maybe my ideas are a bit too far away for an initial >setup, >> but >> > >>>>>>>>>>> it >> > >>>>>>>>>>> will be >> > >>>>>>>>>>> really nice to start working on this. >> > >>>>>>>>>>> >> > >>>>>>>>>>> Ismael >> > >>>>>>>>>>> >> > >>>>>>>>>>> >> > >>>>>>>>>>> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela < >> > >>>>>>>>>>> [email protected] >> > >>>>>>>>>>> wrote: >> > >>>>>>>>>>> >> > >>>>>>>>>>> Hi Stephen, >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> I was wondering about how we plan to use the data >stores >> > across >> > >>>>>>>>>>>> >> > >>>>>>>>>>> executions. 
>> > >>>>>>>>>>> >> > >>>>>>>>>>>> Clearly, it's best to setup a new instance (container) >for >> > every >> > >>>>>>>>>>>> >> > >>>>>>>>>>> test, >> > >>>>>> >> > >>>>>>> running a "standalone" store (say HBase/Cassandra for >> > >>>>>>>>>>>> example), and >> > >>>>>>>>>>>> once >> > >>>>>>>>>>>> the test is done, teardown the instance. It should >also be >> > >>>>>>>>>>>> agnostic >> > >>>>>>>>>>>> >> > >>>>>>>>>>> to >> > >>>>>> >> > >>>>>>> the >> > >>>>>>>>>>> >> > >>>>>>>>>>>> runtime environment (e.g., Docker on Kubernetes). >> > >>>>>>>>>>>> I'm wondering though what's the overhead of managing >such a >> > >>>>>>>>>>>> >> > >>>>>>>>>>> deployment >> > >>>>>> >> > >>>>>>> which could become heavy and complicated as more IOs are >> > >>>>>>>>>>>> supported >> > >>>>>>>>>>>> >> > >>>>>>>>>>> and >> > >>>>>> >> > >>>>>>> more >> > >>>>>>>>>>> >> > >>>>>>>>>>>> test cases introduced. >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> Another way to go would be to have small clusters of >> different >> > >>>>>>>>>>>> data >> > >>>>>>>>>>>> >> > >>>>>>>>>>> stores >> > >>>>>>>>>>> >> > >>>>>>>>>>>> and run against new "namespaces" (while lazily >evicting old >> > >>>>>>>>>>>> ones), >> > >>>>>>>>>>>> but I >> > >>>>>>>>>>>> think this is less likely as maintaining a distributed >> > instance >> > >>>>>>>>>>>> >> > >>>>>>>>>>> (even >> > >>>>> >> > >>>>>> a >> > >>>>>>>> >> > >>>>>>>>> small one) for each data store sounds even more complex. >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> A third approach would be to to simply have an >"embedded" >> > >>>>>>>>>>>> in-memory >> > >>>>>>>>>>>> instance of a data store as part of a test that runs >against >> > it >> > >>>>>>>>>>>> (such as >> > >>>>>>>>>>>> >> > >>>>>>>>>>> an >> > >>>>>>>>>>> >> > >>>>>>>>>>>> embedded Kafka, though not a data store). >> > >>>>>>>>>>>> This is probably the simplest solution in terms of >> > >>>>>>>>>>>> orchestration, >> > >>>>>>>>>>>> but it >> > >>>>>>>>>>>> depends on having a proper "embedded" implementation >for an >> > IO. >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> Does this make sense to you ? have you considered it ? >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> Thanks, >> > >>>>>>>>>>>> Amit >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré < >> > >>>>>>>>>>>> >> > >>>>>>>>>>> [email protected] >> > >>>>> >> > >>>>>> wrote: >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> Hi Stephen, >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> as already discussed a bit together, it sounds great >! >> > >>>>>>>>>>>>> Especially I >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>> like >> > >>>>>>>>>>> >> > >>>>>>>>>>>> it as a both integration test platform and good >coverage for >> > >>>>>>>>>>>>> IOs. >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> I'm very late on this but, as said, I will share with >you >> my >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>> Marathon >> > >>>>>> >> > >>>>>>> JSON and Mesos docker images. >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> By the way, I started to experiment a bit kubernetes >and >> > >>>>>>>>>>>>> swamp but >> > >>>>>>>>>>>>> it's >> > >>>>>>>>>>>>> not yet complete. I will share what I have on the >same >> github >> > >>>>>>>>>>>>> repo. >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> Thanks ! >> > >>>>>>>>>>>>> Regards >> > >>>>>>>>>>>>> JB >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> On 11/16/2016 11:36 PM, Stephen Sisk wrote: >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Hi everyone! 
>> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Currently we have a good set of unit tests for our >IO >> > >>>>>>>>>>>>>> Transforms - >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> those >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> tend to run against in-memory versions of the data >stores. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> However, >> > >>>>> >> > >>>>>> we'd >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> like to further increase our test coverage to include >> > >>>>>>>>>>>>>> running them >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> against >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> real instances of the data stores that the IO >Transforms >> > work >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> against >> > >>>>>>>> >> > >>>>>>>>> (e.g. >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> cassandra, mongodb, kafka, etc…), which means we'll >need >> to >> > >>>>>>>>>>>>>> have >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> real >> > >>>>>>>> >> > >>>>>>>>> instances of various data stores. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Additionally, if we want to do performance >regression >> > >>>>>>>>>>>>>> detection, >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> it's >> > >>>>>>>> >> > >>>>>>>>> important to have instances of the services that behave >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> realistically, >> > >>>>>>>>>>> >> > >>>>>>>>>>>> which isn't true of in-memory or dev versions of the >> services. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Proposed solution >> > >>>>>>>>>>>>>> ------------------------- >> > >>>>>>>>>>>>>> If we accept this proposal, we would create an >> > >>>>>>>>>>>>>> infrastructure for >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> running >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> real instances of data stores inside of containers, >using >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> container >> > >>>>> >> > >>>>>> management software like mesos/marathon, kubernetes, docker >> > >>>>>>>>>>>>>> swarm, >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> etc… >> > >>>>>>>>>>> >> > >>>>>>>>>>>> to >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> manage the instances. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> This would enable us to build integration tests that >run >> > >>>>>>>>>>>>>> against >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> those >> > >>>>>>>>>>> >> > >>>>>>>>>>>> real >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> instances and performance tests that run against >those >> real >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> instances >> > >>>>>>>> >> > >>>>>>>>> (like >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> those that Jason Kuster is proposing elsewhere.) >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Why do we need one centralized set of instances vs >just >> > having >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> various >> > >>>>>>>>>>> >> > >>>>>>>>>>>> people host their own instances? >> > >>>>>>>>>>>>>> ------------------------- >> > >>>>>>>>>>>>>> Reducing flakiness of tests is key. By not having >> > dependencies >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> from >> > >>>>> >> > >>>>>> the >> > >>>>>>>>>>> >> > >>>>>>>>>>>> core project on external services/instances of data >stores >> > >>>>>>>>>>>>>> we have >> > >>>>>>>>>>>>>> guaranteed access to the services and the group can >fix >> > issues >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> that >> > >>>>> >> > >>>>>> arise. 
>> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> An exception would be something that has an ops team >> > >>>>>>>>>>>>>> supporting it >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> (eg, >> > >>>>>>>>>>> >> > >>>>>>>>>>>> AWS, Google Cloud or other professionally managed >service) - >> > >>>>>>>>>>>>>> those >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> we >> > >>>>>>>> >> > >>>>>>>>> trust >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> will be stable. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> There may be a lot of different data stores needed - >how >> > >>>>>>>>>>>>>> will we >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> maintain >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> them? >> > >>>>>>>>>>>>>> ------------------------- >> > >>>>>>>>>>>>>> It will take work above and beyond that of a normal >set of >> > >>>>>>>>>>>>>> unit >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> tests >> > >>>>>>>> >> > >>>>>>>>> to >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> build and maintain integration/performance tests & >their >> data >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> store >> > >>>>> >> > >>>>>> instances. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Setup & maintenance of the data store containers and >data >> > >>>>>>>>>>>>>> store >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> instances >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> on it must be automated. It also has to be as simple >of a >> > >>>>>>>>>>>>>> setup as >> > >>>>>>>>>>>>>> possible, and we should avoid hand tweaking the >> containers - >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> expecting >> > >>>>>>>>>>> >> > >>>>>>>>>>>> checked in scripts/dockerfiles is key. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Aligned with the community ownership approach of >Apache, >> as >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> members >> > >>>>> >> > >>>>>> of >> > >>>>>>>>>>> >> > >>>>>>>>>>>> the >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> community are excited to contribute & maintain those >tests >> > >>>>>>>>>>>>>> and the >> > >>>>>>>>>>>>>> integration/performance tests, people will be able >to step >> > >>>>>>>>>>>>>> up and >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> do >> > >>>>>> >> > >>>>>>> that. >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> If there is no longer support for maintaining a >particular >> > >>>>>>>>>>>>>> set of >> > >>>>>>>>>>>>>> integration & performance tests and their data store >> > >>>>>>>>>>>>>> instances, >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> then >> > >>>>>> >> > >>>>>>> we >> > >>>>>>>>>>> >> > >>>>>>>>>>>> can >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> disable those tests. We may document on the website >what >> IO >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> Transforms >> > >>>>>>>>>>> >> > >>>>>>>>>>>> have >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>>> current integration/performance tests so users know >what >> > >>>>>>>>>>>>>> level of >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> testing >> > >>>>>>>>>>>> >> > >>>>>>>>>>>>> the various IO Transforms have. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> What about requirements for the container management >> > software >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> itself? >> > >>>>>>>> >> > >>>>>>>>> ------------------------- >> > >>>>>>>>>>>>>> * We should have the data store instances themselves >in >> > >>>>>>>>>>>>>> Docker. >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> Docker >> > >>>>>>>>>>> >> > >>>>>>>>>>>> allows new instances to be spun up in a quick, >reproducible >> > way >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>> and >> > >>>>> >> > >>>>>> is >> > >>>>>>>>>>> >> > >>>>>>>>>>>> fairly platform independent. 
>>>>>>>>>>>>>>>>>>> It has wide support from a variety of different container management services.
>>>>>>>>>>>>>>>>>>> * As little admin work required as possible. Crashed instances should be restarted, setup should be simple, and everything possible should be scripted/scriptable.
>>>>>>>>>>>>>>>>>>> * Logs and test output should be on a publicly available website, without needing to log into the test execution machine. Centralized capture of monitoring info/logs from instances running in the containers would support this. Ideally, this would just be supported by the container software out of the box.
>>>>>>>>>>>>>>>>>>> * It'd be useful to have good persistent volumes in the container management software, so that databases don't have to reload large data sets every time.
>>>>>>>>>>>>>>>>>>> * The containers may be a place to execute runners themselves if we need larger runner instances, so it should play well with Spark, Flink, etc…
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As I discussed earlier on the mailing list, it looks like hosting docker containers on kubernetes, docker swarm or mesos+marathon would be a good solution.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Stephen Sisk
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>>>>>>>>>> Talend - http://www.talend.com
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>>>>> Talend - http://www.talend.com
