Hi Ishmael,

These are good questions; thanks for raising them.

Ability to modify network/compute resources to simulate failures
================================================================
I see two real questions here:
1. Is this something we want to do?
2. Is it possible with both/either?

So far, the test strategy I've been advocating is that we test failure
scenarios like this in unit tests rather than in ITs/perf tests;
otherwise, it's hard to re-create the same conditions reliably.

I can investigate whether it's possible, but first I want to clarify
whether this is something we care about. I know both support killing
individual nodes. I haven't seen much network-level control in either,
but I haven't looked for it specifically.
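To illustrate the unit-test approach, here's a minimal sketch (all the
names are hypothetical, not existing Beam classes): a fake service that
fails deterministically lets us re-create the exact failure sequence on
every run, which a real cluster can't guarantee.

    import static org.junit.Assert.assertEquals;

    import java.io.IOException;
    import org.junit.Test;

    public class ReaderRetryTest {

      /** The seam the reader talks through (hypothetical). */
      interface RecordService {
        String fetchRecord() throws IOException;
      }

      /** Fake that fails a fixed number of times, then succeeds. */
      static class FlakyService implements RecordService {
        private int failuresLeft;

        FlakyService(int failures) {
          this.failuresLeft = failures;
        }

        @Override
        public String fetchRecord() throws IOException {
          if (failuresLeft-- > 0) {
            throw new IOException("simulated network failure");
          }
          return "record";
        }
      }

      /** Stand-in for an IO's retry logic (the code under test). */
      static String readWithRetries(RecordService service, int attempts)
          throws IOException {
        IOException last = null;
        for (int i = 0; i < attempts; i++) {
          try {
            return service.fetchRecord();
          } catch (IOException e) {
            last = e;
          }
        }
        throw last;
      }

      @Test
      public void retriesTransientFailures() throws IOException {
        // Two simulated failures, three attempts: the read should succeed.
        assertEquals("record", readWithRetries(new FlakyService(2), 3));
      }
    }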

Availability of ready-to-play packages
======================================
I did look at this, and as far as I could tell, Mesos didn't have any
pre-built packages for multi-node clusters of data stores. If there's a
good repository of them that we trust, that would definitely save us
time. Can you point me at the Mesos repository?

S



On Wed, Jan 18, 2017 at 8:37 AM Jean-Baptiste Onofré <[email protected]>
wrote:

> Hi Ismaël
>
> Stephen will reply with details, but I know he did a comparison and
> evaluated different options.
>
> He tested with the JDBC IO itests.
>
> Regards
> JB
>
> On Jan 18, 2017, at 08:26, "Ismaël Mejía" <[email protected]>
> wrote:
> >Thanks for your analysis Stephen, good arguments / references.
> >
> >One quick question: have you checked the APIs of both
> >(Mesos/Kubernetes) to see if we can programmatically do more complex
> >tests? (I suppose so, but you don't mention how easy they are, or
> >whether they are possible.) For example, to simulate a slow networking
> >slave (to test stragglers), or to arbitrarily kill one slave (e.g. if I
> >want to test the correct behavior of a runner/IO that is reading from
> >it)?
> >
> >Another missing point in the review is the availability of
> >ready-to-play packages; I think in this area mesos/dcos seems more
> >advanced, no? I haven't looked recently, but at least 6 months ago
> >there were not many helm packages ready, for example, to test kafka or
> >the hadoop ecosystem stuff (hdfs, hbase, etc). Has this been improved?
> >Preparing this is also a considerable amount of work; on the other
> >hand, this could also be a chance to contribute to kubernetes.
> >
> >Regards,
> >Ismaël
> >
> >
> >
> >On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <[email protected]>
> >wrote:
> >
> >> hi!
> >>
> >> I've been continuing this investigation, and have some more info to
> >report,
> >> and hopefully we can start making some decisions.
> >>
> >> To support performance testing, I've been investigating
> >mesos+marathon and
> >> kubernetes for running data stores in their high availability mode. I
> >have
> >> been examining features that kubernetes/mesos+marathon use to support
> >this.
> >>
> >> Setting up a multi-node cluster in a high availability mode tends to
> >be
> >> more expensive time-wise than the single node instances I've played
> >around
> >> with in the past. Rather than do a full build out with both
> >kubernetes and
> >> mesos, I'd like to pick one of the two options to build the prototype
> >> cluster with. If the prototype doesn't go well, we could still go
> >back to
> >> the other option, but I'd like to change us from a mode of "let's
> >> look at all the options" to one of "here's the favorite, let's prove
> >> that it works for us".
> >>
> >> Below are the features that I've seen are important to multi-node
> >instances
> >> of data stores. I'm sure other folks on the list have done this
> >before, so
> >> feel free to pipe up if I'm missing a good solution to a problem.
> >>
> >> DNS/Discovery
> >>
> >> --------------------
> >>
> >> Necessary for talking between nodes (eg, cassandra nodes all need to
> >be
> >> able to talk to a set of seed nodes.)
> >>
> >> * Kubernetes has built-in DNS/discovery between nodes.
> >>
> >> * Mesos supports this via mesos-dns, which isn't a part of core
> >> mesos, but is in dcos, which is the mesos distribution I've been
> >> using and that I would expect us to use.
> >>
> >> Instances properly distributed across nodes
> >>
> >> ------------------------------------------------------------
> >>
> >> If multiple instances of a data source end up on the same underlying
> >VM, we
> >> may not get good performance out of those instances since the
> >underlying VM
> >> may be more taxed than other VMs.
> >>
> >> * Kubernetes has a beta feature, StatefulSets [1], which allows
> >> containers to be distributed so that there's one container per
> >> underlying machine (as well as a lot of other useful features like
> >> easy, stable dns names.)
> >>
> >> * Mesos can support this via the built in UNIQUE constraint [2]
> >>
> >> Load balancing
> >>
> >> --------------------
> >>
> >> Incoming requests from users need to be distributed to the various
> >machines
> >> - this is important for many data stores' high availability modes.
> >>
> >> * Kubernetes supports easily hooking up to an external load balancer
> >when
> >> on a cloud (and can be configured to work with a built-in load
> >balancer if
> >> not)
> >>
> >> * Mesos supports this via marathon-lb [3], which is an install-able
> >package
> >> in DC/OS
> >>
> >> Persistent Volumes tied to specific instances
> >>
> >> ------------------------------------------------------------
> >>
> >> Databases often need persistent state (for example to store the data
> >:), so
> >> it's an important part of running our service.
> >>
> >> * Kubernetes StatefulSets supports this
> >>
> >> * Mesos+marathon apps with persistent volumes support this [4] [5]
> >>
> >> As I mentioned above, I'd like to focus on either kubernetes or mesos
> >for
> >> my investigation, and as I go further along, I'm seeing kubernetes as
> >> better suited to our needs.
> >>
> >> (1) It supports more of the features we want out of the box, and with
> >> StatefulSets, Kubernetes handles them all together neatly - by
> >> contrast, DC/OS requires marathon-lb to be installed and mesos-dns to
> >> be configured.
> >>
> >> (2) I'm also finding that there seem to be more examples of using
> >> kubernetes to solve the types of problems we're working on. This is
> >> somewhat subjective, but in my experience as I've tried to learn both
> >> kubernetes and mesos, I personally found it generally easier to get
> >> kubernetes running than mesos due to the tutorials/examples available
> >for
> >> kubernetes.
> >>
> >> (3) Lower cost of initial setup - as I discussed in a previous
> >mail[6],
> >> kubernetes was far easier to get set up even when I knew the exact
> >steps.
> >> Mesos took me around 27 steps [7], which involved a lot of config
> >that was
> >> easy to get wrong (it took me about 5 tries to get all the steps
> >correct in
> >> one go.) Kubernetes took me around 8 steps and very little config.
> >>
> >> Given that, I'd like to focus my investigation/prototyping on
> >Kubernetes.
> >> To
> >> be clear, it's fairly close and I think both Mesos and Kubernetes
> >could
> >> support what we need, so if we run into issues with kubernetes, Mesos
> >still
> >> seems like a viable option that we could fall back to.
> >>
> >> Thanks,
> >> Stephen
> >>
> >>
> >> [1] Kubernetes StatefulSets
> >>
> >
> https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/
> >>
> >> [2] mesos unique constraint -
> >> https://mesosphere.github.io/marathon/docs/constraints.html
> >>
> >> [3]
> >> https://mesosphere.github.io/marathon/docs/service-
> >> discovery-load-balancing.html
> >>  and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/
> >>
> >> [4]
> >https://mesosphere.github.io/marathon/docs/persistent-volumes.html
> >>
> >> [5]
> >https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/
> >>
> >> [6] Container Orchestration software for hosting data stores
> >> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0
> >> e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
> >>
> >> [7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md
> >>
> >>
> >> On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <[email protected]>
> >wrote:
> >>
> >> > Just a quick drive-by comment: how tests are laid out has
> >non-trivial
> >> > tradeoffs on how/where continuous integration runs, and how results
> >are
> >> > integrated into the tooling. The current state is certainly not
> >ideal
> >> > (e.g., due to multiple test executions some links in Jenkins point
> >where
> >> > they shouldn't), but most other alternatives had even bigger
> >drawbacks at
> >> > the time. If someone has great ideas that don't explode the number
> >of
> >> > modules, please share ;-)
> >> >
> >> > On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot
> ><[email protected]>
> >> > wrote:
> >> >
> >> > > Hi Stephen,
> >> > >
> >> > > Thanks for taking the time to comment.
> >> > >
> >> > > My comments are below in the email:
> >> > >
> >> > >
> >> > > Le 24/12/2016 à 00:07, Stephen Sisk a écrit :
> >> > >
> >> > >> hey Etienne -
> >> > >>
> >> > >> thanks for your thoughts and thanks for sharing your
> >experiences. I
> >> > >> generally agree with what you're saying. Quick comments below:
> >> > >>
> >> > >> IT are stored alongside with UT in src/test directory of the IO
> >but
> >> they
> >> > >>>
> >> > >> might go to dedicated module, waiting for a consensus
> >> > >> I don't have a strong opinion or feel that I've worked enough
> >with
> >> maven
> >> > >> to
> >> > >> understand all the consequences - I'd love for someone with more
> >maven
> >> > >> experience to weigh in. If this becomes blocking, I'd say check
> >it in,
> >> > and
> >> > >> we can refactor later if it proves problematic.
> >> > >>
> >> > > Sure, not a blocking point; it could be refactored afterwards. Just
> >> > > as a reminder, JB mentioned that storing IT in a separate module
> >> > > allows us to have more coherence between all IT (same behavior) and
> >> > > to do cross-IO integration tests. JB, have you experienced any
> >> > > long-term drawbacks of storing IT in a separate module, like, for
> >> > > example, more difficult maintenance due to "distance" from the
> >> > > production code?
> >> > >
> >> > >
> >> > >>   Also IMHO, it is better that tests load/clean data than doing
> >some
> >> > >>>
> >> > >> assumptions about the running order of the tests.
> >> > >> I definitely agree that we don't want to make assumptions about
> >the
> >> > >> running
> >> > >> order of the tests - that way lies pain. :) It will be
> >interesting to
> >> > see
> >> > >> how the performance tests work out since they will need more
> >data (and
> >> > >> thus
> >> > >> loading data can take much longer.)
> >> > >>
> >> > > Yes, performance testing might push in the direction of data
> >loading
> >> from
> >> > > outside the tests due to loading time.
> >> > >
> >> > >>   This should also be an easier problem
> >> > >> for read tests than for write tests - if we have long running
> >> instances,
> >> > >> read tests don't really need cleanup. And if write tests only
> >write a
> >> > >> small
> >> > >> amount of data, as long as we are sure we're writing to uniquely
> >> > >> identifiable locations (ie, new table per test or something
> >similar),
> >> we
> >> > >> can clean up the write test data on a slower schedule.
> >> > >>
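> >> > >> To make the "uniquely identifiable locations" idea concrete, a
> >> > >> minimal sketch (illustrative only; the names are hypothetical):
> >> > >>
> >> > >>     import org.junit.Rule;
> >> > >>     import org.junit.Test;
> >> > >>     import org.junit.rules.TestName;
> >> > >>
> >> > >>     public class WriteIT {
> >> > >>       @Rule public TestName testName = new TestName();
> >> > >>
> >> > >>       @Test
> >> > >>       public void writesToUniqueTable() {
> >> > >>         // One table per test run; a periodic job can later drop
> >> > >>         // tables matching "write_test_*" past some age threshold.
> >> > >>         String table = String.format("write_test_%s_%d",
> >> > >>             testName.getMethodName(), System.currentTimeMillis());
> >> > >>         // ... point the write test at `table` ...
> >> > >>       }
> >> > >>     }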
> >> > > I agree
> >> > >
> >> > >>
> >> > >> this will tend to go to the direction of long running data store
> >> > >>>
> >> > >> instances rather than data store instances started (and
> >optionally
> >> > loaded)
> >> > >> before tests.
> >> > >> It may be easiest to start with a "data stores stay running"
> >> > >> implementation, and then if we see issues with that move towards
> >tests
> >> > >> that
> >> > >> start/stop the data stores on each run. One thing I'd like to
> >make
> >> sure
> >> > is
> >> > >> that we're not manually tweaking the configurations for data
> >stores.
> >> One
> >> > >> way we could do that is to destroy/recreate the data stores on a
> >> slower
> >> > >> schedule - maybe once per week. That way if the script is
> >changed or
> >> the
> >> > >> data store instances are changed, we'd be able to detect it
> >relatively
> >> > >> soon
> >> > >> while still removing the need for the tests to manage the data
> >stores.
> >> > >>
> >> > > I agree. In addition to configuration manual tweaking, there
> >might be
> >> > > cases in which a data store re-partition data during a test or
> >after
> >> some
> >> > > tests while the dataset changes. The IO must be tolerant to that
> >but
> >> the
> >> > > asserts (number of bundles for example) in test must not fail in
> >that
> >> > case.
> >> > > I would also prefer if possible that the tests do not manage data
> >> stores
> >> > > (not setup them, not start them, not stop them)
> >> > >
> >> > >
> >> > >> as a general note, I suspect many of the folks in the states
> >will be
> >> on
> >> > >> holiday until Jan 2nd/3rd.
> >> > >>
> >> > >> S
> >> > >>
> >> > >> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot
> ><[email protected]
> >> >
> >> > >> wrote:
> >> > >>
> >> > >> Hi,
> >> > >>>
> >> > >>> Recently we had a discussion about integration tests of IOs. I'm
> >> > >>> preparing a PR for integration tests of the Elasticsearch IO
> >> > >>> (https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO
> >> > >>> as a first shot), which are very important IMHO because they
> >> > >>> helped catch some bugs that UT could not (volume, data store
> >> > >>> instance sharing, real data store instance ...).
> >> > >>>
> >> > >>> I would like to have your thoughts/remarks about the points below.
> >> > >>> Some of these points are also discussed here:
> >> > >>>
> >> > >>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
> >> > >>>
> >> > >>> - UT and IT have a similar architecture, but while UT focus on
> >> > >>> testing the correct behavior of the code, including corner cases,
> >> > >>> and use an embedded in-memory data store, IT assume that the
> >> > >>> behavior is correct (strong UT) and focus on higher-volume testing
> >> > >>> and testing against real data store instance(s).
> >> > >>>
> >> > >>> - For now, IT are stored alongside UT in the src/test directory
> >> > >>> of the IO, but they might move to a dedicated module, pending
> >> > >>> consensus. Maven is not configured to run them automatically
> >> > >>> because the data store is not available on the Jenkins server yet.
> >> > >>>
> >> > >>> - For now, they only use the DirectRunner, but they will be run
> >> > >>> against each runner.
> >> > >>>
> >> > >>> - IT do not set up the data store instance (as stated in the above
> >> > >>> document); they assume that one is already running (hardcoded
> >> > >>> configuration in the test for now, waiting for a common solution
> >> > >>> to pass configuration to IT). A docker container script is
> >> > >>> provided in the contrib directory as a starting point for whatever
> >> > >>> orchestration software will be chosen.
> >> > >>>
> >> > >>> - IT load and clean test data before and after each test if
> >> > >>> needed. It is simpler to do so because some tests need an empty
> >> > >>> data store (write test) and because, as discussed in the document,
> >> > >>> tests might not be the only users of the data store. Also IMHO, it
> >> > >>> is better that tests load/clean data than making assumptions about
> >> > >>> the running order of the tests (see the sketch below).
> >> > >>>
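> >> > >>> For example, a minimal JUnit sketch of the load/clean pattern
> >> > >>> (TestDataHelper is a hypothetical helper, not existing code):
> >> > >>>
> >> > >>>     import org.junit.After;
> >> > >>>     import org.junit.Before;
> >> > >>>     import org.junit.Test;
> >> > >>>
> >> > >>>     public class ElasticsearchIOIT {
> >> > >>>       @Before
> >> > >>>       public void loadTestData() {
> >> > >>>         // Load the dataset this test needs into the running
> >> > >>>         // data store (hypothetical helper).
> >> > >>>         TestDataHelper.load();
> >> > >>>       }
> >> > >>>
> >> > >>>       @After
> >> > >>>       public void cleanTestData() {
> >> > >>>         // Remove it so the next test starts from a known state.
> >> > >>>         TestDataHelper.clean();
> >> > >>>       }
> >> > >>>
> >> > >>>       @Test
> >> > >>>       public void testRead() {
> >> > >>>         // ... read from the real instance and assert ...
> >> > >>>       }
> >> > >>>     }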
> >> > >>> If we generalize this pattern to all IT tests, this will tend
> >> > >>> toward long-running data store instances rather than data store
> >> > >>> instances started (and optionally loaded) before tests.
> >> > >>>
> >> > >>> Besides, if we were to change our minds and load data from
> >> > >>> outside the tests, a logstash script is provided.
> >> > >>>
> >> > >>> If you have any thoughts or remarks I'm all ears :)
> >> > >>>
> >> > >>> Regards,
> >> > >>>
> >> > >>> Etienne
> >> > >>>
> >> > >>> Le 14/12/2016 à 17:07, Jean-Baptiste Onofré a écrit :
> >> > >>>
> >> > >>>> Hi Stephen,
> >> > >>>>
> >> > >>>> the purpose of having them in a specific module is to share
> >> > >>>> resources, apply the same behavior from an IT perspective, and be
> >> > >>>> able to have IT "cross" IOs (for instance, reading from JMS and
> >> > >>>> sending to Kafka; I think that's the key idea for integration
> >> > >>>> tests).
> >> > >>>>
> >> > >>>> For instance, in Karaf, we have:
> >> > >>>> - utest in each module
> >> > >>>> - itest module containing itests for all modules all together
> >> > >>>>
> >> > >>>> Regards
> >> > >>>> JB
> >> > >>>>
> >> > >>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
> >> > >>>>
> >> > >>>>> Hi Etienne,
> >> > >>>>>
> >> > >>>>> thanks for following up and answering my questions.
> >> > >>>>>
> >> > >>>>> re: where to store integration tests - having them all in a
> >> separate
> >> > >>>>> module
> >> > >>>>> is an interesting idea. I couldn't find JB's comments about
> >moving
> >> > them
> >> > >>>>> into a separate module in the PR - can you share the reasons
> >for
> >> > >>>>> doing so?
> >> > >>>>> The IO integration/perf tests do seem like they'll need to be
> >> > >>>>> treated in a special manner, but given that there is already an
> >> > >>>>> IO-specific module, it may just be that we need to treat all the
> >> > >>>>> ITs in the IO module the same way. I don't have strong opinions
> >> > >>>>> either way right now.
> >> > >>>>>
> >> > >>>>> S
> >> > >>>>>
> >> > >>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <
> >> > [email protected]>
> >> > >>>>> wrote:
> >> > >>>>>
> >> > >>>>> Hi guys,
> >> > >>>>>
> >> > >>>>> @Stephen: I addressed all your comments directly in the PR,
> >thanks!
> >> > >>>>> I just wanted to comment here about the docker image I used: the
> >> > >>>>> only official Elastic image contains only Elasticsearch. But for
> >> > >>>>> testing I needed logstash (for ingestion) and kibana (not for
> >> > >>>>> integration tests, but to easily test REST requests to ES using
> >> > >>>>> sense). This is why I use an ELK
> >> > >>>>> (Elasticsearch+Logstash+Kibana) image. This one is released
> >> > >>>>> under the Apache 2 license.
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> Besides, there is also a point about where to store integration
> >> > >>>>> tests: JB proposed in the PR to store integration tests in a
> >> > >>>>> dedicated module rather than directly in the IO module (like I
> >> > >>>>> did).
> >> > >>>>>
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> Etienne
> >> > >>>>>
> >> > >>>>> Le 01/12/2016 à 20:14, Stephen Sisk a écrit :
> >> > >>>>>
> >> > >>>>>> hey!
> >> > >>>>>>
> >> > >>>>>> thanks for sending this. I'm very excited to see this
> >change. I
> >> > >>>>>> added some
> >> > >>>>>> detail-oriented code review comments in addition to what
> >I've
> >> > >>>>>> discussed
> >> > >>>>>> here.
> >> > >>>>>>
> >> > >>>>>> The general goal is to allow for re-usable instantiation of
> >> > particular
> >> > >>>>>>
> >> > >>>>> data
> >> > >>>>>
> >> > >>>>>> store instances and this seems like a good start. Looks like
> >you
> >> > >>>>>> also have
> >> > >>>>>> a script to generate test data for your tests - that's
> >great.
> >> > >>>>>>
> >> > >>>>>> The next steps (definitely not blocking your work) will be
> >to have
> >> > >>>>>> ways to
> >> > >>>>>> create instances from the docker images you have here, and
> >use
> >> them
> >> > >>>>>> in the
> >> > >>>>>> tests. We'll need support in the test framework for that
> >since
> >> it'll
> >> > >>>>>> be
> >> > >>>>>> different on developer machines and in the beam jenkins
> >cluster,
> >> but
> >> > >>>>>> your scripts here allow someone running these tests locally to
> >> > >>>>>> avoid worrying about getting the instance set up (and to adjust
> >> > >>>>>> it manually), so this is a good incremental step.
> >> > >>>>>>
> >> > >>>>>> I have some thoughts now that I'm reviewing your scripts
> >(that I
> >> > >>>>>> didn't
> >> > >>>>>> have previously, so we are learning this together):
> >> > >>>>>> * It may be useful to try and document why we chose a
> >particular
> >> > >>>>>> docker
> >> > >>>>>> image as the base (ie, "this is the official supported
> >elastic
> >> > search
> >> > >>>>>> docker image" or "this image has several data stores
> >together that
> >> > >>>>>> can be
> >> > >>>>>> used for a couple different tests")  - I'm curious as to
> >whether
> >> the
> >> > >>>>>> community thinks that is important
> >> > >>>>>>
> >> > >>>>>> One thing that I called out in the comment that's worth
> >mentioning
> >> > >>>>>> on the
> >> > >>>>>> larger list - if you want to specify which specific runners
> >a test
> >> > >>>>>> uses,
> >> > >>>>>> that can be controlled in the pom for the module. I updated
> >the
> >> > >>>>>> testing
> >> > >>>>>>
> >> > >>>>> doc
> >> > >>>>>
> >> > >>>>>> mentioned previously in this thread with a TODO to talk
> >about this
> >> > >>>>>> more. I
> >> > >>>>>> think we should also make it so that IO modules have that
> >> > >>>>>> automatically,
> >> > >>>>>>
> >> > >>>>> so
> >> > >>>>>
> >> > >>>>>> developers don't have to worry about it.
> >> > >>>>>>
> >> > >>>>>> S
> >> > >>>>>>
> >> > >>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <
> >> > [email protected]>
> >> > >>>>>>
> >> > >>>>> wrote:
> >> > >>>>>
> >> > >>>>>> Stephen,
> >> > >>>>>>
> >> > >>>>>> As discussed, I added injection script, docker containers
> >scripts
> >> > and
> >> > >>>>>> integration tests to the sdks/java/io/elasticsearch/contrib
> >> > >>>>>> <
> >> > >>>>>>
> >> > >>>>>> https://github.com/apache/incubator-beam/pull/1439/files/1e7
> >> > >>> e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7
> >> > >>> d824cefcb3ed0b9
> >> > >>>
> >> > >>>> directory in that PR:
> >> > >>>>>> https://github.com/apache/incubator-beam/pull/1439.
> >> > >>>>>>
> >> > >>>>>> These work well, but they are a first shot. Do you have any
> >> > >>>>>> comments about them?
> >> > >>>>>>
> >> > >>>>>> Besides, I am not very sure that these files should be in the
> >> > >>>>>> IO itself (even in the contrib directory, outside the maven
> >> > >>>>>> source directories). Any thoughts?
> >> > >>>>>
> >> > >>>>>> Thanks,
> >> > >>>>>>
> >> > >>>>>> Etienne
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> Le 23/11/2016 à 19:03, Stephen Sisk a écrit :
> >> > >>>>>>
> >> > >>>>>>> It's great to hear more experiences.
> >> > >>>>>>>
> >> > >>>>>>> I'm also glad to hear that people see real value in the
> >high
> >> > >>>>>>> volume/performance benchmark tests. I tried to capture that
> >in
> >> the
> >> > >>>>>>>
> >> > >>>>>> Testing
> >> > >>>>>
> >> > >>>>>> doc I shared, under "Reasons for Beam Test Strategy". [1]
> >> > >>>>>>>
> >> > >>>>>>> It does generally sound like we're in agreement here. Areas
> >of
> >> > >>>>>>> discussion
> >> > >>>>>>>
> >> > >>>>>> I
> >> > >>>>>>
> >> > >>>>>>> see:
> >> > >>>>>>> 1. People like the idea of bringing up fresh instances for
> >> > >>>>>>> each test rather than keeping instances running all the time,
> >> > >>>>>>> since that ensures no contamination between tests. That seems
> >> > >>>>>>> reasonable to me. If we see flakiness in the tests, or we note
> >> > >>>>>>> that setting up/tearing down instances is taking a lot of time,
> >> > >>>>>>> we can revisit this choice.
> >> > >>>>>>> 2. Deciding on cluster management software/orchestration
> >software
> >> > - I
> >> > >>>>>>>
> >> > >>>>>> want
> >> > >>>>>
> >> > >>>>>> to make sure we land on the right tool here since choosing
> >the
> >> > >>>>>>> wrong tool
> >> > >>>>>>> could result in administration of the instances taking more
> >> work. I
> >> > >>>>>>>
> >> > >>>>>> suspect
> >> > >>>>>>
> >> > >>>>>>> that's a good place for a follow up discussion, so I'll
> >start a
> >> > >>>>>>> separate
> >> > >>>>>>> thread on that. I'm happy with whatever tool we choose, but
> >I
> >> want
> >> > to
> >> > >>>>>>>
> >> > >>>>>> make
> >> > >>>>>
> >> > >>>>>> sure we take a moment to consider different options and have
> >a
> >> > >>>>>>> reason for
> >> > >>>>>>> choosing one.
> >> > >>>>>>>
> >> > >>>>>>> Etienne - thanks for being willing to port your
> >creation/other
> >> > >>>>>>> scripts
> >> > >>>>>>> over. You might be a good early tester of whether this
> >system
> >> works
> >> > >>>>>>> well
> >> > >>>>>>> for everyone.
> >> > >>>>>>>
> >> > >>>>>>> Stephen
> >> > >>>>>>>
> >> > >>>>>>> [1]  Reasons for Beam Test Strategy -
> >> > >>>>>>>
> >> > >>>>>>>
> >https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-Np
> >> > >>> rQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
> >> > >>>
> >> > >>>>
> >> > >>>>>>> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré
> >> > >>>>>>> <[email protected]>
> >> > >>>>>>> wrote:
> >> > >>>>>>>
> >> > >>>>>>> I second Etienne there.
> >> > >>>>>>>>
> >> > >>>>>>>> We worked together on the ElasticsearchIO and definitely,
> >the
> >> high
> >> > >>>>>>>> valuable test we did were integration tests with ES on
> >docker
> >> and
> >> > >>>>>>>> high
> >> > >>>>>>>> volume.
> >> > >>>>>>>>
> >> > >>>>>>>> I think we have to distinguish the two kinds of tests:
> >> > >>>>>>>> 1. utests are located in the IO itself and basically they
> >should
> >> > >>>>>>>> cover
> >> > >>>>>>>> the core behaviors of the IO
> >> > >>>>>>>> 2. itests are located as contrib in the IO (they could be
> >part
> >> of
> >> > >>>>>>>> the IO
> >> > >>>>>>>> but executed by the integration-test plugin or a specific
> >> profile)
> >> > >>>>>>>> that
> >> > >>>>>>>> deals with "real" backend and high volumes. The resources
> >> required
> >> > >>>>>>>> by
> >> > >>>>>>>> the itest can be bootstrapped by Jenkins (for instance
> >using
> >> > >>>>>>>> Mesos/Marathon and docker images as already discussed, and
> >it's
> >> > >>>>>>>> what I'm
> >> > >>>>>>>> doing on my own "server").
> >> > >>>>>>>>
> >> > >>>>>>>> It's basically what Stephen described.
> >> > >>>>>>>>
> >> > >>>>>>>> We must not rely only on itests: utests are very important
> >> > >>>>>>>> and they validate the core behavior.
> >> > >>>>>>>>
> >> > >>>>>>>> My $0.01 ;)
> >> > >>>>>>>>
> >> > >>>>>>>> Regards
> >> > >>>>>>>> JB
> >> > >>>>>>>>
> >> > >>>>>>>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
> >> > >>>>>>>>
> >> > >>>>>>>>> Hi Stephen,
> >> > >>>>>>>>>
> >> > >>>>>>>>> I like your proposition very much and I also agree that
> >docker
> >> +
> >> > >>>>>>>>> some
> >> > >>>>>>>>> orchestration software would be great !
> >> > >>>>>>>>>
> >> > >>>>>>>>> On the elasticsearchIO (PR to be created this week) there
> >is
> >> > docker
> >> > >>>>>>>>> container creation scripts and logstash data ingestion
> >script
> >> for
> >> > >>>>>>>>> IT
> >> > >>>>>>>>> environment available in contrib directory alongside with
> >> > >>>>>>>>> integration
> >> > >>>>>>>>> tests themselves. I'll be happy to make them compliant to
> >new
> >> IT
> >> > >>>>>>>>> environment.
> >> > >>>>>>>>>
> >> > >>>>>>>>> What you say below about the need for an external IT
> >> > >>>>>>>>> environment is particularly true. As an example with ES,
> >> > >>>>>>>>> what came out in the first implementation was that there
> >> > >>>>>>>>> were problems starting at some high volume of data
> >> > >>>>>>>>> (timeouts, ES windowing overflow...) that could not have
> >> > >>>>>>>>> been seen on the embedded ES version. Also there were some
> >> > >>>>>>>>> particularities of the external instance, like secondary
> >> > >>>>>>>>> (replica) shards, that were not visible on the embedded
> >> > >>>>>>>>> instance.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Besides, I also favor bringing up instances before tests
> >> > >>>>>>>>> because it allows us (amongst other things) to be sure we
> >> > >>>>>>>>> start on a fresh dataset, so the test is deterministic.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Etienne
> >> > >>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>>> Le 23/11/2016 à 02:00, Stephen Sisk a écrit :
> >> > >>>>>>>>>
> >> > >>>>>>>>>> Hi,
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I'm excited we're getting lots of discussion going.
> >There are
> >> > many
> >> > >>>>>>>>>> threads
> >> > >>>>>>>>>> of conversation here, we may choose to split some of
> >them off
> >> > >>>>>>>>>> into a
> >> > >>>>>>>>>> different email thread. I'm also betting I missed some
> >of the
> >> > >>>>>>>>>> questions in
> >> > >>>>>>>>>> this thread, so apologies ahead of time for that. Also
> >> apologies
> >> > >>>>>>>>>> for
> >> > >>>>>>>>>>
> >> > >>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> amount of text, I provided some quick summaries at the top
> >of
> >> each
> >> > >>>>>>>>>> section.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Amit - thanks for your thoughts. I've responded in detail
> >> > >>>>>>>>>> below.
> >> > >>>>>>>>>> Ismaël - thanks for offering to help. There's plenty of
> >work
> >> > >>>>>>>>>> here to
> >> > >>>>>>>>>>
> >> > >>>>>>>>> go
> >> > >>>>>
> >> > >>>>>> around. I'll try and think about how we can divide up some
> >next
> >> > >>>>>>>>>> steps
> >> > >>>>>>>>>> (probably in a separate thread.) The main next step I
> >see is
> >> > >>>>>>>>>> deciding
> >> > >>>>>>>>>> between kubernetes/mesos+marathon/docker swarm - I'm
> >working
> >> on
> >> > >>>>>>>>>> that,
> >> > >>>>>>>>>>
> >> > >>>>>>>>> but
> >> > >>>>>>>>
> >> > >>>>>>>>> having lots of different thoughts on what the
> >> > >>>>>>>>>> advantages/disadvantages
> >> > >>>>>>>>>>
> >> > >>>>>>>>> of
> >> > >>>>>>>>
> >> > >>>>>>>>> those are would be helpful (I'm not entirely sure of the
> >> > >>>>>>>>>> protocol for
> >> > >>>>>>>>>> collaborating on sub-projects like this.)
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> These issues are all related to what kind of tests we
> >want to
> >> > >>>>>>>>>> write. I
> >> > >>>>>>>>>> think a kubernetes/mesos/swarm cluster could support all
> >the
> >> use
> >> > >>>>>>>>>> cases
> >> > >>>>>>>>>> we've discussed here (and thus should not block moving
> >forward
> >> > >>>>>>>>>> with
> >> > >>>>>>>>>> this),
> >> > >>>>>>>>>> but understanding what we want to test will help us
> >understand
> >> > >>>>>>>>>> how the
> >> > >>>>>>>>>> cluster will be used. I'm working on a proposed user
> >guide for
> >> > >>>>>>>>>> testing
> >> > >>>>>>>>>>
> >> > >>>>>>>>> IO
> >> > >>>>>>>>
> >> > >>>>>>>>> Transforms, and I'm going to send out a link to that + a
> >short
> >> > >>>>>>>>>> summary
> >> > >>>>>>>>>>
> >> > >>>>>>>>> to
> >> > >>>>>>>>
> >> > >>>>>>>>> the list shortly so folks can get a better sense of where
> >I'm
> >> > >>>>>>>>>> coming
> >> > >>>>>>>>>> from.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Here's my thinking on the questions we've raised here -
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Embedded versions of data stores for testing
> >> > >>>>>>>>>> --------------------
> >> > >>>>>>>>>> Summary: yes! But we still need real data stores to test
> >> > against.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I am a gigantic fan of using embedded versions of the
> >various
> >> > data
> >> > >>>>>>>>>> stores.
> >> > >>>>>>>>>> I think we should test everything we possibly can using
> >them,
> >> > >>>>>>>>>> and do
> >> > >>>>>>>>>>
> >> > >>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> majority of our correctness testing using embedded versions
> >+ the
> >> > >>>>>>>>>>
> >> > >>>>>>>>> direct
> >> > >>>>>>
> >> > >>>>>>> runner. However, it's also important to have at least one
> >test
> >> that
> >> > >>>>>>>>>> actually connects to an actual instance, so we can get
> >> coverage
> >> > >>>>>>>>>> for
> >> > >>>>>>>>>> things
> >> > >>>>>>>>>> like credentials, real connection strings, etc...
> >> > >>>>>>>>>>
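> >> > >>>>>>>>>> A rough sketch of the shape I have in mind (EmbeddedDataStore
> >> > >>>>>>>>>> and MyIO are hypothetical placeholders; TestPipeline and
> >> > >>>>>>>>>> PAssert are Beam's existing test utilities):
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>     import org.apache.beam.sdk.testing.PAssert;
> >> > >>>>>>>>>>     import org.apache.beam.sdk.testing.TestPipeline;
> >> > >>>>>>>>>>     import org.apache.beam.sdk.values.PCollection;
> >> > >>>>>>>>>>     import org.junit.Test;
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>     public class MyIOReadTest {
> >> > >>>>>>>>>>       @Test
> >> > >>>>>>>>>>       public void readsFromEmbeddedStore() {
> >> > >>>>>>>>>>         // Hypothetical in-process instance for correctness
> >> > >>>>>>>>>>         // testing; no real cluster involved.
> >> > >>>>>>>>>>         EmbeddedDataStore store = EmbeddedDataStore.start();
> >> > >>>>>>>>>>         store.insert("a", "b", "c");
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>         // Runs on the direct runner.
> >> > >>>>>>>>>>         TestPipeline p = TestPipeline.create();
> >> > >>>>>>>>>>         PCollection<String> records = p.apply(
> >> > >>>>>>>>>>             MyIO.read().withConnection(store.connectionString()));
> >> > >>>>>>>>>>         PAssert.that(records).containsInAnyOrder("a", "b", "c");
> >> > >>>>>>>>>>         p.run();
> >> > >>>>>>>>>>       }
> >> > >>>>>>>>>>     }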
> >> > >>>>>>>>>> The key point is that embedded versions definitely can't
> >cover
> >> > the
> >> > >>>>>>>>>> performance tests, so we need to host instances if we
> >want to
> >> > test
> >> > >>>>>>>>>>
> >> > >>>>>>>>> that.
> >> > >>>>>>
> >> > >>>>>>> I consider the integration tests/performance benchmarks to
> >be
> >> > >>>>>>>>>> costly
> >> > >>>>>>>>>> things
> >> > >>>>>>>>>> that we do only for the IO transforms with large amounts
> >of
> >> > >>>>>>>>>> community
> >> > >>>>>>>>>> support/usage. A random IO transform used by a few users
> >> doesn't
> >> > >>>>>>>>>> necessarily need integration & perf tests, but for
> >heavily
> >> used
> >> > IO
> >> > >>>>>>>>>> transforms, there's a lot of community value in these
> >tests.
> >> The
> >> > >>>>>>>>>> maintenance proposal below scales with the amount of
> >community
> >> > >>>>>>>>>> support
> >> > >>>>>>>>>> for
> >> > >>>>>>>>>> a particular IO transform.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Reusing data stores ("use the data stores across
> >executions.")
> >> > >>>>>>>>>> ------------------
> >> > >>>>>>>>>> Summary: I favor a hybrid approach: some frequently
> >used, very
> >> > >>>>>>>>>> small
> >> > >>>>>>>>>> instances that we keep up all the time + larger
> >> multi-container
> >> > >>>>>>>>>> data
> >> > >>>>>>>>>> store
> >> > >>>>>>>>>> instances that we spin up for perf tests.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I don't think we need to have a strong answer to this
> >> question,
> >> > >>>>>>>>>> but I
> >> > >>>>>>>>>> think
> >> > >>>>>>>>>> we do need to know what range of capabilities we need,
> >and use
> >> > >>>>>>>>>> that to
> >> > >>>>>>>>>> inform our requirements on the hosting infrastructure. I
> >think
> >> > >>>>>>>>>> kubernetes/mesos + docker can support all the scenarios
> >I
> >> > discuss
> >> > >>>>>>>>>>
> >> > >>>>>>>>> below.
> >> > >>>>>>
> >> > >>>>>>> I had been thinking of a hybrid approach - reuse some
> >instances
> >> and
> >> > >>>>>>>>>>
> >> > >>>>>>>>> don't
> >> > >>>>>>>>
> >> > >>>>>>>>> reuse others. Some tests require isolation from other
> >tests
> >> (eg.
> >> > >>>>>>>>>> performance benchmarking), while others can easily
> >re-use the
> >> > same
> >> > >>>>>>>>>> database/data store instance over time, provided they
> >are
> >> > >>>>>>>>>> written in
> >> > >>>>>>>>>>
> >> > >>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> correct manner (eg. a simple read or write correctness
> >> integration
> >> > >>>>>>>>>>
> >> > >>>>>>>>> tests)
> >> > >>>>>>>>
> >> > >>>>>>>>> To me, the question of whether to use one instance over
> >time
> >> for
> >> > a
> >> > >>>>>>>>>> test vs
> >> > >>>>>>>>>> spin up an instance for each test comes down to a trade
> >off
> >> > >>>>>>>>>> between
> >> > >>>>>>>>>>
> >> > >>>>>>>>> these
> >> > >>>>>>>>
> >> > >>>>>>>>> factors:
> >> > >>>>>>>>>> 1. Flakiness of spin-up of an instance - if it's super
> >flaky,
> >> > >>>>>>>>>> we'll
> >> > >>>>>>>>>> want to
> >> > >>>>>>>>>> keep more instances up and running rather than bring
> >them
> >> > up/down.
> >> > >>>>>>>>>>
> >> > >>>>>>>>> (this
> >> > >>>>>>
> >> > >>>>>>> may also vary by the data store in question)
> >> > >>>>>>>>>> 2. Frequency of testing - if we are running tests every
> >5
> >> > >>>>>>>>>> minutes, it
> >> > >>>>>>>>>>
> >> > >>>>>>>>> may
> >> > >>>>>>>>
> >> > >>>>>>>>> be wasteful to bring machines up/down every time. If we
> >run
> >> > >>>>>>>>>> tests once
> >> > >>>>>>>>>>
> >> > >>>>>>>>> a
> >> > >>>>>>
> >> > >>>>>>> day or week, it seems wasteful to keep the machines up the
> >whole
> >> > >>>>>>>>>> time.
> >> > >>>>>>>>>> 3. Isolation requirements - If tests must be isolated,
> >it
> >> means
> >> > we
> >> > >>>>>>>>>>
> >> > >>>>>>>>> either
> >> > >>>>>>>>
> >> > >>>>>>>>> have to bring up the instances for each test, or we have
> >to
> >> have
> >> > >>>>>>>>>> some
> >> > >>>>>>>>>> sort
> >> > >>>>>>>>>> of signaling mechanism to indicate that a given instance
> >is in
> >> > >>>>>>>>>> use. I
> >> > >>>>>>>>>> strongly favor bringing up an instance per test.
> >> > >>>>>>>>>> 4. Number/size of containers - if we need a large number
> >of
> >> > >>>>>>>>>> machines
> >> > >>>>>>>>>> for a
> >> > >>>>>>>>>> particular test, keeping them running all the time will
> >use
> >> more
> >> > >>>>>>>>>> resources.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> The major unknown to me is how flaky it'll be to spin
> >these
> >> up.
> >> > >>>>>>>>>> I'm
> >> > >>>>>>>>>> hopeful/assuming they'll be pretty stable to bring up,
> >but I
> >> > >>>>>>>>>> think the
> >> > >>>>>>>>>> best
> >> > >>>>>>>>>> way to test that is to start doing it.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I suspect the sweet spot is the following: have a set of
> >very
> >> > >>>>>>>>>> small
> >> > >>>>>>>>>>
> >> > >>>>>>>>> data
> >> > >>>>>>
> >> > >>>>>>> store instances that stay up to support small-data-size
> >> post-commit
> >> > >>>>>>>>>> end to
> >> > >>>>>>>>>> end tests (post-commits run frequently and the data size
> >means
> >> > the
> >> > >>>>>>>>>> instances would not use many resources), combined with
> >the
> >> > >>>>>>>>>> ability to
> >> > >>>>>>>>>> spin
> >> > >>>>>>>>>> up larger instances for once a day/week performance
> >benchmarks
> >> > >>>>>>>>>> (these
> >> > >>>>>>>>>>
> >> > >>>>>>>>> use
> >> > >>>>>>>>
> >> > >>>>>>>>> up more resources and are used less frequently.) That's
> >the mix
> >> > >>>>>>>>>> I'll
> >> > >>>>>>>>>> propose in my docs on testing IO transforms.  If
> >spinning up
> >> new
> >> > >>>>>>>>>> instances
> >> > >>>>>>>>>> is cheap/non-flaky, I'd be fine with the idea of
> >spinning up
> >> > >>>>>>>>>> instances
> >> > >>>>>>>>>> for
> >> > >>>>>>>>>> each test.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Management ("what's the overhead of managing such a
> >> deployment")
> >> > >>>>>>>>>> --------------------
> >> > >>>>>>>>>> Summary: I propose that anyone can contribute scripts
> >for
> >> > >>>>>>>>>> setting up
> >> > >>>>>>>>>>
> >> > >>>>>>>>> data
> >> > >>>>>>>>
> >> > >>>>>>>>> store instances + integration/perf tests, but if the
> >community
> >> > >>>>>>>>>> doesn't
> >> > >>>>>>>>>> maintain a particular data store's tests, we disable the
> >tests
> >> > and
> >> > >>>>>>>>>> turn off
> >> > >>>>>>>>>> the data store instances.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Management of these instances is a crucial question.
> >First,
> >> > let's
> >> > >>>>>>>>>>
> >> > >>>>>>>>> break
> >> > >>>>>
> >> > >>>>>> down what tasks we'll need to do on a recurring basis:
> >> > >>>>>>>>>> 1. Ongoing maintenance (update to new versions, both
> >instance
> >> &
> >> > >>>>>>>>>> dependencies) - we don't want to have a lot of old
> >versions
> >> > >>>>>>>>>> vulnerable
> >> > >>>>>>>>>>
> >> > >>>>>>>>> to
> >> > >>>>>>>>
> >> > >>>>>>>>> attacks/buggy
> >> > >>>>>>>>>> 2. Investigate breakages/regressions
> >> > >>>>>>>>>> (I'm betting there will be more things we'll discover -
> >let me
> >> > >>>>>>>>>> know if
> >> > >>>>>>>>>> you
> >> > >>>>>>>>>> have suggestions)
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> There's a couple goals I see:
> >> > >>>>>>>>>> 1. We should only do sys admin work for things that give
> >us a
> >> > >>>>>>>>>> lot of
> >> > >>>>>>>>>> benefit. (ie, don't build IT/perf/data store set up
> >scripts
> >> for
> >> > >>>>>>>>>> data
> >> > >>>>>>>>>> stores
> >> > >>>>>>>>>> without a large community)
> >> > >>>>>>>>>> 2. We should do as much as possible of testing via
> >> > >>>>>>>>>> in-memory/embedded
> >> > >>>>>>>>>> testing (as you brought up).
> >> > >>>>>>>>>> 3. Reduce the amount of manual administration overhead
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> As I discussed above, I think that integration
> >> tests/performance
> >> > >>>>>>>>>> benchmarks
> >> > >>>>>>>>>> are costly things that we should do only for the IO
> >transforms
> >> > >>>>>>>>>> with
> >> > >>>>>>>>>>
> >> > >>>>>>>>> large
> >> > >>>>>>>>
> >> > >>>>>>>>> amounts of community support/usage. Thus, I propose that
> >we
> >> > >>>>>>>>>> limit the
> >> > >>>>>>>>>>
> >> > >>>>>>>>> IO
> >> > >>>>>>
> >> > >>>>>>> transforms that get integration tests & performance
> >benchmarks to
> >> > >>>>>>>>>>
> >> > >>>>>>>>> those
> >> > >>>>>
> >> > >>>>>> that have community support for maintaining the data store
> >> > >>>>>>>>>> instances.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> We can enforce this organically using some simple rules:
> >> > >>>>>>>>>> 1. Investigating breakages/regressions: if a given
> >> > >>>>>>>>>> integration/perf
> >> > >>>>>>>>>>
> >> > >>>>>>>>> test
> >> > >>>>>>
> >> > >>>>>>> starts failing and no one investigates it within a set
> >period of
> >> > >>>>>>>>>> time
> >> > >>>>>>>>>>
> >> > >>>>>>>>> (a
> >> > >>>>>>
> >> > >>>>>>> week?), we disable the tests and shut off the data store
> >> > >>>>>>>>>> instances if
> >> > >>>>>>>>>>
> >> > >>>>>>>>> we
> >> > >>>>>>
> >> > >>>>>>> have instances running. When someone wants to step up and
> >> > >>>>>>>>>> support it
> >> > >>>>>>>>>> again,
> >> > >>>>>>>>>> they can fix the test, check it in, and re-enable the
> >test.
> >> > >>>>>>>>>> 2. Ongoing maintenance: every N months, file a jira
> >issue that
> >> > >>>>>>>>>> is just
> >> > >>>>>>>>>> "is
> >> > >>>>>>>>>> the IO Transform X data store up to date?" - if the jira
> >is
> >> not
> >> > >>>>>>>>>> resolved in
> >> > >>>>>>>>>> a set period of time (1 month?), the perf/integration
> >tests
> >> are
> >> > >>>>>>>>>>
> >> > >>>>>>>>> disabled,
> >> > >>>>>>>>
> >> > >>>>>>>>> and the data store instances shut off.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> This is pretty flexible -
> >> > >>>>>>>>>> * If a particular person or organization wants to
> >support an
> >> IO
> >> > >>>>>>>>>> transform,
> >> > >>>>>>>>>> they can. If a group of people all organically organize
> >to
> >> keep
> >> > >>>>>>>>>> the
> >> > >>>>>>>>>>
> >> > >>>>>>>>> tests
> >> > >>>>>>>>
> >> > >>>>>>>>> running, they can.
> >> > >>>>>>>>>> * It can be mostly automated - there's not a lot of
> >central
> >> > >>>>>>>>>> organizing
> >> > >>>>>>>>>> work
> >> > >>>>>>>>>> that needs to be done.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Exposing the information about what IO transforms
> >currently
> >> have
> >> > >>>>>>>>>>
> >> > >>>>>>>>> running
> >> > >>>>>>
> >> > >>>>>>> IT/perf benchmarks on the website will let users know what
> >IO
> >> > >>>>>>>>>>
> >> > >>>>>>>>> transforms
> >> > >>>>>>
> >> > >>>>>>> are well supported.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I like this solution, but I also recognize this is a
> >tricky
> >> > >>>>>>>>>> problem.
> >> > >>>>>>>>>>
> >> > >>>>>>>>> This
> >> > >>>>>>>>
> >> > >>>>>>>>> is something the community needs to be supportive of, so
> >I'm
> >> > >>>>>>>>>> open to
> >> > >>>>>>>>>> other
> >> > >>>>>>>>>> thoughts.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Simulating failures in real nodes ("programmatic tests
> >to
> >> > simulate
> >> > >>>>>>>>>> failure")
> >> > >>>>>>>>>> -----------------
> >> > >>>>>>>>>> Summary: 1) Focus our testing on the code in Beam 2) We
> >should
> >> > >>>>>>>>>> encourage a
> >> > >>>>>>>>>> design pattern separating out network/retry logic from
> >the
> >> main
> >> > IO
> >> > >>>>>>>>>> transform logic
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> We *could* create instance failure in any container
> >management
> >> > >>>>>>>>>>
> >> > >>>>>>>>> software
> >> > >>>>>
> >> > >>>>>> -
> >> > >>>>>>>>
> >> > >>>>>>>>> we can use their programmatic APIs to determine which
> >> containers
> >> > >>>>>>>>>> are
> >> > >>>>>>>>>> running the instances, and ask them to kill the
> >container in
> >> > >>>>>>>>>> question.
> >> > >>>>>>>>>>
> >> > >>>>>>>>> A
> >> > >>>>>>
> >> > >>>>>>> slow node would be trickier, but I'm sure we could figure
> >it out
> >> > >>>>>>>>>> - for
> >> > >>>>>>>>>> example, add a network proxy that would delay responses.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> However, I would argue that this type of testing doesn't
> >gain
> >> > us a
> >> > >>>>>>>>>> lot, and
> >> > >>>>>>>>>> is complicated to set up. I think it will be easier to
> >test
> >> > >>>>>>>>>> network
> >> > >>>>>>>>>> errors
> >> > >>>>>>>>>> and retry behavior in unit tests for the IO transforms.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Part of the way to handle this is to separate out the
> >read
> >> code
> >> > >>>>>>>>>> from
> >> > >>>>>>>>>>
> >> > >>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> network code (eg. bigtable has BigtableService). If you put
> >the
> >> > >>>>>>>>>>
> >> > >>>>>>>>> "handle
> >> > >>>>>
> >> > >>>>>> errors/retry logic" code in a separate MySourceService
> >class,
> >> > >>>>>>>>>> you can
> >> > >>>>>>>>>> test
> >> > >>>>>>>>>> MySourceService on the wide variety of networks
> >errors/data
> >> > store
> >> > >>>>>>>>>> problems,
> >> > >>>>>>>>>> and then your main IO transform tests focus on the read
> >> behavior
> >> > >>>>>>>>>> and
> >> > >>>>>>>>>> handling the small set of errors the MySourceService
> >class
> >> will
> >> > >>>>>>>>>>
> >> > >>>>>>>>> return.
> >> > >>>>>
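> >> > >>>>>>>>>> To sketch that separation (illustrative only; MySourceService
> >> > >>>>>>>>>> is the hypothetical name from the paragraph above):
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>     import java.util.List;
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>     /** All network/retry concerns live behind this seam. */
> >> > >>>>>>>>>>     interface MySourceService {
> >> > >>>>>>>>>>       /** The small set of errors the reader must handle. */
> >> > >>>>>>>>>>       class ServiceException extends Exception {
> >> > >>>>>>>>>>         ServiceException(String message) { super(message); }
> >> > >>>>>>>>>>       }
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>       // Implementations retry transient network errors
> >> > >>>>>>>>>>       // internally and surface only ServiceException, so the
> >> > >>>>>>>>>>       // IO transform's tests can inject a fake and focus on
> >> > >>>>>>>>>>       // read semantics alone.
> >> > >>>>>>>>>>       List<String> readRecords(String query) throws ServiceException;
> >> > >>>>>>>>>>     }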
> >> > >>>>>> I also think we should focus on testing the IO Transform,
> >not
> >> > >>>>>>>>>> the data
> >> > >>>>>>>>>> store - if we kill a node in a data store, it's that
> >data
> >> > store's
> >> > >>>>>>>>>> problem,
> >> > >>>>>>>>>> not beam's problem. As you were pointing out, there are
> >a
> >> > *large*
> >> > >>>>>>>>>> number of
> >> > >>>>>>>>>> possible ways that a particular data store can fail, and
> >we
> >> > >>>>>>>>>> would like
> >> > >>>>>>>>>>
> >> > >>>>>>>>> to
> >> > >>>>>>>>
> >> > >>>>>>>>> support many different data stores. Rather than try to
> >test
> >> that
> >> > >>>>>>>>>> each
> >> > >>>>>>>>>> data
> >> > >>>>>>>>>> store behaves well, we should ensure that we handle
> >> > >>>>>>>>>> generic/expected
> >> > >>>>>>>>>> errors
> >> > >>>>>>>>>> in a graceful manner.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Ismaël had a couple of other quick comments/questions; I'll
> >> > >>>>>>>>>> answer them here -
> >> > >>>>>
> >> > >>>>>> We can use this to test other runners running on multiple
> >> > >>>>>>>>>> machines - I
> >> > >>>>>>>>>> agree. This is also necessary for a good performance
> >benchmark
> >> > >>>>>>>>>> test.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> "providing the test machines to mount the cluster" - we
> >can
> >> > >>>>>>>>>> discuss
> >> > >>>>>>>>>>
> >> > >>>>>>>>> this
> >> > >>>>>>
> >> > >>>>>>> further, but one possible option is that google may be
> >willing to
> >> > >>>>>>>>>>
> >> > >>>>>>>>> donate
> >> > >>>>>>
> >> > >>>>>>> something to support this.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> "IO Consistency" - let's follow up on those questions in
> >> another
> >> > >>>>>>>>>>
> >> > >>>>>>>>> thread.
> >> > >>>>>>
> >> > >>>>>>> That's as much about the public interface we provide to
> >users as
> >> > >>>>>>>>>>
> >> > >>>>>>>>> anything
> >> > >>>>>>>>
> >> > >>>>>>>>> else. I agree with your sentiment that a user should be
> >able to
> >> > >>>>>>>>>> expect
> >> > >>>>>>>>>> predictable behavior from the different IO transforms.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Thanks for everyone's questions/comments - I really am
> >excited
> >> > >>>>>>>>>> to see
> >> > >>>>>>>>>> that
> >> > >>>>>>>>>> people care about this :)
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Stephen
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <
> >> [email protected]
> >> > >
> >> > >>>>>>>>>>
> >> > >>>>>>>>> wrote:
> >> > >>>>>
> >> > >>>>>> Hello,
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> @Stephen Thanks for your proposal, it is really
> >interesting,
> >> I
> >> > >>>>>>>>>>> would
> >> > >>>>>>>>>>> really
> >> > >>>>>>>>>>> like to help with this. I have never played with
> >Kubernetes
> >> but
> >> > >>>>>>>>>>> this
> >> > >>>>>>>>>>> seems
> >> > >>>>>>>>>>> a really nice chance to do something useful with it.
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> We (at Talend) are testing most of the IOs using simple
> >> > container
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>> images
> >> > >>>>>>>>
> >> > >>>>>>>>> and in some particular cases ‘clusters’ of containers
> >using
> >> > >>>>>>>>>>> docker-compose
> >> > >>>>>>>>>>> (a little bit like Amit’s (2) proposal). It would be
> >really
> >> > >>>>>>>>>>> nice to
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>> have
> >> > >>>>>>>>
> >> > >>>>>>>>> this at the Beam level, in particular to try to test more
> >> complex
> >> > >>>>>>>>>>> semantics, I don’t know how programmable kubernetes is
> >to
> >> > achieve
> >> > >>>>>>>>>>> this for
> >> > >>>>>>>>>>> example:
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Let’s think we have a cluster of Cassandra or Kafka
> >nodes, I
> >> > >>>>>>>>>>> would
> >> > >>>>>>>>>>> like to
> >> > >>>>>>>>>>> have programmatic tests to simulate failure (e.g. kill
> >a
> >> node),
> >> > >>>>>>>>>>> or
> >> > >>>>>>>>>>> simulate
> >> > >>>>>>>>>>> a really slow node, to ensure that the IO behaves as
> >expected
> >> > >>>>>>>>>>> in the
> >> > >>>>>>>>>>> Beam
> >> > >>>>>>>>>>> pipeline for the given runner.
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Another related idea is to improve IO consistency: today
> >> > >>>>>>>>>>> the different IOs have small differences in their failure
> >> > >>>>>>>>>>> behavior. I really would like to be able to predict with
> >> > >>>>>>>>>>> more precision what will happen in case of errors,
> >> > >>>>>>
> >> > >>>>>>>>>>> e.g. what is the correct behavior if I am writing to a
> >> > >>>>>>>>>>> Kafka node and there is a network partition? Does the Kafka
> >> > >>>>>>>>>>> sink retry or not? And what if it is the JdbcIO - will it
> >> > >>>>>>>>>>> work the same, e.g. assuming checkpointing?
> >> > >>>>>>>>>>> Or do
> >> > >>>>>>>>>>> we guarantee exactly once writes somehow?, today I am
> >not
> >> sure
> >> > >>>>>>>>>>> about
> >> > >>>>>>>>>>> what
> >> > >>>>>>>>>>> happens (or if the expected behavior depends on the
> >runner),
> >> > >>>>>>>>>>> but well
> >> > >>>>>>>>>>> maybe
> >> > >>>>>>>>>>> it is just that I don’t know and we have tests to
> >ensure
> >> this.
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Of course both are really hard problems, but I think
> >with
> >> your
> >> > >>>>>>>>>>> proposal we
> >> > >>>>>>>>>>> can try to tackle them, as well as the performance
> >ones. And
> >> > >>>>>>>>>>> apart of
> >> > >>>>>>>>>>> the
> >> > >>>>>>>>>>> data stores, I think it will be also really nice to be
> >able
> >> to
> >> > >>>>>>>>>>> test
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> runners in a distributed manner.
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> So what is the next step? How do you imagine such
> >> > >>>>>>>>>>> integration tests? Who can provide the test machines so we
> >> > >>>>>>>>>>> can mount the cluster?
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Maybe my ideas are a bit too far away for an initial
> >setup,
> >> but
> >> > >>>>>>>>>>> it
> >> > >>>>>>>>>>> will be
> >> > >>>>>>>>>>> really nice to start working on this.
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Ismael​
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <
> >> > >>>>>>>>>>> [email protected]
> >> > >>>>>>>>>>> wrote:
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Hi Stephen,
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> I was wondering about how we plan to use the data
> >stores
> >> > across
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> executions.
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>> Clearly, it's best to set up a new instance (container)
> >> > >>>>>>>>>>>> for every test, running a "standalone" store (say
> >> > >>>>>>>>>>>> HBase/Cassandra, for example), and once the test is done,
> >> > >>>>>>>>>>>> tear down the instance. It should also be agnostic to the
> >> > >>>>>>>>>>>> runtime environment (e.g., Docker on Kubernetes).
> >> > >>>>>>>>>>>> I'm wondering though what's the overhead of managing
> >such a
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> deployment
> >> > >>>>>>
> >> > >>>>>>> which could become heavy and complicated as more IOs are
> >> > >>>>>>>>>>>> supported
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> and
> >> > >>>>>>
> >> > >>>>>>> more
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>> test cases introduced.
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> Another way to go would be to have small clusters of
> >> different
> >> > >>>>>>>>>>>> data
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> stores
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>> and run against new "namespaces" (while lazily
> >evicting old
> >> > >>>>>>>>>>>> ones),
> >> > >>>>>>>>>>>> but I
> >> > >>>>>>>>>>>> think this is less likely as maintaining a distributed
> >> > instance
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> (even
> >> > >>>>>
> >> > >>>>>> a
> >> > >>>>>>>>
> >> > >>>>>>>>> small one) for each data store sounds even more complex.
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> A third approach would be to simply have an "embedded"
> >> > >>>>>>>>>>>> in-memory instance of a data store as part of a test that
> >> > >>>>>>>>>>>> runs against it (such as an embedded Kafka, though that is
> >> > >>>>>>>>>>>> not a data store).
> >> > >>>>>>>>>>>> This is probably the simplest solution in terms of
> >> > >>>>>>>>>>>> orchestration,
> >> > >>>>>>>>>>>> but it
> >> > >>>>>>>>>>>> depends on having a proper "embedded" implementation
> >for an
> >> > IO.
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> Does this make sense to you ? have you considered it ?
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> Thanks,
> >> > >>>>>>>>>>>> Amit
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> [email protected]
> >> > >>>>>
> >> > >>>>>> wrote:
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> Hi Stephen,
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> as already discussed a bit together, it sounds great!
> >> > >>>>>>>>>>>>> Especially, I like it as both an integration test
> >> > >>>>>>>>>>>>> platform and good coverage for IOs.
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>
