Thanks for your analysis, Stephen. Good arguments and references.

One quick question. Have you checked the APIs of both (Mesos/Kubernetes) to
see if we can programmatically do more complex tests (I suppose so, but you
don't mention whether they are possible or how easy they would be)? For
example, simulating a slave with slow networking (to test stragglers), or
arbitrarily killing one slave (e.g. if I want to test the correct behavior of
a runner/IO that is reading from it).
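
To make the first question concrete, here is a minimal sketch of the kind of
test I have in mind, assuming Kubernetes and the fabric8 kubernetes-client
(the namespace and pod names here are hypothetical):

    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;

    public class KillNodeSketch {
      public static void main(String[] args) {
        // Connects using the local kubeconfig (or in-cluster config).
        try (KubernetesClient client = new DefaultKubernetesClient()) {
          // Kill one member of a hypothetical "cassandra" StatefulSet while
          // a read pipeline is running, then assert that the pipeline still
          // produces the expected results.
          client.pods().inNamespace("io-tests").withName("cassandra-1").delete();
        }
      }
    }

I assume the equivalent is possible against Marathon through its REST API,
but I haven't checked.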

Another point missing from the review is the availability of ready-to-use
packages; I think in this area Mesos/DC/OS seems more advanced, no? I haven't
looked recently, but at least 6 months ago there were not many Helm packages
ready, for example to test Kafka or the Hadoop ecosystem stuff (HDFS, HBase,
etc). Has this improved? Preparing these is also a considerable amount of
work; on the other hand, it could also be a chance to contribute to
Kubernetes.

Regards,
Ismaël



On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <[email protected]>
wrote:

> hi!
>
> I've been continuing this investigation, and have some more info to report,
> and hopefully we can start making some decisions.
>
> To support performance testing, I've been investigating mesos+marathon and
> kubernetes for running data stores in their high availability mode. I have
> been examining features that kubernetes/mesos+marathon use to support this.
>
> Setting up a multi-node cluster in a high availability mode tends to be
> more expensive time-wise than the single node instances I've played around
> with in the past. Rather than do a full build out with both kubernetes and
> mesos, I'd like to pick one of the two options to build the prototype
> cluster with. If the prototype doesn't go well, we could still go back to
> the other option, but I'd like to change us from a mode of "let's look at
> all the options" to one of "here's the favorite, let's prove that works for
> us".
>
> Below are the features that I've found are important for multi-node
> instances of data stores. I'm sure other folks on the list have done this
> before, so feel free to pipe up if I'm missing a good solution to a problem.
>
> DNS/Discovery
>
> --------------------
>
> Necessary for talking between nodes (eg, cassandra nodes all need to be
> able to talk to a set of seed nodes.)
>
> * Kubernetes has built-in DNS/discovery between nodes.
>
> * Mesos supports this via mesos-dns, which isn't part of core mesos,
> but is in dcos, which is the mesos distribution I've been using and that I
> would expect us to use.
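>
> As a concrete example on the kubernetes side: with a headless service named
> "cassandra", each StatefulSet pod gets a stable DNS name of the form
> cassandra-0.cassandra.default.svc.cluster.local (assuming the default
> namespace and cluster DNS suffix), which works nicely as a seed node address.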
>
> Instances properly distributed across nodes
>
> ------------------------------------------------------------
>
> If multiple instances of a data source end up on the same underlying VM, we
> may not get good performance out of those instances since the underlying VM
> may be more taxed than other VMs.
>
> * Kubernetes has a beta feature, StatefulSets [1], which allows containers to
> be distributed so that there's one container per underlying machine (as well
> as a lot of other useful features, like easy stable DNS names.)
>
> * Mesos can support this via the built-in UNIQUE constraint [2]
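>
> (For reference, in a Marathon app definition that's a one-line entry, per the
> docs in [2]: "constraints": [["hostname", "UNIQUE"]].)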
>
> Load balancing
>
> --------------------
>
> Incoming requests from users need to be distributed to the various machines
> - this is important for many data stores' high availability modes.
>
> * Kubernetes supports easily hooking up to an external load balancer when
> on a cloud (and can be configured to work with a built-in load balancer if
> not)
>
> * Mesos supports this via marathon-lb [3], which is an install-able package
> in DC/OS
>
> Persistent Volumes tied to specific instances
>
> ------------------------------------------------------------
>
> Databases often need persistent state (for example to store the data :), so
> it's an important part of running our service.
>
> * Kubernetes StatefulSets supports this
>
> * Mesos+marathon apps with persistent volumes support this [4] [5]
>
> As I mentioned above, I'd like to focus on either kubernetes or mesos for
> my investigation, and as I go further along, I'm seeing kubernetes as
> better suited to our needs.
>
> (1) It supports more of the features we want out of the box, and with
> StatefulSets, Kubernetes handles them all together neatly - e.g. DC/OS
> requires marathon-lb to be installed and mesos-dns to be configured.
>
> (2) I'm also finding that there seem to be more examples of using
> kubernetes to solve the types of problems we're working on. This is
> somewhat subjective, but in my experience as I've tried to learn both
> kubernetes and mesos, I personally found it generally easier to get
> kubernetes running than mesos due to the tutorials/examples available for
> kubernetes.
>
> (3) Lower cost of initial setup - as I discussed in a previous mail[6],
> kubernetes was far easier to get set up even when I knew the exact steps.
> Mesos took me around 27 steps [7], which involved a lot of config that was
> easy to get wrong (it took me about 5 tries to get all the steps correct in
> one go.) Kubernetes took me around 8 steps and very little config.
>
> Given that, I'd like to focus my investigation/prototyping on Kubernetes.
> To
> be clear, it's fairly close and I think both Mesos and Kubernetes could
> support what we need, so if we run into issues with kubernetes, Mesos still
> seems like a viable option that we could fall back to.
>
> Thanks,
> Stephen
>
>
> [1] Kubernetes StatefulSets
> https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/
>
> [2] mesos unique constraint -
> https://mesosphere.github.io/marathon/docs/constraints.html
>
> [3]
> https://mesosphere.github.io/marathon/docs/service-discovery-load-balancing.html
> and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/
>
> [4] https://mesosphere.github.io/marathon/docs/persistent-volumes.html
>
> [5] https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/
>
> [6] Container Orchestration software for hosting data stores
> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>
> [7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md
>
>
> On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <[email protected]> wrote:
>
> > Just a quick drive-by comment: how tests are laid out has non-trivial
> > tradeoffs on how/where continuous integration runs, and how results are
> > integrated into the tooling. The current state is certainly not ideal
> > (e.g., due to multiple test executions some links in Jenkins point where
> > they shouldn't), but most other alternatives had even bigger drawbacks at
> > the time. If someone has great ideas that don't explode the number of
> > modules, please share ;-)
> >
> > On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot <[email protected]>
> > wrote:
> >
> > > Hi Stephen,
> > >
> > > Thanks for taking the time to comment.
> > >
> > > My comments are below in the email:
> > >
> > >
> > > On 24/12/2016 at 00:07, Stephen Sisk wrote:
> > >
> > >> hey Etienne -
> > >>
> > >> thanks for your thoughts and thanks for sharing your experiences. I
> > >> generally agree with what you're saying. Quick comments below:
> > >>
> > >>> IT are stored alongside with UT in src/test directory of the IO but they
> > >>> might go to dedicated module, waiting for a consensus
> > >>
> > >> I don't have a strong opinion or feel that I've worked enough with maven
> > >> to understand all the consequences - I'd love for someone with more maven
> > >> experience to weigh in. If this becomes blocking, I'd say check it in, and
> > >> we can refactor later if it proves problematic.
> > >>
> > > Sure, not a blocking point, it could be refactored afterwards. Just as a
> > > reminder, JB mentioned that storing IT in a separate module allows us to
> > > have more coherence between all IT (same behavior) and to do cross-IO
> > > integration tests. JB, have you experienced some long-term drawbacks of
> > > storing IT in a separate module, like, for example, more difficult
> > > maintenance due to "distance" from production code?
> > >
> > >
> > >>> Also IMHO, it is better that tests load/clean data than making
> > >>> assumptions about the running order of the tests.
> > >>
> > >> I definitely agree that we don't want to make assumptions about the
> > >> running order of the tests - that way lies pain. :) It will be interesting
> > >> to see how the performance tests work out since they will need more data
> > >> (and thus loading data can take much longer.)
> > >>
> > > Yes, performance testing might push in the direction of data loading from
> > > outside the tests due to loading time.
> > >
> > >> This should also be an easier problem
> > >> for read tests than for write tests - if we have long running instances,
> > >> read tests don't really need cleanup. And if write tests only write a
> > >> small amount of data, as long as we are sure we're writing to uniquely
> > >> identifiable locations (ie, new table per test or something similar), we
> > >> can clean up the write test data on a slower schedule.
> > >>
> > > I agree
> > >
> > >>
> > >>> this will tend to go to the direction of long running data store
> > >>> instances rather than data store instances started (and optionally
> > >>> loaded) before tests.
> > >>
> > >> It may be easiest to start with a "data stores stay running"
> > >> implementation, and then if we see issues with that, move towards tests
> > >> that start/stop the data stores on each run. One thing I'd like to make
> > >> sure is that we're not manually tweaking the configurations for data
> > >> stores. One way we could do that is to destroy/recreate the data stores
> > >> on a slower schedule - maybe once per week. That way if the script is
> > >> changed or the data store instances are changed, we'd be able to detect
> > >> it relatively soon while still removing the need for the tests to manage
> > >> the data stores.
> > >>
> > > I agree. In addition to manual configuration tweaking, there might be
> > > cases in which a data store re-partitions data during a test, or after
> > > some tests as the dataset changes. The IO must be tolerant to that, but
> > > the asserts (number of bundles, for example) in the test must not fail in
> > > that case. I would also prefer, if possible, that the tests do not manage
> > > data stores (not set them up, not start them, not stop them).
> > >
> > >
> > >> As a general note, I suspect many of the folks in the States will be on
> > >> holiday until Jan 2nd/3rd.
> > >>
> > >> S
> > >>
> > >> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot <[email protected]>
> > >> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Recently we had a discussion about integration tests of IOs. I'm
> > >>> preparing a PR for integration tests of the elasticSearch IO
> > >>> (https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO
> > >>> as a first shot) which are very important IMHO because they helped catch
> > >>> some bugs that UT could not (volume, data store instance sharing, real
> > >>> data store instance ...)
> > >>>
> > >>> I would like to have your thoughts/remarks about the points below. Some
> > >>> of these points are also discussed here:
> > >>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
> > >>>
> > >>> - UT and IT have a similar architecture, but while UT focus on testing
> > >>> the correct behavior of the code, including corner cases, and use an
> > >>> embedded in-memory data store, IT assume that the behavior is correct
> > >>> (strong UT) and focus on higher-volume testing and testing against real
> > >>> data store instance(s)
> > >>>
> > >>> - For now, IT are stored alongside with UT in the src/test directory of
> > >>> the IO, but they might go to a dedicated module, waiting for a
> > >>> consensus. Maven is not configured to run them automatically because the
> > >>> data store is not available on the jenkins server yet
> > >>>
> > >>> - For now, they only use the DirectRunner, but they will be run against
> > >>> each runner.
> > >>>
> > >>> - IT do not set up the data store instance (as stated in the above
> > >>> document); they assume that one is already running (hardcoded
> > >>> configuration in the test for now, waiting for a common solution to pass
> > >>> configuration to IT). A docker container script is provided in the
> > >>> contrib directory as a starting point for whatever orchestration
> > >>> software will be chosen.
> > >>>
> > >>> - IT load and clean test data before and after each test if needed. It
> > >>> is simpler to do so because some tests need an empty data store (write
> > >>> test) and because, as discussed in the document, tests might not be the
> > >>> only users of the data store. Also IMHO, it is better that tests
> > >>> load/clean data than making assumptions about the running order of the
> > >>> tests (see the sketch just below).
> > >>>
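> > >>> As an illustration of the load/clean pattern, a minimal JUnit sketch
> > >>> (the TestDataLoader helper and its methods are hypothetical, not the
> > >>> actual PR code):
> > >>>
> > >>>   import org.junit.After;
> > >>>   import org.junit.Before;
> > >>>   import org.junit.Test;
> > >>>
> > >>>   public class ElasticsearchIOIT {
> > >>>     // hypothetical helper wrapping the data store's client/REST calls
> > >>>     private final TestDataLoader loader =
> > >>>         new TestDataLoader("http://localhost:9200");
> > >>>     // unique name so concurrent tests never share data
> > >>>     private final String index = "beam-it-" + System.nanoTime();
> > >>>
> > >>>     @Before
> > >>>     public void loadData() {
> > >>>       // every test starts from a freshly loaded, uniquely named index
> > >>>       loader.createIndex(index);
> > >>>       loader.loadDocuments(index, 1000);
> > >>>     }
> > >>>
> > >>>     @After
> > >>>     public void cleanData() {
> > >>>       // leave the shared instance clean for other users of the store
> > >>>       loader.deleteIndex(index);
> > >>>     }
> > >>>
> > >>>     @Test
> > >>>     public void testRead() {
> > >>>       // run the pipeline against "index" and assert on the results
> > >>>     }
> > >>>   }
> > >>>
> > >>> This way no test depends on what another test left behind.
> > >>>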
> > >>> If we generalize this pattern to all IT tests, this will tend to go in
> > >>> the direction of long-running data store instances rather than data
> > >>> store instances started (and optionally loaded) before tests.
> > >>>
> > >>> Besides, if we were to change our minds and load data from outside the
> > >>> tests, a logstash script is provided.
> > >>>
> > >>> If you have any thoughts or remarks I'm all ears :)
> > >>>
> > >>> Regards,
> > >>>
> > >>> Etienne
> > >>>
> > >>> On 14/12/2016 at 17:07, Jean-Baptiste Onofré wrote:
> > >>>
> > >>>> Hi Stephen,
> > >>>>
> > >>>> the purpose of having them in a specific module is to share resources,
> > >>>> apply the same behavior from an IT perspective, and be able to have IT
> > >>>> "cross" IOs (for instance, reading from JMS and sending to Kafka; I
> > >>>> think that's the key idea for integration tests).
> > >>>>
> > >>>> For instance, in Karaf, we have:
> > >>>> - utest in each module
> > >>>> - itest module containing itests for all modules all together
> > >>>>
> > >>>> Regards
> > >>>> JB
> > >>>>
> > >>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
> > >>>>
> > >>>>> Hi Etienne,
> > >>>>>
> > >>>>> thanks for following up and answering my questions.
> > >>>>>
> > >>>>> re: where to store integration tests - having them all in a separate
> > >>>>> module is an interesting idea. I couldn't find JB's comments about
> > >>>>> moving them into a separate module in the PR - can you share the
> > >>>>> reasons for doing so? The IO integration/perf tests do seem like
> > >>>>> they'll need to be treated in a special manner, but given that there
> > >>>>> is already an IO specific module, it may just be that we need to treat
> > >>>>> all the ITs in the IO module the same way. I don't have strong
> > >>>>> opinions either way right now.
> > >>>>>
> > >>>>> S
> > >>>>>
> > >>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <[email protected]>
> > >>>>> wrote:
> > >>>>>
> > >>>>> Hi guys,
> > >>>>>
> > >>>>> @Stephen: I addressed all your comments directly in the PR, thanks!
> > >>>>> I just wanted to comment here about the docker image I used: the only
> > >>>>> official Elastic image contains only ElasticSearch. But for testing I
> > >>>>> needed logstash (for ingestion) and kibana (not for integration tests,
> > >>>>> but to easily test REST requests to ES using sense). This is why I use
> > >>>>> an ELK (Elasticsearch+Logstash+Kibana) image. This one is released
> > >>>>> under the Apache 2 license.
> > >>>>>
> > >>>>>
> > >>>>> Besides, there is also a point about where to store integration
> > >>>>> tests: JB proposed in the PR to store integration tests in a dedicated
> > >>>>> module rather than directly in the IO module (like I did).
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Etienne
> > >>>>>
> > >>>>> On 01/12/2016 at 20:14, Stephen Sisk wrote:
> > >>>>>
> > >>>>>> hey!
> > >>>>>>
> > >>>>>> thanks for sending this. I'm very excited to see this change. I
> > >>>>>> added some
> > >>>>>> detail-oriented code review comments in addition to what I've
> > >>>>>> discussed
> > >>>>>> here.
> > >>>>>>
> > >>>>>> The general goal is to allow for re-usable instantiation of
> > >>>>>> particular data store instances and this seems like a good start.
> > >>>>>> Looks like you also have a script to generate test data for your
> > >>>>>> tests - that's great.
> > >>>>>>
> > >>>>>> The next steps (definitely not blocking your work) will be to have
> > >>>>>> ways to create instances from the docker images you have here, and
> > >>>>>> use them in the tests. We'll need support in the test framework for
> > >>>>>> that since it'll be different on developer machines and in the beam
> > >>>>>> jenkins cluster, but your scripts here allow someone running these
> > >>>>>> tests locally to not have to worry about getting the instance set up
> > >>>>>> and can manually adjust, so this is a good incremental step.
> > >>>>>>
> > >>>>>> I have some thoughts now that I'm reviewing your scripts (that I
> > >>>>>> didn't
> > >>>>>> have previously, so we are learning this together):
> > >>>>>> * It may be useful to try and document why we chose a particular
> > >>>>>> docker image as the base (ie, "this is the official supported elastic
> > >>>>>> search docker image" or "this image has several data stores together
> > >>>>>> that can be used for a couple different tests") - I'm curious as to
> > >>>>>> whether the community thinks that is important
> > >>>>>>
> > >>>>>> One thing that I called out in the comment that's worth mentioning
> > >>>>>> on the larger list - if you want to specify which specific runners a
> > >>>>>> test uses, that can be controlled in the pom for the module. I
> > >>>>>> updated the testing doc mentioned previously in this thread with a
> > >>>>>> TODO to talk about this more. I think we should also make it so that
> > >>>>>> IO modules have that automatically, so developers don't have to worry
> > >>>>>> about it.
> > >>>>>>
> > >>>>>> S
> > >>>>>>
> > >>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <[email protected]>
> > >>>>>> wrote:
> > >>>>>
> > >>>>>> Stephen,
> > >>>>>>
> > >>>>>> As discussed, I added injection script, docker containers scripts and
> > >>>>>> integration tests to the sdks/java/io/elasticsearch/contrib
> > >>>>>> <https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9>
> > >>>>>> directory in that PR:
> > >>>>>> https://github.com/apache/incubator-beam/pull/1439.
> > >>>>>>
> > >>>>>> These work well but they are a first shot. Do you have any comments
> > >>>>>> about those?
> > >>>>>>
> > >>>>>> Besides, I am not very sure that these files should be in the IO
> > >>>>>> itself (even in the contrib directory, out of maven source
> > >>>>>> directories). Any thoughts?
> > >>>>>
> > >>>>>> Thanks,
> > >>>>>>
> > >>>>>> Etienne
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On 23/11/2016 at 19:03, Stephen Sisk wrote:
> > >>>>>>
> > >>>>>>> It's great to hear more experiences.
> > >>>>>>>
> > >>>>>>> I'm also glad to hear that people see real value in the high
> > >>>>>>> volume/performance benchmark tests. I tried to capture that in the
> > >>>>>>> Testing doc I shared, under "Reasons for Beam Test Strategy". [1]
> > >>>>>>>
> > >>>>>>> It does generally sound like we're in agreement here. Areas of
> > >>>>>>> discussion I see:
> > >>>>>>> 1. People like the idea of bringing up fresh instances for each test
> > >>>>>>> rather than keeping instances running all the time, since that
> > >>>>>>> ensures no contamination between tests. That seems reasonable to me.
> > >>>>>>> If we see flakiness in the tests or we note that setting up/tearing
> > >>>>>>> down instances is taking a lot of time,
> > >>>>>>> 2. Deciding on cluster management software/orchestration software -
> > >>>>>>> I want to make sure we land on the right tool here since choosing
> > >>>>>>> the wrong tool could result in administration of the instances
> > >>>>>>> taking more work. I suspect that's a good place for a follow-up
> > >>>>>>> discussion, so I'll start a separate thread on that. I'm happy with
> > >>>>>>> whatever tool we choose, but I want to make sure we take a moment to
> > >>>>>>> consider different options and have a reason for choosing one.
> > >>>>>>>
> > >>>>>>> Etienne - thanks for being willing to port your creation/other
> > >>>>>>> scripts over. You might be a good early tester of whether this
> > >>>>>>> system works well for everyone.
> > >>>>>>>
> > >>>>>>> Stephen
> > >>>>>>>
> > >>>>>>> [1] Reasons for Beam Test Strategy -
> > >>>>>>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
> > >>>>
> > >>>>>>> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré
> > >>>>>>> <[email protected]>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>> I second Etienne there.
> > >>>>>>>>
> > >>>>>>>> We worked together on the ElasticsearchIO and definitely, the most
> > >>>>>>>> valuable tests we did were integration tests with ES on docker and
> > >>>>>>>> high volume.
> > >>>>>>>>
> > >>>>>>>> I think we have to distinguish the two kinds of tests:
> > >>>>>>>> 1. utests are located in the IO itself and basically they should
> > >>>>>>>> cover the core behaviors of the IO
> > >>>>>>>> 2. itests are located as contrib in the IO (they could be part of
> > >>>>>>>> the IO but executed by the integration-test plugin or a specific
> > >>>>>>>> profile) and deal with "real" backends and high volumes. The
> > >>>>>>>> resources required by the itests can be bootstrapped by Jenkins
> > >>>>>>>> (for instance using Mesos/Marathon and docker images as already
> > >>>>>>>> discussed, and it's what I'm doing on my own "server").
> > >>>>>>>>
> > >>>>>>>> It's basically what Stephen described.
> > >>>>>>>>
> > >>>>>>>> We must not rely only on itests: utests are very important and
> > >>>>>>>> they validate the core behavior.
> > >>>>>>>>
> > >>>>>>>> My $0.01 ;)
> > >>>>>>>>
> > >>>>>>>> Regards
> > >>>>>>>> JB
> > >>>>>>>>
> > >>>>>>>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Stephen,
> > >>>>>>>>>
> > >>>>>>>>> I like your proposition very much and I also agree that docker +
> > >>>>>>>>> some orchestration software would be great!
> > >>>>>>>>>
> > >>>>>>>>> On the elasticsearchIO (PR to be created this week) there are
> > >>>>>>>>> docker container creation scripts and a logstash data ingestion
> > >>>>>>>>> script for the IT environment available in the contrib directory
> > >>>>>>>>> alongside the integration tests themselves. I'll be happy to make
> > >>>>>>>>> them compliant with the new IT environment.
> > >>>>>>>>>
> > >>>>>>>>> What you say below about the need for an external IT environment
> > >>>>>>>>> is particularly true. As an example with ES, what came out in the
> > >>>>>>>>> first implementation was that there were problems starting at some
> > >>>>>>>>> high volume of data (timeouts, ES windowing overflow...) that
> > >>>>>>>>> could not have been seen on the embedded ES version. Also there
> > >>>>>>>>> were some particularities of the external instance, like secondary
> > >>>>>>>>> (replica) shards, that were not visible on the embedded instance.
> > >>>>>>>>>
> > >>>>>>>>> Besides, I also favor bringing up instances before each test
> > >>>>>>>>> because it allows (amongst other things) to be sure to start on a
> > >>>>>>>>> fresh dataset, for the test to be deterministic.
> > >>>>>>>>>
> > >>>>>>>>> Etienne
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On 23/11/2016 at 02:00, Stephen Sisk wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>> I'm excited we're getting lots of discussion going. There are
> > >>>>>>>>>> many threads of conversation here; we may choose to split some of
> > >>>>>>>>>> them off into a different email thread. I'm also betting I missed
> > >>>>>>>>>> some of the questions in this thread, so apologies ahead of time
> > >>>>>>>>>> for that. Also apologies for the amount of text, I provided some
> > >>>>>>>>>> quick summaries at the top of each section.
> > >>>>>>>>>>
> > >>>>>>>>>> Amit - thanks for your thoughts. I've responded in detail below.
> > >>>>>>>>>> Ismaël - thanks for offering to help. There's plenty of work here
> > >>>>>>>>>> to go around. I'll try and think about how we can divide up some
> > >>>>>>>>>> next steps (probably in a separate thread.) The main next step I
> > >>>>>>>>>> see is deciding between kubernetes/mesos+marathon/docker swarm -
> > >>>>>>>>>> I'm working on that, but having lots of different thoughts on
> > >>>>>>>>>> what the advantages/disadvantages of those are would be helpful
> > >>>>>>>>>> (I'm not entirely sure of the protocol for collaborating on
> > >>>>>>>>>> sub-projects like this.)
> > >>>>>>>>>>
> > >>>>>>>>>> These issues are all related to what kind of tests we want to
> > >>>>>>>>>> write. I think a kubernetes/mesos/swarm cluster could support all
> > >>>>>>>>>> the use cases we've discussed here (and thus should not block
> > >>>>>>>>>> moving forward with this), but understanding what we want to test
> > >>>>>>>>>> will help us understand how the cluster will be used. I'm working
> > >>>>>>>>>> on a proposed user guide for testing IO Transforms, and I'm going
> > >>>>>>>>>> to send out a link to that + a short summary to the list shortly
> > >>>>>>>>>> so folks can get a better sense of where I'm coming from.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Here's my thinking on the questions we've raised here -
> > >>>>>>>>>>
> > >>>>>>>>>> Embedded versions of data stores for testing
> > >>>>>>>>>> --------------------
> > >>>>>>>>>> Summary: yes! But we still need real data stores to test against.
> > >>>>>>>>>>
> > >>>>>>>>>> I am a gigantic fan of using embedded versions of the various
> > >>>>>>>>>> data stores. I think we should test everything we possibly can
> > >>>>>>>>>> using them, and do the majority of our correctness testing using
> > >>>>>>>>>> embedded versions + the direct runner. However, it's also
> > >>>>>>>>>> important to have at least one test that actually connects to an
> > >>>>>>>>>> actual instance, so we can get coverage for things like
> > >>>>>>>>>> credentials, real connection strings, etc...
> > >>>>>>>>>>
> > >>>>>>>>>> The key point is that embedded versions definitely can't cover
> > >>>>>>>>>> the performance tests, so we need to host instances if we want to
> > >>>>>>>>>> test that. I consider the integration tests/performance
> > >>>>>>>>>> benchmarks to be costly things that we do only for the IO
> > >>>>>>>>>> transforms with large amounts of community support/usage. A
> > >>>>>>>>>> random IO transform used by a few users doesn't necessarily need
> > >>>>>>>>>> integration & perf tests, but for heavily used IO transforms,
> > >>>>>>>>>> there's a lot of community value in these tests. The maintenance
> > >>>>>>>>>> proposal below scales with the amount of community support for a
> > >>>>>>>>>> particular IO transform.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Reusing data stores ("use the data stores across executions.")
> > >>>>>>>>>> ------------------
> > >>>>>>>>>> Summary: I favor a hybrid approach: some frequently used, very
> > >>>>>>>>>> small instances that we keep up all the time + larger
> > >>>>>>>>>> multi-container data store instances that we spin up for perf
> > >>>>>>>>>> tests.
> > >>>>>>>>>>
> > >>>>>>>>>> I don't think we need to have a strong answer to this question,
> > >>>>>>>>>> but I think we do need to know what range of capabilities we
> > >>>>>>>>>> need, and use that to inform our requirements on the hosting
> > >>>>>>>>>> infrastructure. I think kubernetes/mesos + docker can support all
> > >>>>>>>>>> the scenarios I discuss below.
> > >>>>>>>>>>
> > >>>>>>>>>> I had been thinking of a hybrid approach - reuse some instances
> > >>>>>>>>>> and don't reuse others. Some tests require isolation from other
> > >>>>>>>>>> tests (eg. performance benchmarking), while others can easily
> > >>>>>>>>>> re-use the same database/data store instance over time, provided
> > >>>>>>>>>> they are written in the correct manner (eg. a simple read or
> > >>>>>>>>>> write correctness integration test).
> > >>>>>>>>>>
> > >>>>>>>>>> To me, the question of whether to use one instance over time for
> > >>>>>>>>>> a test vs spin up an instance for each test comes down to a
> > >>>>>>>>>> trade-off between these factors:
> > >>>>>>>>>> 1. Flakiness of spin-up of an instance - if it's super flaky,
> > >>>>>>>>>> we'll want to keep more instances up and running rather than
> > >>>>>>>>>> bring them up/down. (this may also vary by the data store in
> > >>>>>>>>>> question)
> > >>>>>>>>>> 2. Frequency of testing - if we are running tests every 5
> > >>>>>>>>>> minutes, it may be wasteful to bring machines up/down every time.
> > >>>>>>>>>> If we run tests once a day or week, it seems wasteful to keep the
> > >>>>>>>>>> machines up the whole time.
> > >>>>>>>>>> 3. Isolation requirements - if tests must be isolated, it means
> > >>>>>>>>>> we either have to bring up the instances for each test, or we
> > >>>>>>>>>> have to have some sort of signaling mechanism to indicate that a
> > >>>>>>>>>> given instance is in use. I strongly favor bringing up an
> > >>>>>>>>>> instance per test.
> > >>>>>>>>>> 4. Number/size of containers - if we need a large number of
> > >>>>>>>>>> machines for a particular test, keeping them running all the time
> > >>>>>>>>>> will use more resources.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> The major unknown to me is how flaky it'll be to spin these up.
> > >>>>>>>>>> I'm hopeful/assuming they'll be pretty stable to bring up, but I
> > >>>>>>>>>> think the best way to test that is to start doing it.
> > >>>>>>>>>>
> > >>>>>>>>>> I suspect the sweet spot is the following: have a set of very
> > >>>>>>>>>> small data store instances that stay up to support
> > >>>>>>>>>> small-data-size post-commit end to end tests (post-commits run
> > >>>>>>>>>> frequently and the data size means the instances would not use
> > >>>>>>>>>> many resources), combined with the ability to spin up larger
> > >>>>>>>>>> instances for once a day/week performance benchmarks (these use
> > >>>>>>>>>> up more resources and are used less frequently.) That's the mix
> > >>>>>>>>>> I'll propose in my docs on testing IO transforms. If spinning up
> > >>>>>>>>>> new instances is cheap/non-flaky, I'd be fine with the idea of
> > >>>>>>>>>> spinning up instances for each test.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Management ("what's the overhead of managing such a deployment")
> > >>>>>>>>>> --------------------
> > >>>>>>>>>> Summary: I propose that anyone can contribute scripts for setting
> > >>>>>>>>>> up data store instances + integration/perf tests, but if the
> > >>>>>>>>>> community doesn't maintain a particular data store's tests, we
> > >>>>>>>>>> disable the tests and turn off the data store instances.
> > >>>>>>>>>>
> > >>>>>>>>>> Management of these instances is a crucial question. First, let's
> > >>>>>>>>>> break down what tasks we'll need to do on a recurring basis:
> > >>>>>>>>>> 1. Ongoing maintenance (update to new versions, both instance &
> > >>>>>>>>>> dependencies) - we don't want to have a lot of old versions
> > >>>>>>>>>> vulnerable to attacks/buggy
> > >>>>>>>>>> 2. Investigate breakages/regressions
> > >>>>>>>>>> (I'm betting there will be more things we'll discover - let me
> > >>>>>>>>>> know if you have suggestions)
> > >>>>>>>>>>
> > >>>>>>>>>> There's a couple goals I see:
> > >>>>>>>>>> 1. We should only do sys admin work for things that give us a lot
> > >>>>>>>>>> of benefit. (ie, don't build IT/perf/data store set up scripts
> > >>>>>>>>>> for data stores without a large community)
> > >>>>>>>>>> 2. We should do as much as possible of our testing via
> > >>>>>>>>>> in-memory/embedded testing (as you brought up).
> > >>>>>>>>>> 3. Reduce the amount of manual administration overhead
> > >>>>>>>>>>
> > >>>>>>>>>> As I discussed above, I think that integration tests/performance
> > >>>>>>>>>> benchmarks are costly things that we should do only for the IO
> > >>>>>>>>>> transforms with large amounts of community support/usage. Thus, I
> > >>>>>>>>>> propose that we limit the IO transforms that get integration
> > >>>>>>>>>> tests & performance benchmarks to those that have community
> > >>>>>>>>>> support for maintaining the data store instances.
> > >>>>>>>>>>
> > >>>>>>>>>> We can enforce this organically using some simple rules:
> > >>>>>>>>>> 1. Investigating breakages/regressions: if a given
> > >>>>>>>>>> integration/perf test starts failing and no one investigates it
> > >>>>>>>>>> within a set period of time (a week?), we disable the tests and
> > >>>>>>>>>> shut off the data store instances if we have instances running.
> > >>>>>>>>>> When someone wants to step up and support it again, they can fix
> > >>>>>>>>>> the test, check it in, and re-enable the test.
> > >>>>>>>>>> 2. Ongoing maintenance: every N months, file a jira issue that is
> > >>>>>>>>>> just "is the IO Transform X data store up to date?" - if the jira
> > >>>>>>>>>> is not resolved in a set period of time (1 month?), the
> > >>>>>>>>>> perf/integration tests are disabled, and the data store instances
> > >>>>>>>>>> shut off.
> > >>>>>>>>>>
> > >>>>>>>>>> This is pretty flexible -
> > >>>>>>>>>> * If a particular person or organization wants to support an IO
> > >>>>>>>>>> transform, they can. If a group of people all organically
> > >>>>>>>>>> organize to keep the tests running, they can.
> > >>>>>>>>>> * It can be mostly automated - there's not a lot of central
> > >>>>>>>>>> organizing work that needs to be done.
> > >>>>>>>>>>
> > >>>>>>>>>> Exposing the information about what IO transforms currently have
> > >>>>>>>>>> running IT/perf benchmarks on the website will let users know
> > >>>>>>>>>> what IO transforms are well supported.
> > >>>>>>>>>>
> > >>>>>>>>>> I like this solution, but I also recognize this is a tricky
> > >>>>>>>>>> problem. This is something the community needs to be supportive
> > >>>>>>>>>> of, so I'm open to other thoughts.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Simulating failures in real nodes ("programmatic tests to
> > >>>>>>>>>> simulate failure")
> > >>>>>>>>>> -----------------
> > >>>>>>>>>> Summary: 1) Focus our testing on the code in Beam 2) We should
> > >>>>>>>>>> encourage a design pattern separating out network/retry logic
> > >>>>>>>>>> from the main IO transform logic
> > >>>>>>>>>>
> > >>>>>>>>>> We *could* create instance failure in any container management
> > >>>>>>>>>> software - we can use their programmatic APIs to determine which
> > >>>>>>>>>> containers are running the instances, and ask them to kill the
> > >>>>>>>>>> container in question. A slow node would be trickier, but I'm
> > >>>>>>>>>> sure we could figure it out - for example, add a network proxy
> > >>>>>>>>>> that would delay responses.
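> > >>>>>>>>>>
> > >>>>>>>>>> (For what it's worth, Linux traffic shaping - tc with netem - can
> > >>>>>>>>>> inject a fixed delay on a container's network interface, so a
> > >>>>>>>>>> "slow node" could likely be simulated without touching the data
> > >>>>>>>>>> store itself.)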
> > >>>>>>>>>>
> > >>>>>>>>>> However, I would argue that this type of testing doesn't gain us
> > >>>>>>>>>> a lot, and is complicated to set up. I think it will be easier to
> > >>>>>>>>>> test network errors and retry behavior in unit tests for the IO
> > >>>>>>>>>> transforms.
> > >>>>>>>>>>
> > >>>>>>>>>> Part of the way to handle this is to separate out the read code
> > >>>>>>>>>> from the network code (eg. bigtable has BigtableService). If you
> > >>>>>>>>>> put the "handle errors/retry logic" code in a separate
> > >>>>>>>>>> MySourceService class, you can test MySourceService on the wide
> > >>>>>>>>>> variety of network errors/data store problems, and then your main
> > >>>>>>>>>> IO transform tests focus on the read behavior and handling the
> > >>>>>>>>>> small set of errors the MySourceService class will return.
> > >>>>>>>>>>
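> > >>>>>>>>>> A rough sketch of that separation (MySourceService is the class
> > >>>>>>>>>> named above; the other names are hypothetical, just to show the
> > >>>>>>>>>> shape of the pattern):
> > >>>>>>>>>>
> > >>>>>>>>>>   import java.io.IOException;
> > >>>>>>>>>>   import java.util.List;
> > >>>>>>>>>>
> > >>>>>>>>>>   // Owns connections, timeouts and retries; unit-test this
> > >>>>>>>>>>   // against simulated network errors.
> > >>>>>>>>>>   interface MySourceService {
> > >>>>>>>>>>     List<String> read(String query) throws IOException;
> > >>>>>>>>>>   }
> > >>>>>>>>>>
> > >>>>>>>>>>   // The transform code only sees the interface, so its tests can
> > >>>>>>>>>>   // inject a fake service that returns canned records or throws,
> > >>>>>>>>>>   // with no real network involved.
> > >>>>>>>>>>   class MySourceReader {
> > >>>>>>>>>>     private final MySourceService service;
> > >>>>>>>>>>
> > >>>>>>>>>>     MySourceReader(MySourceService service) {
> > >>>>>>>>>>       this.service = service;
> > >>>>>>>>>>     }
> > >>>>>>>>>>
> > >>>>>>>>>>     List<String> readAll(String query) throws IOException {
> > >>>>>>>>>>       return service.read(query);
> > >>>>>>>>>>     }
> > >>>>>>>>>>   }
> > >>>>>>>>>>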
> > >>>>>>>>>> I also think we should focus on testing the IO Transform, not the
> > >>>>>>>>>> data store - if we kill a node in a data store, it's that data
> > >>>>>>>>>> store's problem, not beam's problem. As you were pointing out,
> > >>>>>>>>>> there are a *large* number of possible ways that a particular
> > >>>>>>>>>> data store can fail, and we would like to support many different
> > >>>>>>>>>> data stores. Rather than try to test that each data store behaves
> > >>>>>>>>>> well, we should ensure that we handle generic/expected errors in
> > >>>>>>>>>> a graceful manner.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Ismaël had a couple other quick comments/questions, I'll answer
> answer
> > >>>>>>>>>> here
> > >>>>>>>>>>
> > >>>>>>>>> -
> > >>>>>
> > >>>>>> We can use this to test other runners running on multiple
> > >>>>>>>>>> machines - I
> > >>>>>>>>>> agree. This is also necessary for a good performance benchmark
> > >>>>>>>>>> test.
> > >>>>>>>>>>
> > >>>>>>>>>> "providing the test machines to mount the cluster" - we can
> > >>>>>>>>>> discuss
> > >>>>>>>>>>
> > >>>>>>>>> this
> > >>>>>>
> > >>>>>>> further, but one possible option is that google may be willing to
> > >>>>>>>>>>
> > >>>>>>>>> donate
> > >>>>>>
> > >>>>>>> something to support this.
> > >>>>>>>>>>
> > >>>>>>>>>> "IO Consistency" - let's follow up on those questions in
> another
> > >>>>>>>>>>
> > >>>>>>>>> thread.
> > >>>>>>
> > >>>>>>> That's as much about the public interface we provide to users as
> > >>>>>>>>>>
> > >>>>>>>>> anything
> > >>>>>>>>
> > >>>>>>>>> else. I agree with your sentiment that a user should be able to
> > >>>>>>>>>> expect
> > >>>>>>>>>> predictable behavior from the different IO transforms.
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks for everyone's questions/comments - I really am excited
> > >>>>>>>>>> to see
> > >>>>>>>>>> that
> > >>>>>>>>>> people care about this :)
> > >>>>>>>>>>
> > >>>>>>>>>> Stephen
> > >>>>>>>>>>
> > >>>>>>>>>> On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <
> [email protected]
> > >
> > >>>>>>>>>>
> > >>>>>>>>> wrote:
> > >>>>>
> > >>>>>>>>>>> Hello,
> > >>>>>>>>>>>
> > >>>>>>>>>>> @Stephen Thanks for your proposal, it is really interesting,
> I
> > >>>>>>>>>>> would
> > >>>>>>>>>>> really
> > >>>>>>>>>>> like to help with this. I have never played with Kubernetes
> but
> > >>>>>>>>>>> this
> > >>>>>>>>>>> seems
> > >>>>>>>>>>> a really nice chance to do something useful with it.
> > >>>>>>>>>>>
> > >>>>>>>>>>> We (at Talend) are testing most of the IOs using simple
> > container
> > >>>>>>>>>>>
> > >>>>>>>>>> images
> > >>>>>>>>
> > >>>>>>>>> and in some particular cases ‘clusters’ of containers using
> > >>>>>>>>>>> docker-compose
> > >>>>>>>>>>> (a little bit like Amit’s (2) proposal). It would be really
> > >>>>>>>>>>> nice to
> > >>>>>>>>>>>
> > >>>>>>>>>> have
> > >>>>>>>>
> > >>>>>>>>> this at the Beam level, in particular to try to test more
> complex
> > >>>>>>>>>>> semantics, I don’t know how programmable kubernetes is to
> > achieve
> > >>>>>>>>>>> this for
> > >>>>>>>>>>> example:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Let’s think we have a cluster of Cassandra or Kafka nodes, I
> > >>>>>>>>>>> would
> > >>>>>>>>>>> like to
> > >>>>>>>>>>> have programmatic tests to simulate failure (e.g. kill a
> node),
> > >>>>>>>>>>> or
> > >>>>>>>>>>> simulate
> > >>>>>>>>>>> a really slow node, to ensure that the IO behaves as expected
> > >>>>>>>>>>> in the
> > >>>>>>>>>>> Beam
> > >>>>>>>>>>> pipeline for the given runner.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Another related idea is to improve IO consistency: Today the
> > >>>>>>>>>>> different IOs
> > >>>>>>>>>>> have small differences in their failure behavior, I really
> > >>>>>>>>>>> would like
> > >>>>>>>>>>> to be
> > >>>>>>>>>>> able to predict with more precision what will happen in case
> of
> > >>>>>>>>>>>
> > >>>>>>>>>> errors,
> > >>>>>>
> > >>>>>>> e.g. what is the correct behavior if I am writing to a Kafka
> > >>>>>>>>>>> node and
> > >>>>>>>>>>> there
> > >>>>>>>>>>> is a network partition, does the Kafka sink retries or no ?
> and
> > >>>>>>>>>>> what
> > >>>>>>>>>>> if it
> > >>>>>>>>>>> is the JdbcIO ?, will it work the same e.g. assuming
> > >>>>>>>>>>> checkpointing?
> > >>>>>>>>>>> Or do
> > >>>>>>>>>>> we guarantee exactly once writes somehow?, today I am not
> sure
> > >>>>>>>>>>> about
> > >>>>>>>>>>> what
> > >>>>>>>>>>> happens (or if the expected behavior depends on the runner),
> > >>>>>>>>>>> but well
> > >>>>>>>>>>> maybe
> > >>>>>>>>>>> it is just that I don’t know and we have tests to ensure
> this.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Of course both are really hard problems, but I think with
> your
> > >>>>>>>>>>> proposal we
> > >>>>>>>>>>> can try to tackle them, as well as the performance ones. And
> > >>>>>>>>>>> apart of
> > >>>>>>>>>>> the
> > >>>>>>>>>>> data stores, I think it will be also really nice to be able
> to
> > >>>>>>>>>>> test
> > >>>>>>>>>>>
> > >>>>>>>>>> the
> > >>>>>>
> > >>>>>>> runners in a distributed manner.
> > >>>>>>>>>>>
> > >>>>>>>>>>> So what is the next step? How do you imagine such integration
> > >>>>>>>>>>> tests? Who can provide the test machines so we can mount the
> > >>>>>>>>>>> cluster?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Maybe my ideas are a bit too far away for an initial setup,
> but
> > >>>>>>>>>>> it
> > >>>>>>>>>>> will be
> > >>>>>>>>>>> really nice to start working on this.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Ismaël
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <
> > >>>>>>>>>>> [email protected]
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi Stephen,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I was wondering about how we plan to use the data stores
> > across
> > >>>>>>>>>>>>
> > >>>>>>>>>>> executions.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Clearly, it's best to setup a new instance (container) for
> > every
> > >>>>>>>>>>>>
> > >>>>>>>>>>> test,
> > >>>>>>
> > >>>>>>> running a "standalone" store (say HBase/Cassandra for
> > >>>>>>>>>>>> example), and
> > >>>>>>>>>>>> once
> > >>>>>>>>>>>> the test is done, teardown the instance. It should also be
> > >>>>>>>>>>>> agnostic
> > >>>>>>>>>>>>
> > >>>>>>>>>>> to
> > >>>>>>
> > >>>>>>> the
> > >>>>>>>>>>>
> > >>>>>>>>>>>> runtime environment (e.g., Docker on Kubernetes).
> > >>>>>>>>>>>> I'm wondering though what's the overhead of managing such a
> > >>>>>>>>>>>>
> > >>>>>>>>>>> deployment
> > >>>>>>
> > >>>>>>> which could become heavy and complicated as more IOs are
> > >>>>>>>>>>>> supported
> > >>>>>>>>>>>>
> > >>>>>>>>>>> and
> > >>>>>>
> > >>>>>>> more
> > >>>>>>>>>>>
> > >>>>>>>>>>>> test cases introduced.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Another way to go would be to have small clusters of
> different
> > >>>>>>>>>>>> data
> > >>>>>>>>>>>>
> > >>>>>>>>>>> stores
> > >>>>>>>>>>>
> > >>>>>>>>>>>> and run against new "namespaces" (while lazily evicting old
> > >>>>>>>>>>>> ones),
> > >>>>>>>>>>>> but I
> > >>>>>>>>>>>> think this is less likely as maintaining a distributed
> > instance
> > >>>>>>>>>>>>
> > >>>>>>>>>>> (even
> > >>>>>
> > >>>>>> a
> > >>>>>>>>
> > >>>>>>>>> small one) for each data store sounds even more complex.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> A third approach would be to simply have an "embedded"
> > >>>>>>>>>>>> in-memory
> > >>>>>>>>>>>> instance of a data store as part of a test that runs against
> > it
> > >>>>>>>>>>>> (such as
> > >>>>>>>>>>>>
> > >>>>>>>>>>> an
> > >>>>>>>>>>>
> > >>>>>>>>>>>> embedded Kafka, though not a data store).
> > >>>>>>>>>>>> This is probably the simplest solution in terms of
> > >>>>>>>>>>>> orchestration,
> > >>>>>>>>>>>> but it
> > >>>>>>>>>>>> depends on having a proper "embedded" implementation for an
> > IO.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Does this make sense to you ? have you considered it ?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>> Amit
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <
> > >>>>>>>>>>>>
> > >>>>>>>>>>> [email protected]
> > >>>>>
> > >>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi Stephen,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> as already discussed a bit together, it sounds great !
> > >>>>>>>>>>>>> Especially I
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>> like
> > >>>>>>>>>>>
> > >>>>>>>>>>>> it as a both integration test platform and good coverage for
> > >>>>>>>>>>>>> IOs.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I'm very late on this but, as said, I will share with you
> my
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>> Marathon
> > >>>>>>
> > >>>>>>> JSON and Mesos docker images.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> By the way, I started to experiment a bit with kubernetes and
> > >>>>>>>>>>>>> swarm but it's not yet complete. I will share what I have on
> > >>>>>>>>>>>>> the same github repo.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks !
> > >>>>>>>>>>>>> Regards
> > >>>>>>>>>>>>> JB
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 11/16/2016 11:36 PM, Stephen Sisk wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi everyone!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Currently we have a good set of unit tests for our IO
> > >>>>>>>>>>>>>> Transforms -
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> those
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> tend to run against in-memory versions of the data stores.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> However,
> > >>>>>
> > >>>>>> we'd
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> like to further increase our test coverage to include
> > >>>>>>>>>>>>>> running them
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> against
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> real instances of the data stores that the IO Transforms
> > work
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> against
> > >>>>>>>>
> > >>>>>>>>> (e.g.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> cassandra, mongodb, kafka, etc…), which means we'll need
> to
> > >>>>>>>>>>>>>> have
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> real
> > >>>>>>>>
> > >>>>>>>>> instances of various data stores.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Additionally, if we want to do performance regression
> > >>>>>>>>>>>>>> detection,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> it's
> > >>>>>>>>
> > >>>>>>>>> important to have instances of the services that behave
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> realistically,
> > >>>>>>>>>>>
> > >>>>>>>>>>>> which isn't true of in-memory or dev versions of the
> services.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Proposed solution
> > >>>>>>>>>>>>>> -------------------------
> > >>>>>>>>>>>>>> If we accept this proposal, we would create an
> > >>>>>>>>>>>>>> infrastructure for
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> running
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> real instances of data stores inside of containers, using
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> container
> > >>>>>
> > >>>>>> management software like mesos/marathon, kubernetes, docker
> > >>>>>>>>>>>>>> swarm,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> etc…
> > >>>>>>>>>>>
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> manage the instances.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> This would enable us to build integration tests that run
> > >>>>>>>>>>>>>> against
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> those
> > >>>>>>>>>>>
> > >>>>>>>>>>>> real
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> instances and performance tests that run against those
> real
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> instances
> > >>>>>>>>
> > >>>>>>>>> (like
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> those that Jason Kuster is proposing elsewhere.)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Why do we need one centralized set of instances vs just
> > having
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> various
> > >>>>>>>>>>>
> > >>>>>>>>>>>> people host their own instances?
> > >>>>>>>>>>>>>> -------------------------
> > >>>>>>>>>>>>>> Reducing flakiness of tests is key. By not having
> > dependencies
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> from
> > >>>>>
> > >>>>>> the
> > >>>>>>>>>>>
> > >>>>>>>>>>>> core project on external services/instances of data stores
> > >>>>>>>>>>>>>> we have
> > >>>>>>>>>>>>>> guaranteed access to the services and the group can fix
> > issues
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> that
> > >>>>>
> > >>>>>> arise.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> An exception would be something that has an ops team
> > >>>>>>>>>>>>>> supporting it
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> (eg,
> > >>>>>>>>>>>
> > >>>>>>>>>>>> AWS, Google Cloud or other professionally managed service) -
> > >>>>>>>>>>>>>> those
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> we
> > >>>>>>>>
> > >>>>>>>>> trust
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> will be stable.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> There may be a lot of different data stores needed - how
> > >>>>>>>>>>>>>> will we
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> maintain
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> them?
> > >>>>>>>>>>>>>> -------------------------
> > >>>>>>>>>>>>>> It will take work above and beyond that of a normal set of
> > >>>>>>>>>>>>>> unit
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> tests
> > >>>>>>>>
> > >>>>>>>>> to
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> build and maintain integration/performance tests & their
> data
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> store
> > >>>>>
> > >>>>>> instances.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Setup & maintenance of the data store containers and data
> > >>>>>>>>>>>>>> store
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> instances
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> on it must be automated. It also has to be as simple of a
> > >>>>>>>>>>>>>> setup as
> > >>>>>>>>>>>>>> possible, and we should avoid hand tweaking the
> containers -
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> expecting
> > >>>>>>>>>>>
> > >>>>>>>>>>>> checked in scripts/dockerfiles is key.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Aligned with the community ownership approach of Apache,
> as
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> members
> > >>>>>
> > >>>>>> of
> > >>>>>>>>>>>
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> community are excited to contribute & maintain those tests
> > >>>>>>>>>>>>>> and the
> > >>>>>>>>>>>>>> integration/performance tests, people will be able to step
> > >>>>>>>>>>>>>> up and
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> do
> > >>>>>>
> > >>>>>>> that.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> If there is no longer support for maintaining a particular
> > >>>>>>>>>>>>>> set of
> > >>>>>>>>>>>>>> integration & performance tests and their data store
> > >>>>>>>>>>>>>> instances,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> then
> > >>>>>>
> > >>>>>>> we
> > >>>>>>>>>>>
> > >>>>>>>>>>>> can
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> disable those tests. We may document on the website what
> IO
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> Transforms
> > >>>>>>>>>>>
> > >>>>>>>>>>>> have
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> current integration/performance tests so users know what
> > >>>>>>>>>>>>>> level of
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> testing
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> the various IO Transforms have.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> What about requirements for the container management
> > software
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> itself?
> > >>>>>>>>
> > >>>>>>>>> -------------------------
> > >>>>>>>>>>>>>> * We should have the data store instances themselves in
> > >>>>>>>>>>>>>> Docker.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> Docker
> > >>>>>>>>>>>
> > >>>>>>>>>>>> allows new instances to be spun up in a quick, reproducible
> > way
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> and
> > >>>>>
> > >>>>>> is
> > >>>>>>>>>>>
> > >>>>>>>>>>>> fairly platform independent. It has wide support from a
> > >>>>>>>>>>>>>> variety of
> > >>>>>>>>>>>>>> different container management services.
> > >>>>>>>>>>>>>> * As little admin work required as possible. Crashing
> > >>>>>>>>>>>>>> instances
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> should
> > >>>>>>>>>>>
> > >>>>>>>>>>>> be
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> restarted, setup should be simple, everything possible
> > >>>>>>>>>>>>>> should be
> > >>>>>>>>>>>>>> scripted/scriptable.
> > >>>>>>>>>>>>>> * Logs and test output should be on a publicly available
> > >>>>>>>>>>>>>> website,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> without
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> needing to log into test execution machine. Centralized
> > >>>>>>>>>>>>>> capture of
> > >>>>>>>>>>>>>> monitoring info/logs from instances running in the
> > containers
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> would
> > >>>>>
> > >>>>>> support
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> this. Ideally, this would just be supported by the
> container
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> software
> > >>>>>>>>
> > >>>>>>>>> out
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> of the box.
> > >>>>>>>>>>>>>> * It'd be useful to have good persistent volume in the
> > >>>>>>>>>>>>>> container
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> management
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> software so that databases don't have to reload large data
> > >>>>>>>>>>>>>> sets
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> every
> > >>>>>>>>
> > >>>>>>>>> time.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> * The containers may be a place to execute runners
> > >>>>>>>>>>>>>> themselves if
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> we
> > >>>>>
> > >>>>>> need
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> larger runner instances, so it should play well with Spark,
> > >>>>>>>>>>>>>> Flink,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> etc…
> > >>>>>>>>>>>
> > >>>>>>>>>>>> As I discussed earlier on the mailing list, it looks like
> > >>>>>>>>>>>>>> hosting
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> docker
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> containers on kubernetes, docker swarm or mesos+marathon
> > >>>>>>>>>>>>>> would be
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> a
> > >>>>>
> > >>>>>> good
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> solution.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>> Stephen Sisk
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> --
> > >>>>>>>>>>>>> Jean-Baptiste Onofré
> > >>>>>>>>>>>>> [email protected]
> > >>>>>>>>>>>>> http://blog.nanthrax.net
> > >>>>>>>>>>>>> Talend - http://www.talend.com
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> --
> > >>>>>>>> Jean-Baptiste Onofré
> > >>>>>>>> [email protected]
> > >>>>>>>> http://blog.nanthrax.net
> > >>>>>>>> Talend - http://www.talend.com
> > >>>>>>>>
> > >>>>>>>>