Re: Fwd: [CI] What are the troubles projects face with CI and Infra

David Capwell Mon, 03 Feb 2020 18:06:34 -0800

Following Mick's format =)

** Lack of trust (aka reliability)


Mick said it best, but I should also add that we have slow tests and tests
which don't do anything.  Effort is needed to improve our current tests and
to make sure future tests are stable (cleanup works, isolation, etc.);
this is not a negligible amount of work, nor work which can be done by a
single person.

** Lack of resources (throughput and response)

Our slowest unit tests are around 2 minutes (materialized views), our
slowest dtests (not high resource) are around 30 minutes; given enough
resources we could run unit in < 10 minutes and dtest in 30-60 minutes.
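
A rough way to reason about "given enough resources": wall-clock time can
never drop below the slowest single test, no matter how many executors are
added.  Below is a minimal Python sketch, assuming a greedy longest-first
packing of tests onto executors.  The function name and the packing strategy
are my own illustration, not how our build actually schedules tests.

```python
import heapq


def estimated_wall_clock(durations, workers):
    """Estimate wall-clock time for running tests on `workers` executors.

    Greedy longest-first packing: each test goes to the executor that is
    currently the least loaded. This is an approximation, not a scheduler.
    """
    if workers < 1:
        raise ValueError("need at least one worker")
    loads = [0.0] * workers
    heapq.heapify(loads)
    for duration in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return max(loads)
```

With hypothetical durations of [2, 1, 1, 1, 1] minutes, one executor needs 6
minutes, and three or more executors need 2 minutes, which is the slowest
test.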

There is another thing to point out: testing is also a combinatorics
problem.  We support Java 8/11 (more to come), vnode and no-vnode, security
and no security, and the list goes on.  Bugs are more likely to happen when
two features interact, so it is important to test against many combinations.

There is work going on in the community to add new kinds of tests (harry,
diff, etc.); these tests require even more resources than normal tests.

** Difficulty in use

Many people rely on CircleCI as the core CI for the project, but this has a
few issues called out in other forums: the low resource version (free) is
even more flaky than high (paid), and people get locked out (I have lost
access twice so far, others have said the same).

The thing which worries me the most is that new members to the project
won't have the high resource CircleCI plan, nor do they really have access
to Jenkins.  This puts a burden on new authors: they either wait 24+ hours
to run the tests... or just don't run them.

** Lack of visibility into quality

This is two things for me: commit and pre-commit.

For commit, this is more what Mick was referring to as "post-commit CI".
There are a few questions I would like answered about our current tests
(which tests are most flaky, which sections of code cause the most failures,
etc.); these are hard to answer at the moment.
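
As an illustration of the kind of report I mean, here is a minimal Python
sketch that ranks tests by flakiness.  It assumes we had per-run records of
(test name, commit, passed) and treats a test as flaky on a commit when it
both passed and failed there.  The record shape and the `flaky_report` name
are hypothetical; neither Jenkins nor CircleCI hands us data in this form
today.

```python
from collections import defaultdict


def flaky_report(runs):
    """Rank tests by how often they both passed and failed on the same commit.

    `runs` is an iterable of (test_name, commit_sha, passed) tuples; this
    record shape is hypothetical, not a Jenkins or CircleCI format.
    """
    outcomes = defaultdict(set)  # (test, commit) -> set of observed results
    for test, commit, passed in runs:
        outcomes[(test, commit)].add(bool(passed))

    flaky_commits = defaultdict(int)  # test -> commits with mixed results
    for (test, _commit), seen in outcomes.items():
        if len(seen) == 2:
            flaky_commits[test] += 1

    # Most flaky first; ties broken by name for stable output.
    return sorted(flaky_commits.items(), key=lambda kv: (-kv[1], kv[0]))
```

A test that only ever passes, or only ever fails, on a given commit is not
counted, since a consistent failure points at the code rather than the test.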

We don't have a good pre-commit story since it mostly relies on CircleCI.
I find that some JIRAs link CircleCI and some don't.  I find that if I
follow the CircleCI link months later (to see if the build was stable
pre-commit) that Circle fails to show the workflow.

On Mon, Feb 3, 2020 at 3:42 PM Michael Shuler <mich...@pbandjelly.org>
wrote:

> Only have a moment to respond, but Mick hit the highlights with
> containerization, parallelization, these help solve cleanup, speed, and
> cascading failures. Dynamic disposable slaves would be icing on that
> cake, which may require a dedicated master.
>
> One more note on jobs, or more correctly unnecessary jobs - pipelines
> have a `changeset` build condition we should tinker with. There is zero
> reason to run a job with no actual code diff. For instance, I committed
> to 2.1 this morning and merged `-s ours` nothing to the newer branches -
> there's really no reason to run and take up valuable resources with no
> actual diff changes.
> https://jenkins.io/doc/book/pipeline/syntax/#built-in-conditions
>
> Michael
>
> On 2/3/20 3:45 PM, Nate McCall wrote:
> > Mick, this is fantastic!
> >
> > I'll wait another day to see if anyone else chimes in. (Would also love
> to
> > hear from CassCI folks, anyone else really who has wrestled with this
> even
> > for internal forks).
> >
> > On Tue, Feb 4, 2020 at 10:37 AM Mick Semb Wever <m...@apache.org> wrote:
> >
> >> Nate, I leave it to you to forward what-you-chose to the board@'s
> thread.
> >>
> >>
> >>> Are there still troubles and what are they?
> >>
> >>
> >> TL;DR
> >>    the ASF could provide the Cassandra community with an isolated
> jenkins
> >> installation: so that we can manage and control the Jenkins master,  as
> >> well as ensure all donated hardware for Jenkins agents are dedicated and
> >> isolated to us.
> >>
> >>
> >> The long writeup…
> >>
> >> For Cassandra's use of ASF's Jenkins I see the following problems.
> >>
> >> ** Lack of trust (aka reliability)
> >>
> >> The Jenkins agents re-use their workspaces, as opposed to using new
> >> containers per test run, leading to broken agents, disks, git clones,
> etc.
> >> One broken test run, or a broken agent, too easily affects subsequent
> test
> >> executions.
> >>
> >> The complexity (and flakiness) around our tests is a real problem.  CI
> on
> >> a project like Cassandra is a beast and the community is very limited in
> >> what it can do, it really needs the help of larger companies. Effort is
> >> required in fixing the broken, the flakey, and the ignored tests.
> >> Parallelising the tests will help by better isolating failures, but
> tests
> >> (and their execution scripts) also need to be better at cleaning up
> after
> >> themselves, or a more container approach needs to be taken.
> >>
> >> Another issue is that other projects sometimes use the agents, and
> Infra
> >> sometimes edits our build configurations (out of necessity).
> >>
> >>
> >> ** Lack of resources (throughput and response)
> >>
> >> Having only 9 agents, none of which can run the large dtests, is a
> >> problem. All 9 are from Instaclustr, much kudos! Three companies
> recently
> >> have said they will donate resources, this is work in progress.
> >>
> >> We have four release branches where we would like to provide per-commit
> >> post-commit testing. Each complete test execution currently takes 24hr+.
> >> Parallelising tests atm won't help much as the agents are generally
> >> saturated (with the pipelines doing the top-level parallelisation).
> Once we
> >> get more hardware in place, for the sake of improving throughput, it
> will
> >> make sense to look into parallelising the tests more.
> >>
> >> The throughput of tests will also improve with effort put into
> >> removing/rewriting long running and inefficient tests. Also, and i think
> >> this is LHF, throughput could be improved by using (or taking
> inspiration
> >> from) Apache Yetus so as to only run tests on what is relevant in the
> >> patch/commit. Ref:
> >> http://yetus.apache.org/documentation/0.11.1/precommit-basic/
> >>
> >>
> >> ** Difficulty in use
> >>
> >> Jenkins is clumsy to use compared to the CI systems we use more often
> >> today: Travis, CircleCI, GH Actions.
> >>
> >> One of the complaints has been that only committers can kick off CI for
> >> patches (ie pre-commit CI runs).  But I don't believe this to be a
> crucial
> >> issue for a number of reasons.
> >>
> >> 1. Thorough CI testing of a patch only needs to happen during the review
> >> process, to which a committer needs to be involved in anyway.
> >> 2.  We don't have enough jenkins agents to handle the amount of
> throughput
> >> that automated branch/patch/pull-request testing would require.
> >> 3. Our tests could allow unknown contributors to take ownership of the
> >> agent servers (eg via the execution of bash scripts).
> >> 4. We have CircleCI working that provides basic testing for
> >> work-in-progress patches.
> >>
> >>
> >> Focusing on post-commit CI and having canonical results for our release
> >> branches, i think then it boils down to the stability and throughput of
> >> tests, and the persistence and permanence of results.
> >>
> >> The persistence and permanence of results is a bug bear for me. It has
> >> been partially addressed with posting the build results to the builds@
> >> ML. But this only provides a (pretty raw) summary of the results. I'm
> keen
> >> to take the next step of the posting of CI results back to committed
> jira
> >> tickets (but am waiting on seeing Jenkins run stable for a while).  If
> we
> >> had our own Jenkins master we could then look into retaining more/all
> build
> >> results. Being able to see the longer term trends of test results as
> well
> >> as execution times I hope would add the incentive to get more folk
> involved.
> >>
> >> Looping back to the ASF and what they could do: it would help us a lot
> in
> >> improving the stability and usability issues by providing us an isolated
> >> jenkins. Having our own master would simplify the setup, use and
> debugging,
> >> of Jenkins. It would still require some sunk cost but hopefully we'd
> end up
> >> with something better tailored to our needs. And with isolated agents
> help
> >> restore confidence.
> >>
> >> regards,
> >> Mick
> >>
> >> PS i really want to hear from those that were involved in the past with
> >> cassci, your skills and experience on this topic surpass anything i got.
> >>
> >>
> >>
> >> On Sun, 2 Feb 2020, at 22:51, Nate McCall wrote:
> >>> Hi folks,
> >>> The board is looking for feedback on CI infrastructure. I'm happy to
> take
> >>> some (constructive) comments back. (Shuler, Mick and David Capwell
> >>> specifically as folks who've most recently wrestled with this a fair
> >> bit).
> >>>
> >>> Thanks,
> >>> -Nate
> >>>
> >>> ---------- Forwarded message ---------
> >>> From: Dave Fisher <w...@apache.org>
> >>> Date: Mon, Feb 3, 2020 at 8:58 AM
> >>> Subject: [CI] What are the troubles projects face with CI and Infra
> >>> To: Apache Board <bo...@apache.org>
> >>>
> >>>
> >>> Hi -
> >>>
> >>> It has come to the attention of the board through looking at past board
> >>> reports that some projects are having problems with CI infrastructure.
> >>>
> >>> Are there still troubles and what are they?
> >>>
> >>> Regards,
> >>> Dave
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> >> For additional commands, e-mail: dev-h...@cassandra.apache.org
> >>
> >>
> >
>
