Re: Fwd: [CI] What are the troubles projects face with CI and Infra

2020-02-10 Thread Tee
Another thing worth mentioning is the overhead of providing, setting up, and 
managing infrastructure. While donations of servers/VMs work to some extent, 
it’s a somewhat archaic donation policy – and this whole structure of CI gives 
both the ASF and the C* project very limited choices. From my perspective it 
would be much more suitable for the C* project to receive financial or 
credit-based donations, which we could put towards whatever CI/CD we desired. 
This would open up modern options for CI and not limit us to just ASF-run 
Jenkins. I’m sure this would also suit many smaller organisations better, with 
possible tax benefits, rather than providing infrastructure (which, let’s face 
it, is rarely bare-metal servers these days) that they then have to maintain.

For example, it’d be ideal if there was a C* CircleCI account (or deployment) 
that the project could use, funded by the community, rather than the big 
backers being the only ones able to run the tests effectively (and thus 
everyone relying on them to pay for and run them).


On 2020/02/02 21:51:54, Nate McCall  wrote: 



Re: Fwd: [CI] What are the troubles projects face with CI and Infra

2020-02-08 Thread Rahul Singh
Related to instances: can we put to use those credits that Amazon promised to 
give back to the community as part of their Amazon Managed Cassandra Service 
announcement?

Alternatively, if there is an appetite to set something up on Patreon or 
GitHub’s donation platform, it may be a good way to get the things we need 
funded based on what the community wants – business-driven demand.

Thoughts?

rahul.xavier.si...@gmail.com

http://cassandra.link
The Apache Cassandra Knowledge Base.
On Feb 3, 2020, 9:06 PM -0500, David Capwell , wrote:

Re: Fwd: [CI] What are the troubles projects face with CI and Infra

2020-02-03 Thread David Capwell
Following Mick's format =)

** Lack of trust (aka reliability)

Mick said it best, but I should also add that we have slow tests and tests
which don't do anything. Effort is needed to improve our current tests and
to make sure future tests are stable (cleanup works, isolation, etc.);
this is not a negligible amount of work, nor work which can be done by a
single person.

** Lack of resources (throughput and response)

Our slowest unit tests are around 2 minutes (materialized views), and our
slowest dtests (not high-resource) are around 30 minutes; given enough
resources we could run the unit tests in < 10 minutes and the dtests in
30-60 minutes.

There is another thing to point out: testing is also a combinatorics
problem. We support Java 8/11 (more to come), vnode and no-vnode, security
and no security, and the list goes on. Bugs are more likely to happen when
two features interact, so it is important to test against many combinations.
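
To illustrate in CI terms: a declarative Jenkins pipeline can express this
kind of matrix directly. A minimal sketch (the axes and the -D flags are
hypothetical placeholders, not our real build properties):

    pipeline {
        agent any
        stages {
            stage('tests') {
                matrix {
                    axes {
                        axis {
                            name 'JDK'
                            values '8', '11'
                        }
                        axis {
                            name 'VNODES'
                            values 'true', 'false'
                        }
                    }
                    stages {
                        stage('unit') {
                            steps {
                                // one run per JDK x vnode combination; the
                                // flags below are illustrative only
                                sh 'ant test -Dtest.jdk=$JDK -Dtest.use.vnodes=$VNODES'
                            }
                        }
                    }
                }
            }
        }
    }

Every extra axis (security on/off, etc.) multiplies the number of runs, which
is exactly why resources matter.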

There is work going on in the community to add new kinds of tests (harry,
diff, etc.); these tests require even more resources than normal tests.

** Difficulty in use

Many people rely on CircleCI as the core CI for the project, but this has a
few issues called out in other forums: the low-resource (free) tier is even
flakier than the high-resource (paid) one, and people get locked out (I have
lost access twice so far; others have said the same).

The thing which worries me the most is that new members of the project
won't have the high-resource CircleCI plan, nor do they really have access
to Jenkins. This puts a burden on new authors: they either wait 24+ hours to
run the tests... or just don't run them.

** Lack of visibility into quality

This is two things for me: commit and pre-commit.

For commit, this is more what Mick was referring to as "post-commit CI".
There are a few questions I would like answered about our current tests
(which tests are most flaky, which sections of code cause the most failures,
etc.); these are hard to answer at the moment.

We don't have a good pre-commit story, since it mostly relies on CircleCI.
Some JIRAs link to CircleCI and some don't, and if I follow a CircleCI link
months later (to see if the build was stable pre-commit), Circle fails to
show the workflow.

On Mon, Feb 3, 2020 at 3:42 PM Michael Shuler wrote:
Re: Fwd: [CI] What are the troubles projects face with CI and Infra

2020-02-03 Thread Michael Shuler
Only have a moment to respond, but Mick hit the highlights with 
containerization and parallelization; these help solve cleanup, speed, and 
cascading failures. Dynamic disposable slaves would be icing on that 
cake, though that may require a dedicated master.


One more note on jobs, or more correctly unnecessary jobs: pipelines 
have a `changeset` build condition we should tinker with. There is zero 
reason to run a job with no actual code diff. For instance, I committed 
to 2.1 this morning and merged nothing (`-s ours`) to the newer branches - 
there's really no reason to run and take up valuable resources when there 
are no actual diff changes.

https://jenkins.io/doc/book/pipeline/syntax/#built-in-conditions
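
A minimal sketch of what that could look like (the stage name, glob, and
build target here are hypothetical):

    pipeline {
        agent any
        stages {
            stage('unit-tests') {
                // built-in condition: skip the stage when the commit being
                // built touched no files matching the glob, e.g. an empty
                // `-s ours` merge
                when { changeset 'src/**' }
                steps {
                    sh 'ant test'
                }
            }
        }
    }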

Michael

On 2/3/20 3:45 PM, Nate McCall wrote:


Re: Fwd: [CI] What are the troubles projects face with CI and Infra

2020-02-03 Thread Nate McCall
Mick, this is fantastic!

I'll wait another day to see if anyone else chimes in. (Would also love to
hear from CassCI folks, and anyone else who has wrestled with this, even
for internal forks.)

On Tue, Feb 4, 2020 at 10:37 AM Mick Semb Wever  wrote:


Re: Fwd: [CI] What are the troubles projects face with CI and Infra

2020-02-03 Thread Mick Semb Wever
Nate, I leave it to you to forward what you choose to the board@ thread.


> Are there still troubles and what are they?


TL;DR
  the ASF could provide the Cassandra community with an isolated Jenkins 
installation, so that we can manage and control the Jenkins master, as well as 
ensure all donated hardware for Jenkins agents is dedicated and isolated to us.


The long writeup…

For Cassandra's use of ASF's Jenkins I see the following problems.

** Lack of trust (aka reliability)

The Jenkins agents re-use their workspaces, as opposed to using new containers 
per test run, leading to broken agents, disks, git clones, etc. One broken test 
run, or a broken agent, too easily affects subsequent test executions.

The complexity (and flakiness) around our tests is a real problem. CI on a 
project like Cassandra is a beast, and the community is very limited in what it 
can do; it really needs the help of larger companies. Effort is required in 
fixing the broken, the flaky, and the ignored tests. Parallelising the tests 
will help by better isolating failures, but tests (and their execution scripts) 
also need to be better at cleaning up after themselves, or a more 
container-based approach needs to be taken.
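
As a sketch of what a more container-based approach could mean in a
declarative pipeline (the image name, build target, and plugin use are
assumptions, not a worked-out proposal):

    pipeline {
        // throwaway container per run, so a broken build can't poison the
        // next run's workspace on the agent
        agent {
            docker { image 'openjdk:8-jdk' }
        }
        stages {
            stage('test') {
                steps { sh 'ant test' }
            }
        }
        post {
            // belt and braces: wipe the workspace whatever the outcome
            // (cleanWs() comes from the Workspace Cleanup plugin)
            always { cleanWs() }
        }
    }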
 
Another issue is that other projects sometimes use the agents, and Infra 
sometimes edits our build configurations (out of necessity).


** Lack of resources (throughput and response)

Having only 9 agents, none of which can run the large dtests, is a problem. All 
9 are from Instaclustr – much kudos! Three companies have recently said they 
will donate resources; this is work in progress.

We have four release branches on which we would like to provide per-commit 
post-commit testing. Each complete test execution currently takes 24hr+. 
Parallelising tests atm won't help much, as the agents are generally saturated 
(with the pipelines doing the top-level parallelisation). Once we get more 
hardware in place, for the sake of improving throughput, it will make sense to 
look into parallelising the tests more.

The throughput of tests will also improve with effort put into 
removing/rewriting long-running and inefficient tests. Also, and I think this 
is LHF (low-hanging fruit), throughput could be improved by using (or taking 
inspiration from) Apache Yetus so as to only run the tests relevant to the 
patch/commit. Ref: 
http://yetus.apache.org/documentation/0.11.1/precommit-basic/ 
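
A rough sketch of the idea (paths, stage names, and ant targets are made up
for illustration; Yetus does this per file type and far more thoroughly):

    pipeline {
        agent any
        stages {
            stage('select') {
                steps {
                    script {
                        // ask git what the commit touched, and only schedule
                        // the expensive dtests when server code changed
                        def changed = sh(script: 'git diff --name-only HEAD~1',
                                         returnStdout: true).trim().split('\n')
                        env.RUN_DTESTS = changed.any { it.startsWith('src/') } ? 'yes' : 'no'
                    }
                }
            }
            stage('dtest') {
                when { environment name: 'RUN_DTESTS', value: 'yes' }
                steps { sh 'ant test-long' }
            }
        }
    }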


** Difficulty in use

Jenkins is clumsy to use compared to the CI systems we use more often today: 
Travis, CircleCI, GH Actions.

One of the complaints has been that only committers can kick off CI for patches 
(i.e. pre-commit CI runs). But I don't believe this to be a crucial issue, for 
a number of reasons.

1. Thorough CI testing of a patch only needs to happen during the review 
process, in which a committer needs to be involved anyway.
2. We don't have enough Jenkins agents to handle the throughput that automated 
branch/patch/pull-request testing would require.
3. Our tests could allow unknown contributors to take ownership of the agent 
servers (e.g. via the execution of arbitrary bash scripts).
4. We have CircleCI working, which provides basic testing for work-in-progress 
patches.


Focusing on post-commit CI and having canonical results for our release 
branches, I think it then boils down to the stability and throughput of tests, 
and the persistence and permanence of results.

The persistence and permanence of results is a bugbear for me. It has been 
partially addressed by posting the build results to the builds@ ML, but this 
only provides a (pretty raw) summary of the results. I'm keen to take the next 
step of posting CI results back to committed JIRA tickets (but am waiting on 
seeing Jenkins run stable for a while). If we had our own Jenkins master we 
could then look into retaining more/all build results. Being able to see the 
longer-term trends of test results as well as execution times would, I hope, 
add the incentive to get more folk involved.
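
For what it's worth, the mechanics of posting back to a ticket needn't be much
more than a post-build step hitting JIRA's comment REST API. A hypothetical
sketch, assuming TICKET, JIRA_USER, and JIRA_TOKEN are provided to the job as
parameters/credentials:

    pipeline {
        agent any
        stages {
            stage('test') { steps { sh 'ant test' } }
        }
        post {
            always {
                // comment the build result onto the ticket via JIRA's REST API
                sh """
                    curl -s -u "\$JIRA_USER:\$JIRA_TOKEN" \\
                         -X POST -H 'Content-Type: application/json' \\
                         -d '{"body": "Jenkins build ${currentBuild.currentResult}: ${env.BUILD_URL}"}' \\
                         "https://issues.apache.org/jira/rest/api/2/issue/\$TICKET/comment"
                """
            }
        }
    }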

Looping back to the ASF and what they could do: providing us an isolated 
Jenkins would help us a lot with the stability and usability issues. Having our 
own master would simplify the setup, use, and debugging of Jenkins. It would 
still require some sunk cost, but hopefully we'd end up with something better 
tailored to our needs, and with isolated agents to help restore confidence.

regards,
Mick

PS I really want to hear from those who were involved in the past with cassci; 
your skills and experience on this topic surpass anything I've got.



On Sun, 2 Feb 2020, at 22:51, Nate McCall wrote:

Fwd: [CI] What are the troubles projects face with CI and Infra

2020-02-02 Thread Nate McCall
Hi folks,
The board is looking for feedback on CI infrastructure. I'm happy to take
some (constructive) comments back. (Shuler, Mick and David Capwell
specifically as folks who've most recently wrestled with this a fair bit).

Thanks,
-Nate

-- Forwarded message -
From: Dave Fisher 
Date: Mon, Feb 3, 2020 at 8:58 AM
Subject: [CI] What are the troubles projects face with CI and Infra
To: Apache Board 


Hi -

It has come to the attention of the board through looking at past board
reports that some projects are having problems with CI infrastructure.

Are there still troubles and what are they?

Regards,
Dave