As you may have noticed, the CI is slow again.
There are more than 140 workflows pending:
https://github.com/apache/pulsar/actions?query=is%3Aqueued
There are only 2-3 workflows in progress:
https://github.com/apache/pulsar/actions?query=is%3Ain_progress

Lari and I believe that we're still being penalized by GitHub's algorithm
for allocating resources across ASF projects.
We're still working on other optimizations to reduce the number of
requested runners and overall resource usage, but if the algorithm keeps
allowing Pulsar at most 2-3 concurrent runners, we'll never get out of
this situation.

We're waiting for updates on the GitHub Support ticket that Lari opened.
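For anyone who wants to sanity-check the queue numbers without clicking
through the UI, here is a rough sketch of tallying workflow runs by status
from the JSON shape returned by GitHub's run-list REST endpoint (the sample
payload below is made up for illustration, not real apache/pulsar data):

```python
# Sketch: tally GitHub Actions workflow runs by their "status" field.
# Real data would come from
# https://api.github.com/repos/apache/pulsar/actions/runs ; the sample
# payload here is illustrative only.
from collections import Counter

def count_by_status(payload):
    """Count runs per status ('queued', 'in_progress', 'completed', ...)."""
    return Counter(run["status"] for run in payload["workflow_runs"])

sample = {"workflow_runs": [
    {"id": 1, "status": "queued"},
    {"id": 2, "status": "queued"},
    {"id": 3, "status": "in_progress"},
]}
counts = count_by_status(sample)
print(counts["queued"], counts["in_progress"])  # 2 1
```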

Nicolò Boschi


On Fri, Sep 9, 2022 at 4:09 AM Michael Marshall <mmarsh...@apache.org>
wrote:

> Fantastic, thank you Lari and Nicolò!
>
> - Michael
>
> On Thu, Sep 8, 2022 at 9:03 PM Haiting Jiang <jianghait...@gmail.com>
> wrote:
> >
> > Great work. Thank you, Lari and Nicolò.
> >
> > BR,
> > Haiting
> >
> > On Fri, Sep 9, 2022 at 9:36 AM tison <wander4...@gmail.com> wrote:
> > >
> > > Thank you, Lari and Nicolò!
> > > Best,
> > > tison.
> > >
> > >
> > > Nicolò Boschi <boschi1...@gmail.com> wrote on Fri, Sep 9, 2022, 02:41:
> > >
> > > > Dear community,
> > > >
> > > > The plan has been executed.
> > > > The summary of our actions is:
> > > > 1. We cancelled all pending jobs (queued and in-progress)
> > > > 2. We removed the required checks so that improvements to the CI
> > > > workflow could be merged
> > > > 3. We merged a couple of improvements:
> > > >    1. worked around the possible bug triggered by job retries; broker
> > > > flaky tests now run in a dedicated workflow
> > > >    2. moved known flaky tests to the flaky suite
> > > >    3. optimized runner consumption for docs-only and cpp-only pulls
> > > > 4. We reactivated the required checks.
> > > >
> > > >
> > > > Now we can get back to normal:
> > > > 1. Rebase your branch onto the latest master (there's a button for
> > > > this in the UI), or close and reopen the pull to trigger the checks
> > > > 2. You can merge pull requests again
> > > > 3. You will find a new job in the Checks section, "Pulsar CI / Pulsar
> > > > CI checks completed", which indicates that Pulsar CI passed
> > > >
> > > > There's a slight chance that the CI will get stuck again in the next
> > > > few days, but we will keep monitoring it.
> > > >
> > > > Thanks Lari for the nice work!
> > > >
> > > > Regards,
> > > > Nicolò Boschi
> > > >
> > > >
> > > > On Thu, Sep 8, 2022 at 10:55 AM Lari Hotari <lhot...@apache.org>
> > > > wrote:
> > > >
> > > > > Thank you, Nicolò.
> > > > > There's lazy consensus; let's go forward with the action plan.
> > > > >
> > > > > -Lari
> > > > >
> > > > > On 2022/09/08 08:16:05 Nicolò Boschi wrote:
> > > > > > This is the pull for step 2:
> > > > > > https://github.com/apache/pulsar/pull/17539
> > > > > >
> > > > > > This is the script I'm going to use to cancel pending workflows.
> > > > > >
> > > > > > https://github.com/nicoloboschi/pulsar-validation-tool/blob/master/pulsar-scripts/pulsar-gha/cancel-workflows.js
> > > > > >
> > > > > > I'm going to run the script in minutes.
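The linked script is JavaScript; as a rough sketch of the selection logic
such a tool applies (hypothetical names, assuming the JSON shape of GitHub's
list-workflow-runs REST endpoint), the core is just filtering runs by status
before issuing cancel calls (POST .../actions/runs/{run_id}/cancel):

```python
# Hypothetical sketch of the cancel-pending-workflows logic; the real
# script linked above is JavaScript. Given the run list returned by
# GET /repos/{owner}/{repo}/actions/runs, select the run IDs that the
# cancel endpoint would then be called with.
def runs_to_cancel(payload, statuses=("queued", "in_progress")):
    """Return IDs of runs whose status marks them as still pending."""
    return [run["id"] for run in payload["workflow_runs"]
            if run["status"] in statuses]

payload = {"workflow_runs": [  # illustrative sample, not real runs
    {"id": 101, "status": "completed"},
    {"id": 102, "status": "queued"},
    {"id": 103, "status": "in_progress"},
]}
print(runs_to_cancel(payload))  # [102, 103]
```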
> > > > > >
> > > > > > I announced on Slack what is happening:
> > > > > >
> > > > > > https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1662624668695339?thread_ts=1662463042.016709&cid=C5ZSVEN4E
> > > > > >
> > > > > > > we're going to execute the plan described in the ML. So any
> > > > > > > queued actions will be cancelled. In order to validate your pull
> > > > > > > it is suggested to run the actions in your own Pulsar fork.
> > > > > > > Please don't re-run failed jobs or push any other commits, to
> > > > > > > avoid triggering new actions
> > > > > >
> > > > > >
> > > > > > Nicolò Boschi
> > > > > >
> > > > > >
> > > > > > On Thu, Sep 8, 2022 at 9:42 AM Nicolò Boschi
> > > > > > <boschi1...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks Lari for the detailed explanation. This is kind of an
> > > > > > > emergency situation, and I believe your plan is the way to go
> > > > > > > now.
> > > > > > >
> > > > > > > I already prepared a pull for moving the flaky suite out of the
> > > > > > > Pulsar CI workflow: https://github.com/nicoloboschi/pulsar/pull/8
> > > > > > > I can take care of executing the plan.
> > > > > > >
> > > > > > > > 1. Cancel all existing builds in_progress or queued
> > > > > > >
> > > > > > > I have a local script that uses the GitHub Actions API to check
> > > > > > > and cancel the pending runs. We can extend it to all the queued
> > > > > > > builds (I will share it soon).
> > > > > > >
> > > > > > > > 2. Edit .asf.yaml and drop the "required checks" requirement
> > > > > > > > for merging PRs.
> > > > > > > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > > >
> > > > > > > After the pull is out, we'll need to cancel all other workflows
> > > > > > > that contributors may inadvertently have triggered.
> > > > > > >
> > > > > > > > 4. Disable all workflows
> > > > > > > > 5. Process specific PRs manually to improve the situation.
> > > > > > > >    - Make GHA workflow improvements such as
> > > > > > > >      https://github.com/apache/pulsar/pull/17491 and
> > > > > > > >      https://github.com/apache/pulsar/pull/17490
> > > > > > > >    - Quarantine all very flaky tests so that everyone doesn't
> > > > > > > >      waste time with those. It should be possible to merge a
> > > > > > > >      PR even when a quarantined test fails.
> > > > > > >
> > > > > > > In this step we will merge
> > > > > > > https://github.com/nicoloboschi/pulsar/pull/8
> > > > > > >
> > > > > > > I also want to add to the list this improvement, which reduces
> > > > > > > runner usage for doc-only or cpp-only changes:
> > > > > > > https://github.com/nicoloboschi/pulsar/pull/7
> > > > > > >
> > > > > > >
> > > > > > > > 6. Rebase PRs (or close and re-open) that would be processed
> > > > > > > > next so that changes are picked up
> > > > > > >
> > > > > > > It's better to leave this task to the author of the pull, in
> > > > > > > order not to create too much load at the same time.
> > > > > > >
> > > > > > > > 7. Enable workflows
> > > > > > > > 8. Start processing PRs with checks to see if things are
> > > > > > > > handled in a better way.
> > > > > > > > 9. When things are stable, enable required checks again in
> > > > > > > > .asf.yaml; in the meantime be careful about merging PRs
> > > > > > > > 10. Fix quarantined flaky tests
> > > > > > >
> > > > > > >
> > > > > > > Nicolò Boschi
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Sep 8, 2022 at 9:27 AM Lari Hotari <lhot...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> If my assumption of a GitHub usage-metrics bug in the GitHub
> > > > > > >> Actions build-queue fairness algorithm is correct, what would
> > > > > > >> help is running the flaky unit test group outside of the Pulsar
> > > > > > >> CI workflow. In that case, the impact of the usage metrics
> > > > > > >> would be limited.
> > > > > > >>
> > > > > > >> The example at
> > > > > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > > > >> shows this flaw, as explained in the previous email. The total
> > > > > > >> reported execution time in that report is 1d 1h 40m 21s, while
> > > > > > >> the actual usage is about 1/3 of that.
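As a quick arithmetic check of that ratio (the reported figure is from the
linked usage page; the one-third estimate is the thread's, not re-measured
here):

```python
# Reported usage from the linked usage page: 1d 1h 40m 21s, in seconds.
reported_s = 1 * 86400 + 1 * 3600 + 40 * 60 + 21
print(reported_s)             # 92421 seconds (about 25.7 hours)

# The thread estimates actual usage at roughly one third of the report:
actual_s = reported_s // 3
print(actual_s // 3600, "h")  # roughly 8 h of real execution time
```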
> > > > > > >>
> > > > > > >> When we move the most commonly failing job out of the Pulsar
> > > > > > >> CI workflow, the impact of the possible usage-metrics bug will
> > > > > > >> be much smaller. I hope GitHub Support responds to my ticket
> > > > > > >> and queries about this bug. It might take up to 7 days to get
> > > > > > >> a reply, and longer for technical questions. In the meantime
> > > > > > >> we need a solution for getting over this CI slowness issue.
> > > > > > >>
> > > > > > >> -Lari
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> On 2022/09/08 06:34:42 Lari Hotari wrote:
> > > > > > >> > My current assumption about the CI slowness problem is that
> > > > > > >> > the usage metrics for Apache Pulsar builds are computed
> > > > > > >> > incorrectly on GitHub's side, and that this results in
> > > > > > >> > apache/pulsar builds getting throttled. This assumption
> > > > > > >> > might be wrong, but it's the best guess at the moment.
> > > > > > >> >
> > > > > > >> > A fact that supports this assumption: when re-running failed
> > > > > > >> > jobs in a workflow, the execution times for previously
> > > > > > >> > successful jobs get counted as if they had all run.
> > > > > > >> > Here's an example:
> > > > > > >> > https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > > > >> > The reported total usage is about 3x the actual usage.
> > > > > > >> >
> > > > > > >> > My assumption is that the "fairness algorithm" that GitHub
> > > > > > >> > uses to give all Apache projects about the same amount of
> > > > > > >> > GitHub Actions resources takes this flawed usage as the
> > > > > > >> > basis of its decisions and decides to throttle apache/pulsar
> > > > > > >> > builds.
> > > > > > >> >
> > > > > > >> > The reason we are getting hit by this now is that a high
> > > > > > >> > number of flaky test failures causes almost every build to
> > > > > > >> > fail, and we have been re-running a lot of builds.
> > > > > > >> >
> > > > > > >> > The other fact supporting the theory of flawed usage metrics
> > > > > > >> > in the fairness algorithm is that other Apache projects
> > > > > > >> > aren't reporting issues with GitHub Actions slowness. This is
> > > > > > >> > mentioned in Jarek Potiuk's comments on INFRA-23633 [1]:
> > > > > > >> > > Unlike the case 2 years ago, the problem is not affecting
> > > > > > >> > > all projects. In Apache Airflow we do not see any
> > > > > > >> > > particular slow-down with Public Runners at this moment
> > > > > > >> > > (just checked - everything is "as usual"). So I'd say it is
> > > > > > >> > > something specific to Pulsar, not to "ASF" as a whole.
> > > > > > >> >
> > > > > > >> > There are also other comments from Jarek about the GitHub
> > > > > > >> > "fairness algorithm" (comment [2], another comment [3]):
> > > > > > >> > > But I believe the current problem is different - it might
> > > > > > >> > > be (looking at your jobs) simply a bug in GA that you hit,
> > > > > > >> > > or indeed your demands are simply too high.
> > > > > > >> >
> > > > > > >> > I have opened tickets (2 tickets: 2 days ago and yesterday)
> > > > > > >> > at support.github.com and there hasn't been any response yet.
> > > > > > >> > It might take up to 7 days to get a response. We cannot rely
> > > > > > >> > on GitHub Support resolving this issue.
> > > > > > >> >
> > > > > > >> > I propose that we go ahead with the previously suggested
> > > > > > >> > action plan:
> > > > > > >> > > One possible way forward:
> > > > > > >> > > 1. Cancel all existing builds in_progress or queued
> > > > > > >> > > 2. Edit .asf.yaml and drop the "required checks"
> > > > > > >> > > requirement for merging PRs.
> > > > > > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > > >> > > 4. Disable all workflows
> > > > > > >> > > 5. Process specific PRs manually to improve the situation.
> > > > > > >> > >    - Make GHA workflow improvements such as
> > > > > > >> > >      https://github.com/apache/pulsar/pull/17491 and
> > > > > > >> > >      https://github.com/apache/pulsar/pull/17490
> > > > > > >> > >    - Quarantine all very flaky tests so that everyone
> > > > > > >> > >      doesn't waste time with those. It should be possible
> > > > > > >> > >      to merge a PR even when a quarantined test fails.
> > > > > > >> > > 6. Rebase PRs (or close and re-open) that would be
> > > > > > >> > > processed next so that changes are picked up
> > > > > > >> > > 7. Enable workflows
> > > > > > >> > > 8. Start processing PRs with checks to see if things are
> > > > > > >> > > handled in a better way.
> > > > > > >> > > 9. When things are stable, enable required checks again in
> > > > > > >> > > .asf.yaml; in the meantime be careful about merging PRs
> > > > > > >> > > 10. Fix quarantined flaky tests
> > > > > > >> >
> > > > > > >> > To clarify, steps 1-6 would optimally be done in 1 day, and
> > > > > > >> > we would stop processing ordinary PRs during this time. We
> > > > > > >> > would only handle PRs that fix the CI situation during this
> > > > > > >> > exceptional period.
> > > > > > >> >
> > > > > > >> > -Lari
> > > > > > >> >
> > > > > > >> > Links to Jarek's comments:
> > > > > > >> > [1] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> > > > > > >> > [2] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > > > > >> > [3] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > > > > >> >
> > > > > > >> > On 2022/09/07 17:01:43 Lari Hotari wrote:
> > > > > > >> > > One possible way forward:
> > > > > > >> > > 1. Cancel all existing builds in_progress or queued
> > > > > > >> > > 2. Edit .asf.yaml and drop the "required checks"
> > > > > > >> > > requirement for merging PRs.
> > > > > > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > > >> > > 4. Disable all workflows
> > > > > > >> > > 5. Process specific PRs manually to improve the situation.
> > > > > > >> > >    - Make GHA workflow improvements such as
> > > > > > >> > >      https://github.com/apache/pulsar/pull/17491 and
> > > > > > >> > >      https://github.com/apache/pulsar/pull/17490
> > > > > > >> > >    - Quarantine all very flaky tests so that everyone
> > > > > > >> > >      doesn't waste time with those. It should be possible
> > > > > > >> > >      to merge a PR even when a quarantined test fails.
> > > > > > >> > > 6. Rebase PRs (or close and re-open) that would be
> > > > > > >> > > processed next so that changes are picked up
> > > > > > >> > > 7. Enable workflows
> > > > > > >> > > 8. Start processing PRs with checks to see if things are
> > > > > > >> > > handled in a better way.
> > > > > > >> > > 9. When things are stable, enable required checks again in
> > > > > > >> > > .asf.yaml; in the meantime be careful about merging PRs
> > > > > > >> > > 10. Fix quarantined flaky tests
> > > > > > >> > >
> > > > > > >> > > -Lari
> > > > > > >> > >
> > > > > > >> > > On 2022/09/07 16:47:09 Lari Hotari wrote:
> > > > > > >> > > > The problem with CI is getting worse. The build queue is
> > > > > > >> > > > now 235 jobs, and the queue time is over 7 hours.
> > > > > > >> > > >
> > > > > > >> > > > We will need to start shedding load in the build queue
> > > > > > >> > > > and get some fixes in.
> > > > > > >> > > > https://issues.apache.org/jira/browse/INFRA-23633
> > > > > > >> > > > continues to contain details about some activities. I
> > > > > > >> > > > have created 2 GitHub Support tickets, but it usually
> > > > > > >> > > > takes up to a week to get a response.
> > > > > > >> > > >
> > > > > > >> > > > I have some assumptions about the issue, but they are
> > > > > > >> > > > just assumptions.
> > > > > > >> > > > One oddity is that when re-running failed jobs in a
> > > > > > >> > > > large workflow, the execution times for previously
> > > > > > >> > > > successful jobs get counted as if they had run.
> > > > > > >> > > > Here's an example:
> > > > > > >> > > > https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > > > >> > > > The reported usage is about 3x the actual usage.
> > > > > > >> > > > My assumption is that the "fairness algorithm" that
> > > > > > >> > > > GitHub uses to give all Apache projects about the same
> > > > > > >> > > > amount of GitHub Actions resources takes this flawed
> > > > > > >> > > > usage as the basis of its decisions.
> > > > > > >> > > > The reason we are getting hit by this now is that a
> > > > > > >> > > > high number of flaky test failures causes almost every
> > > > > > >> > > > build to fail, and we are re-running a lot of builds.
> > > > > > >> > > >
> > > > > > >> > > > Another problem is that the GitHub Actions search
> > > > > > >> > > > doesn't always show all workflow runs that are running.
> > > > > > >> > > > This has happened before, when the GitHub Actions
> > > > > > >> > > > workflow search index was corrupted. GitHub Support
> > > > > > >> > > > resolved that by rebuilding the search index with a
> > > > > > >> > > > manual admin operation behind the scenes.
> > > > > > >> > > >
> > > > > > >> > > > I'm proposing that we start shedding load from CI by
> > > > > > >> > > > cancelling build jobs and selecting which jobs to
> > > > > > >> > > > process, so that we get the CI issue resolved. We might
> > > > > > >> > > > also have to disable required checks so that we have
> > > > > > >> > > > some way to get changes merged while CI doesn't work
> > > > > > >> > > > properly.
> > > > > > >> > > >
> > > > > > >> > > > I'm expecting lazy consensus on fixing CI unless
> > > > > > >> > > > someone proposes a better plan. Let's keep everyone
> > > > > > >> > > > informed in this mailing list thread.
> > > > > > >> > > >
> > > > > > >> > > > -Lari
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > > > >> > > > > We are going to need to take action to fix our
> > > > > > >> > > > > problems. See
> > > > > > >> > > > > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > > > > >> > > > >
> > > > > > >> > > > > Jarek has done a large amount of GitHub Actions work
> > > > > > >> > > > > with Apache Airflow, and his suggestions might be
> > > > > > >> > > > > helpful. One of his suggestions was Apache Yetus. I
> > > > > > >> > > > > think he means using the Maven plugins -
> > > > > > >> > > > > https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari
> > > > > > >> > > > > > <lhot...@apache.org> wrote:
> > > > > > >> > > > > >
> > > > > > >> > > > > > The Apache Infra ticket is
> > > > > > >> > > > > > https://issues.apache.org/jira/browse/INFRA-23633 .
> > > > > > >> > > > > >
> > > > > > >> > > > > > -Lari
> > > > > > >> > > > > >
> > > > > > >> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > > > >> > > > > >> I asked Gavin McDonald for an update on the
> > > > > > >> > > > > >> Apache org GitHub Actions usage stats on the-asf
> > > > > > >> > > > > >> Slack in this thread:
> > > > > > >> > > > > >> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> I hope we get this issue resolved, since it
> > > > > > >> > > > > >> delays PR processing a lot.
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> -Lari
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > > > >> > > > > >>> Pulsar CI continues to be congested, and the
> > > > > > >> > > > > >>> build queue [1] is very long at the moment. There
> > > > > > >> > > > > >>> are 147 build jobs in the queue and 16 jobs in
> > > > > > >> > > > > >>> progress.
> > > > > > >> > > > > >>>
> > > > > > >> > > > > >>> I would strongly advise everyone to use
> > > > > > >> > > > > >>> "personal CI" to mitigate the long delay in CI
> > > > > > >> > > > > >>> feedback. You can simply open a PR to your own
> > > > > > >> > > > > >>> fork of apache/pulsar to run the builds in your
> > > > > > >> > > > > >>> "personal CI". There are more details in the
> > > > > > >> > > > > >>> previous emails in this thread.
> > > > > > >> > > > > >>>
> > > > > > >> > > > > >>> -Lari
> > > > > > >> > > > > >>>
> > > > > > >> > > > > >>> [1] - build queue:
> > > > > > >> > > > > >>> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > > > >> > > > > >>>
> > > > > > >> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > > > >> > > > > >>>> Pulsar CI continues to be congested, and the
> > > > > > >> > > > > >>>> build queue is long.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> I would strongly advise everyone to use
> > > > > > >> > > > > >>>> "personal CI" to mitigate the long delay in CI
> > > > > > >> > > > > >>>> feedback. You can simply open a PR to your own
> > > > > > >> > > > > >>>> fork of apache/pulsar to run the builds in your
> > > > > > >> > > > > >>>> "personal CI". There are more details in the
> > > > > > >> > > > > >>>> previous email in this thread.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> Some updates:
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> There has been a discussion with Gavin McDonald
> > > > > > >> > > > > >>>> from ASF Infra on the-asf Slack about getting
> > > > > > >> > > > > >>>> usage reports from GitHub to support the
> > > > > > >> > > > > >>>> investigation. The Slack thread is the same one
> > > > > > >> > > > > >>>> mentioned in the previous email,
> > > > > > >> > > > > >>>> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > > > > >> > > > > >>>> Gavin already requested the usage report in the
> > > > > > >> > > > > >>>> GitHub UI, but it produced invalid results.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> I made a change to mitigate a source of
> > > > > > >> > > > > >>>> additional GitHub Actions overhead.
> > > > > > >> > > > > >>>> In the past, each cherry-picked commit to a
> > > > > > >> > > > > >>>> maintenance branch of Pulsar has triggered a
> > > > > > >> > > > > >>>> lot of workflow runs.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> The solution for cancelling duplicate builds
> > > > > > >> > > > > >>>> automatically is to add this definition to the
> > > > > > >> > > > > >>>> workflow file:
> > > > > > >> > > > > >>>> concurrency:
> > > > > > >> > > > > >>>>   group: ${{ github.workflow }}-${{ github.ref }}
> > > > > > >> > > > > >>>>   cancel-in-progress: true
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> I added this to all maintenance-branch GitHub
> > > > > > >> > > > > >>>> Actions workflows:
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> branch-2.10 change:
> > > > > > >> > > > > >>>> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > > > >> > > > > >>>> branch-2.9 change:
> > > > > > >> > > > > >>>> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > > > >> > > > > >>>> branch-2.8 change:
> > > > > > >> > > > > >>>> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > > > >> > > > > >>>> branch-2.7 change:
> > > > > > >> > > > > >>>> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> branch-2.11 already contains the necessary
> > > > > > >> > > > > >>>> config for cancelling duplicate builds.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> The benefit of the above change is that when
> > > > > > >> > > > > >>>> multiple commits are cherry-picked to a branch
> > > > > > >> > > > > >>>> at once, only the build of the last commit will
> > > > > > >> > > > > >>>> eventually run. The builds for the intermediate
> > > > > > >> > > > > >>>> commits will get cancelled. Obviously there's a
> > > > > > >> > > > > >>>> tradeoff here: we don't find out if one of the
> > > > > > >> > > > > >>>> earlier commits breaks the build. It's a cost
> > > > > > >> > > > > >>>> we need to pay. Our build is so flaky that it's
> > > > > > >> > > > > >>>> hard to determine whether a failed build result
> > > > > > >> > > > > >>>> is caused only by a bad flaky test or is an
> > > > > > >> > > > > >>>> actual failure, so we don't lose much by
> > > > > > >> > > > > >>>> cancelling builds. It's more important to save
> > > > > > >> > > > > >>>> build resources. In the maintenance branches
> > > > > > >> > > > > >>>> for 2.10 and older, the average total build
> > > > > > >> > > > > >>>> time consumed is around 20 hours, which is a
> > > > > > >> > > > > >>>> lot.
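The cancellation behavior described above can be sketched as a toy model
(illustrative only; the real scheduling happens on GitHub's side, keyed on
the documented `concurrency` group of `${{ github.workflow }}-${{ github.ref }}`
with `cancel-in-progress: true`):

```python
# Toy model of GitHub Actions' cancel-in-progress concurrency behavior:
# within a (workflow, ref) group, a newer run cancels any older
# queued/in-progress run, so only the newest run per group survives.
def surviving_runs(runs):
    """Keep only the newest run's sha per (workflow, ref) group."""
    latest = {}
    for run in runs:  # runs listed in chronological push order
        latest[(run["workflow"], run["ref"])] = run["sha"]
    return latest

runs = [  # three cherry-picks pushed in quick succession (made-up shas)
    {"workflow": "CI", "ref": "branch-2.10", "sha": "a1"},
    {"workflow": "CI", "ref": "branch-2.10", "sha": "b2"},
    {"workflow": "CI", "ref": "branch-2.9",  "sha": "c3"},
]
result = surviving_runs(runs)
# Only b2 runs for branch-2.10 (a1 is cancelled); c3 runs for branch-2.9.
print(result)
```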
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> At this time, the overhead of
> > > > > > >> > > > > >>>> maintenance-branch builds doesn't seem to be
> > > > > > >> > > > > >>>> the source of the problems. There must be some
> > > > > > >> > > > > >>>> other issue, possibly related to exceeding a
> > > > > > >> > > > > >>>> usage quota. Hopefully we get the CI slowness
> > > > > > >> > > > > >>>> issue solved asap.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> BR,
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> Lari
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > > > >> > > > > >>>>> Hi,
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> GitHub Actions builds have been piling up in
> > > > > > >> > > > > >>>>> the build queue over the last few days.
> > > > > > >> > > > > >>>>> I posted to bui...@apache.org
> > > > > > >> > > > > >>>>> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s
> > > > > > >> > > > > >>>>> and created INFRA ticket
> > > > > > >> > > > > >>>>> https://issues.apache.org/jira/browse/INFRA-23633
> > > > > > >> > > > > >>>>> about this issue.
> > > > > > >> > > > > >>>>> There's also a thread on the-asf Slack:
> > > > > > >> > > > > >>>>> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> It seems that our build queue is finally
> > > > > > >> > > > > >>>>> getting picked up, but it would be great to
> > > > > > >> > > > > >>>>> see if we hit a quota and whether that is the
> > > > > > >> > > > > >>>>> cause of the pauses.
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> Another issue is that the master branch broke
> > > > > > >> > > > > >>>>> after merging 2 conflicting PRs.
> > > > > > >> > > > > >>>>> The fix is in
> > > > > > >> > > > > >>>>> https://github.com/apache/pulsar/pull/17300 .
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> Merging PRs will be slow until these 2
> > > > > > >> > > > > >>>>> problems are solved and existing PRs are
> > > > > > >> > > > > >>>>> rebased over the changes. Let's prioritize
> > > > > > >> > > > > >>>>> merging #17300 before pushing more changes.
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> I'd like to point out that a good way to get
> > > > > > >> > > > > >>>>> build feedback before sending a PR is to run
> > > > > > >> > > > > >>>>> builds on your personal GitHub Actions CI. The
> > > > > > >> > > > > >>>>> benefit is that it doesn't consume the shared
> > > > > > >> > > > > >>>>> quota, and builds usually start instantly.
> > > > > > >> > > > > >>>>> There are instructions in the contributor
> > > > > > >> > > > > >>>>> guide about this:
> > > > > > >> > > > > >>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > > > >> > > > > >>>>> You simply open PRs to your own fork of
> > > > > > >> > > > > >>>>> apache/pulsar to run builds on your personal
> > > > > > >> > > > > >>>>> GitHub Actions CI.
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> BR,
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> Lari
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>
> > > > > > >> > > > > >>
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
>
