Hi All: We're modifying NuttX CI (Continuous Integration) and GitHub
Actions to comply with ASF Policy. Unfortunately, these changes will
extend the Build Duration for a NuttX Pull Request by roughly 15 mins, from
2 hours to 2.25 hours.

Lemme explain: Right now, every NuttX Pull Request triggers 24 Concurrent
Jobs (GitHub Runners), executed in parallel:
https://lupyuen.github.io/articles/ci

According to ASF Policy: We should run at most 15 Concurrent Jobs:
https://infra.apache.org/github-actions-policy.html

Thus we'll cut down the Concurrent Jobs from 24 to 15: that's 12 Linux
Jobs, 2 macOS and 1 Windows. (Each job takes 30 mins to 2 hours)

For Phase 1:
https://lupyuen.github.io/articles/ci#appendix-phase-1-of-ci-upgrade

(1) Right now our "Linux > Strategy" is a flat list of 20 Linux Jobs, all
executed in parallel...

      matrix:
        boards: [arm-01, arm-02, arm-03, arm-04, arm-05, arm-06, arm-07,
                 arm-08, arm-09, arm-10, arm-11, arm-12, arm-13, other,
                 risc-v-01, risc-v-02, sim-01, sim-02, xtensa-01, xtensa-02]

(2) We change "Linux > Strategy" to prioritise by Target Architecture, and
limit to 12 concurrent jobs...

      max-parallel: 12
      matrix:
        boards: [
          arm-01, other, risc-v-01, sim-01, xtensa-01,
          arm-02, risc-v-02, sim-02, xtensa-02,
          arm-03, arm-04, arm-05, arm-06, arm-07, arm-08,
          arm-09, arm-10, arm-11, arm-12, arm-13
        ]

(3) So NuttX CI will initially execute 12 Build Jobs across Arm32, Arm64,
RISC-V, Simulator and Xtensa. As they complete, NuttX CI will execute the
remaining 8 Build Jobs (for Arm32).

(4) This will extend the Overall Build Duration from 2 hours to 2.25 hours
(link above)

(5) We also limit macOS Jobs to 2, Windows Jobs to 1. Here's the Draft PR,
please lemme know what you think: https://github.com/apache/nuttx/pull/13412
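(For reference, the macOS cap might look something like this in the
workflow. This is only a sketch: the board names below are hypothetical,
the actual names and strategy are in the Draft PR.)

      # Hypothetical sketch: at most 2 macOS runners at any time
      max-parallel: 2
      matrix:
        boards: [macos-01, macos-02, macos-03]

(Windows runs a single job, so it needs no max-parallel at all.)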

For Phase 2:
https://lupyuen.github.io/articles/ci#appendix-phase-2-of-ci-upgrade

We should "rebalance" the Build Targets: move the Newer, Higher Priority or
Riskier Targets to arm-01, risc-v-01, sim-01, xtensa-01. Hopefully this
will allow NuttX CI to Fail Faster on breaking changes, prevent unnecessary
builds, and reduce waiting time.
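As a sketch, the rebalanced "Linux > Strategy" might front-load the riskier
targets. (The grouping of boards into each batch below is hypothetical and
still to be decided.)

      max-parallel: 12
      matrix:
        boards: [
          # Hypothetical: newer / riskier targets build first,
          # so a breaking change fails the CI run early
          arm-01, risc-v-01, sim-01, xtensa-01, other,
          # Stable, rarely-broken targets queue behind them
          arm-02, arm-03, arm-04, arm-05, arm-06, arm-07,
          arm-08, arm-09, arm-10, arm-11, arm-12, arm-13,
          risc-v-02, sim-02, xtensa-02
        ]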

For Phase 3:
https://lupyuen.github.io/articles/ci#appendix-phase-3-of-ci-upgrade

We should migrate most of the NuttX Targets to a Daily Job for Build and
Test. Please check out the discussion below.
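(A Daily Job could be a separate scheduled workflow. Here's a minimal
sketch; the cron time, board list and build script are placeholders, not
the actual design.)

    name: Daily Build and Test
    on:
      schedule:
        - cron: '0 0 * * *'    # once a day, 00:00 UTC (placeholder time)
      workflow_dispatch:       # allow manual runs as well
    jobs:
      linux:
        runs-on: ubuntu-latest
        strategy:
          matrix:
            boards: [arm-01, arm-02]    # placeholder: full target list goes here
        steps:
          - uses: actions/checkout@v4
          - run: ./tools/ci/build.sh ${{ matrix.boards }}    # placeholder build step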

Lup

On Wed, Sep 11, 2024 at 11:02 PM Lee, Lup Yuen <lu...@appkaki.com> wrote:

> << For PRs, depending on the directory (directories) of the modified
> file(s), pick and choose which tests to run. >>
>
> Thanks Nathan for the cool ideas! I was thinking: If the Modified Source
> File is shared by Multiple Targets, then which NuttX Target do we build and
> test?
>
> Maybe we could take the NuttX Target ELF and `objdump` the Arm / RISC-V
> Disassembly, producing the Source Pathnames. Then we could figure out which
> NuttX Target depends on which Source File?
>
> <<  In addition to GitHub Actions, the ASF also offers BuildBot and
> Jenkins. In order to continue testing all ~1600 configurations every day,
> we could adopt one of these systems to make one full test run nightly. >>
>
> Yep we could rewrite our GitHub Actions workflows for BuildBot and
> Jenkins. Alternatively: Could we exploit this GitHub Actions Loophole...
>
> Suppose we fork the NuttX Repo into our Personal GitHub Accounts. Any
> GitHub Runners that we trigger in our repo will NOT be counted in the ASF
> Quota for GitHub Runners!
>
> So we could have a bunch of Personal GitHub Accounts running NuttX Builds
> and Tests every day. We could create a system to distribute / scatter /
> gather the Builds and Tests across the "crowd-sourced" accounts?
>
> Lup
>
> On Wed, Sep 11, 2024 at 4:17 AM Nathan Hartman <hartman.nat...@gmail.com>
> wrote:
>
>> Thank you Lup!
>>
>> I have been thinking about ways to reduce the number of builds and compute
>> costs while still getting good (or even better) test coverage:
>>
>> For PRs, depending on the directory (directories) of the modified file(s),
>> pick and choose which tests to run. We already do this for Documentation vs
>> all others, so it could be expanded to make the logic more fine-grained.
>>
>> Obviously things like sched and upper half drivers can affect all builds,
>> but not all PRs touch those.
>>
>> Many PRs fix a specific architecture or board related issue.
>>
>> Whenever a PR is limited to a board directory, only the configs in that
>> directory should be tested.
>>
>> When a PR is limited to an arch, we could run tests for all boards in that
>> arch but that seems wasteful. Perhaps we could choose one board from each
>> arch and only test it? It would have to be the most feature-packed board in
>> that arch to get acceptable test coverage. As a special case, if a PR
>> affects both an arch and a board within that arch, test the affected board.
>>
>> Another idea is perhaps to use some kind of round-robin approach: test only
>> one board per PR test run, but use a different board each time. Eventually
>> all boards get tested. Yes, I know that issues won't be caught immediately,
>> but the commit range will be known (within the last ~1600 PRs merged) and
>> git bisect can find the specific commit with only a few tests. This is a
>> cost/benefit decision. Also, see below:
>>
>> 2.) In addition to GitHub Actions, the ASF also offers BuildBot and Jenkins.
>>
>> In order to continue testing all ~1600 configurations every day, we could
>> adopt one of these systems to make one full test run nightly.
>>
>> This way, instead of running ~1600 builds multiple times per day at high
>> cost (and, for those builds that end up being redundant with no effective
>> difference, high cost and low benefit), we could instead run all the
>> configurations once per day, get virtually the same amount of benefit, and
>> greatly reduce the compute cost.
>>
>> We could pick an off-peak time of day for the tests. Or, we could ask Infra
>> when Jenkins or BuildBot tend to be quiet and schedule our "nightly" (could
>> be morning or afternoon depending on where you live) tests for that time.
>>
>> One downside to this approach is that some broken PRs may be merged and not
>> caught until the next day, so we may have a little bit of breakage. It
>> remains to be seen how much impact that could actually cause. This can be
>> addressed in various ways, which we can discuss if it becomes a problem in
>> practice.
>>
>> Thoughts?
>>
>> Cheers,
>> Nathan
>>
>> On Tue, Sep 10, 2024 at 9:51 AM Lee, Lup Yuen <lu...@appkaki.com> wrote:
>>
>> > This article explains how we're running Continuous Integration with
>> > GitHub Actions. Every NuttX Pull Request will trigger 1,594 NuttX Builds!
>> >
>> > https://lupyuen.codeberg.page/articles/ci.html
>> >
>> > In my next message: I'll discuss how we might cut down the NuttX Builds
>> > for Continuous Integration. Stay Tuned!
>> >
>> > Lup
>> >
>>
>
