Hi,

This is a complex topic, and I do not think it can be solved by technology alone.

If compilation errors increase, it means only one thing: some developers are submitting untested code changes that they did not verify properly before submission.

Developers sending bad code sounds unacceptable to me.


The pace of contributions is too high. It should instead match what reviewers and maintainers can handle.

My solution is human: develop more slowly. Aiming for careless growth is not a good thing in general.


If a shared open source project cannot keep up with the pace of its most active developers, then those developers should work on their own fork, and only submit properly prepared contributions to upstream.

That's how Linux works. If you sent non-functional pull requests to Linus Torvalds, you would be flamed for sending garbage.


That's how it should be done here, imho.

The solution is not more resources (you will never get them), it's less depletion of available resources.


Sebastien


On 23/10/2024 10:35, raiden00pl wrote:
Sebastien, the practice of recent days shows something completely
different. Without CI coverage,
compilation errors become common. Building all the configurations locally
to verify the changes will take
ages on most machines, and building for different host OSes is often not
possible for users.

With such a complex project as NuttX, with many Kconfig options and
dependencies, such a trivial
thing as breaking the compilation is a HUGE problem.
Take all the NuttX features, multiply them across all the architectures and
boards and you have a project that is
impossible to track without automation and with such a small team.

If you could propose a better solution (and implement it), everyone would
be happy.
Until then, we have what we have and it doesn't look like it will get any
better.
Verification of simple changes has been greatly improved recently
thanks to Lup, though, so one-line PRs affecting certain parts of the OS
(like boards and archs) should be much faster to verify.

Wed, 23 Oct 2024 at 10:06 Sebastien Lorquet <sebast...@lorquet.fr>
wrote:

Hi,

Maybe I'm not the only one thinking that more than 6 hours of build
checks for one-liner pull requests is excessive?

Even more so when those hours of work verify nothing of the actual
effect of the changes.

:):):)

Sebastien


On 22/10/2024 15:49, Alan C. Assis wrote:
Hi Nathan,

Thank you for the link. I don't know whether this Pulsar solution will
alleviate the CI actions limitation that we are facing.

I think someone from Apache needs to answer these questions Lup raised
here:
https://github.com/apache/nuttx/issues/14376#issuecomment-2428107029
"Why are all ASF Projects subjected to the same quotas? And why can't we
increase the quota if we happen to have additional funding?"

Many projects are not using it at all and still have the same quota as
NuttX (the 5th most active project under the Apache umbrella).

I remember Greg said that when we moved to Apache we would have all the
resources we had been looking for for a long time: CI, hardware test
integration, funding for our events, travel assistance, etc.

BR,

Alan

On Tue, Oct 22, 2024 at 10:18 AM Nathan Hartman <
hartman.nat...@gmail.com>
wrote:

Hi folks,

The following email was posted to builds@ today and might contain
something
relevant to reducing our GitHub runners? Forwarded message below...

[1]
https://lists.apache.org/thread/pnvt9b80dnovlqmrf5n10ylcf9q3pcxq

---------- Forwarded message ---------
From: Lari Hotari <lhot...@apache.org>
Date: Tue, Oct 22, 2024 at 7:08 AM
Subject: Sharing Apache Pulsar's CI solution for Docker image sharing
with
GitHub Actions Artifacts within a single workflow
To: <bui...@apache.org>


Hi all,

Just in case it's useful for someone else, in Apache Pulsar, there's a
GitHub Actions-based CI workflow that creates a Docker image and runs
integration tests and system tests with it. In Pulsar, we have an
extremely
large Docker image for system tests; it's over 1.7GiB when compressed
with
zstd. Building this image takes over 20 minutes, so we want to share the
image within a single build workflow. GitHub Artifacts are the
recommended
way to share files between jobs in a single workflow, as explained in
the
GitHub Actions documentation:


https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-and-sharing-data-from-a-workflow

To share the Docker image within a single build workflow, we use GitHub
Artifacts upload/download with a custom CLI tool that uses the
GitHub-provided JavaScript libraries for interacting with the GitHub
Artifacts backend API. The benefit of the CLI tool for GitHub Actions
Artifacts is that it can upload from stdin and download to stdout.
Sharing
the Docker images in the GitHub Actions workflow is simply done with the
CLI tool and standard "docker load" and "docker save" commands.

These are the shell script functions that Apache Pulsar uses:


https://github.com/apache/pulsar/blob/1344167328c31ea39054ec2a6019f003fb8bab50/build/pulsar_ci_tool.sh#L82-L101
In Pulsar CI, the command for saving the image is:

  docker save ${image} | zstd | pv -ft -i 5 | pv -Wbaf -i 5 | \
    timeout 20m gh-actions-artifact-client.js upload \
      --retentionDays=$ARTIFACT_RETENTION_DAYS "${artifactname}"

For restoring, the command used is:

  timeout 20m gh-actions-artifact-client.js download "${artifactname}" | \
    pv -batf -i 5 | unzstd | docker load
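The save and restore commands above follow one stream-through pattern: the
image never touches an intermediate file, it flows straight through the
compressor into the artifact client's stdin (and back out of its stdout).
A minimal, runnable sketch of that pattern is below. Everything in it is
illustrative: the stand-in functions are NOT the real
gh-actions-artifact-client.js (a plain file stands in for the artifact
backend), and gzip stands in for zstd so it runs anywhere.

```shell
# Sketch of the stream-through-artifact pattern (assumptions: the
# stand-in functions below are invented for illustration; gzip
# replaces zstd for portability).

# Stand-in for "gh-actions-artifact-client.js upload <name>":
# consume stdin into the "artifact store" (here, a temp file).
upload_stream() { cat > "/tmp/artifact_$1"; }

# Stand-in for "gh-actions-artifact-client.js download <name>":
# emit the stored artifact on stdout.
download_stream() { cat "/tmp/artifact_$1"; }

# "Save" side: in real CI this is `docker save ${image} | zstd | upload`.
echo "pretend-docker-image-tarball" | gzip | upload_stream img

# "Restore" side: in real CI this is `download | unzstd | docker load`.
download_stream img | gunzip
```

Because every stage reads stdin and writes stdout, compression, progress
metering (pv), and timeouts can be spliced into the pipe without changing
either endpoint.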

The throughput is very impressive. Transfer speed can exceed 180MiB/s
when
uploading the Docker image, and downloads are commonly over 100MiB/s in
apache/pulsar builds. It's notable that the transfer includes the
execution
of "docker load" and "docker save" since it's directly operating on
stdin
and stdout.
Examples:
upload:


https://github.com/apache/pulsar/actions/runs/11454093832/job/31880154863#step:15:26
download:


https://github.com/apache/pulsar/actions/runs/11454093832/job/31880164467#step:9:20
Since GitHub Artifacts doesn't provide an official CLI tool, I have
written
a GitHub Action for that purpose. It's available at
https://github.com/lhotari/gh-actions-artifact-client.
When you use the action, it will install the CLI tool available as
"gh-actions-artifact-client.js" in the PATH of the runner so that it's
available in subsequent build steps. In Apache Pulsar, we fork external
actions to our own repository, so we use the version forked to
https://github.com/apache/pulsar-test-infra.

In Pulsar, we have been using this solution successfully for several
years.
I recently upgraded the action to support the GitHub Actions Artifacts
API
v4, as earlier API versions will be removed after November 30th.

I hope this helps other projects that face CI challenges similar to
those Pulsar has had. Please let me know if you need any help using a
similar solution for your Apache project's CI.

-Lari

(end of forwarded message)

WDYT? Relevant to us?

Cheers,
Nathan

On Thu, Oct 17, 2024 at 2:10 AM Lee, Lup Yuen <lu...@appkaki.com>
wrote:
Hi All: We have an ultimatum to reduce (drastically) our usage of GitHub
Actions, or our Continuous Integration will halt totally in Two Weeks.
Here's what I'll implement within 24 hours for `nuttx` and `nuttx-apps`
repos:

(1) When we submit or update a Complex PR that affects All
Architectures
(Arm, RISC-V, Xtensa, etc): CI Workflow shall run only half the jobs.
Previously CI Workflow will run `arm-01` to `arm-14`, now we will run
only
`arm-01` to `arm-07`. (This will reduce GitHub Cost by 32%)

(2) When the Complex PR is Merged: CI Workflow will still run all jobs
`arm-01` to `arm-14`

(3) For NuttX Admins: We shall have only Four Scheduled Merge Jobs per
day.
Which means I shall quickly cancel any Merge Jobs that appear. Then at
00:00 / 06:00 / 12:00 / 18:00 UTC: I shall restart the Latest Merge Job
that I cancelled.  (This will reduce GitHub Cost by 17%)

(4) macOS and Windows Jobs (msys2 / msvc): They shall be totally
disabled
until we find a way to manage their costs. (GitHub charges 10x premium
for
macOS runners, 2x premium for Windows runners!)
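The halving in item (1) amounts to selecting a job list by event type:
PR builds get the first half of the arm jobs, merge builds get the full
set. A hypothetical shell sketch of that selection is below; `arm-01` to
`arm-14` are the job names mentioned above, but the `jobs_for` helper and
its event names are invented for illustration and are not the actual
NuttX workflow code (which lives in the GitHub Actions matrix).

```shell
# Hypothetical helper: pick the CI job list based on the event type.
# Only the arm-01..arm-14 naming comes from the message above.
jobs_for() {
  case "$1" in
    pr)    seq -f "arm-%02g" 1 7  ;;  # PR builds: first half only
    merge) seq -f "arm-%02g" 1 14 ;;  # merge builds: full set
  esac
}

jobs_for pr    # prints arm-01 through arm-07, one per line
```

This keeps the cheap check (half coverage) on the hot PR path while the
full matrix still runs once, on merge, as item (2) describes.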

We have done an Analysis of CI Jobs over the past 24 hours:

- Many CI Jobs are Incomplete: We waste GitHub Runners on jobs that
eventually get superseded and cancelled

- When we Halve the CI Jobs: We reduce the wastage of GitHub Runners

- Scheduled Merge Jobs will also reduce wastage of GitHub Runners,
since
most Merge Jobs don't complete (only 1 completed yesterday)

Please check out the analysis below. And let's discuss further in this
NuttX Issue. Thanks!

https://github.com/apache/nuttx/issues/14376

Lup


---------- Forwarded message ---------
From: Daniel Gruno <humbed...@apache.org>
Date: Wed, Oct 16, 2024 at 12:08 PM
Subject: [WARNING] All NuttX builds to be turned off by October 30th
UNLESS...
To: <priv...@nuttx.apache.org>
Cc: ASF Infrastructure <priv...@infra.apache.org>


Hello again, NuttX folks.
This is a formal notice that your CI builds are far exceeding the
maximum resource use set out by our CI policies[1]. As you are currently
exceeding your limits by more than 300%[2] and have not shown any signs
of decreasing, we will be disabling GitHub Actions for your project on
October 30th unless you manage to get the usage under control and below
the established limit of 25 full-time runners in a single week.

If you have any further questions, feel free to reach out to us at
priv...@infra.apache.org

With regards,
Daniel on behalf of ASF Infra.


[1] https://infra.apache.org/github-actions-policy.html
[2] https://infra-reports.apache.org/#ghactions&project=nuttx
