Re: [URGENT] Reducing our usage of GitHub Runners

raiden00pl Wed, 23 Oct 2024 02:50:52 -0700

> If compilation errors increase, this just means one thing: Some
> developers send untested code changes that were not verified properly
> before submission by said developers.


I agree, but is verification of every aspect of the OS possible for a
single contributor? I don't think so. Just compiling all possible
configurations affected by the change seems unrealistic, and verifying
all affected boards on HW is even less feasible. Therefore, a hardware
test farm would be a great help, but I doubt it'll ever happen without
support from the companies relying on NuttX.

> The pace of contributions is too high. Instead it should match what
> reviewers and maintainers can maintain.
> My solution is human: Develop slower. Aiming for careless growth is not
> a good thing in general.

This makes sense, but comes with a huge risk: the project may die.
Slow development -> companies less interested in the project -> fewer
contributors/reviewers -> even slower development
NuttX is not copyleft, so the incentive to contribute changes is lower than
in e.g. Linux.

I don't think we even have enough reviewers to cover every part of the OS.
Take, for example, chip ports added by a committer who appears here a few
times a year: code that no one else knows or has any way to test.

> If a shared open source project cannot follow the path of the most
> active developers, then these developers should work on their own fork,
> and only submit proper contributions to upstream.

I think that's exactly how it works now. But we're back to the problem I'm
talking about: NuttX is too complicated to test and verify all possible
implications of a given change without automation. Contributor changes are
most likely being tested in some way, but testing all possible cases is not
physically possible.

> That's how Linux works. If you sent non-functional pull requests to Linus
> Torvalds, you would be flamed for sending garbage.

Yeah, we are too nice here :)

On Wed, 23 Oct 2024 at 10:59, Sebastien Lorquet <[email protected]> wrote:

> Hi,
>
> This is a complex topic and I do not think it can be solved by tech only.
>
> If compilation errors increase, this just means one thing: Some
> developers send untested code changes that were not verified properly
> before submission by said developers.
>
> Developers sending bad code sounds unacceptable to me.
>
>
> The pace of contributions is too high. Instead it should match what
> reviewers and maintainers can maintain.
>
> My solution is human: Develop slower. Aiming for careless growth is not
> a good thing in general.
>
>
> If a shared open source project cannot follow the path of the most
> active developers, then these developers should work on their own fork,
> and only submit proper contributions to upstream.
>
> That's how Linux works. If you sent non-functional pull requests to Linus
> Torvalds, you would be flamed for sending garbage.
>
>
> That's how it should be done here, imho.
>
> The solution is not more resources (you will never get them), it's less
> depletion of available resources.
>
>
> Sebastien
>
>
> On 23/10/2024 10:35, raiden00pl wrote:
> > Sebastien, the practice of recent days shows something completely
> > different. Without CI coverage,
> > compilation errors become common. Building all the configurations locally
> > to verify the changes will take
> > ages on most machines, and building for different host OSes is often not
> > possible for users.
> >
> > With such a complex project as NuttX, with many Kconfig options and
> > dependencies, such a trivial
> > thing as breaking the compilation is a HUGE problem.
> > Take all the NuttX features, multiply them across all the architectures
> > and boards and you have a project that is
> > impossible to track without automation and with such a small team.
> >
> > If you could propose a better solution (and implement it), everyone would
> > be happy.
> > Until then, we have what we have and it doesn't look like it will get any
> > better.
> > Although verification of simple changes has been greatly improved
> > recently thanks to Lup,
> > so one-line PRs affecting certain parts of the OS (like boards and archs)
> > should be much faster to verify.
> >
> > On Wed, 23 Oct 2024 at 10:06, Sebastien Lorquet <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> Maybe I'm not the only one thinking that more than 6 hours of build
> >> checks for one-liner pull requests is excessive?
> >>
> >> More so when said hours of work test nothing of the actual effect of
> >> these changes.
> >>
> >> :):):)
> >>
> >> Sebastien
> >>
> >>
> >> On 22/10/2024 15:49, Alan C. Assis wrote:
> >>> Hi Nathan,
> >>>
> >>> Thank you for the link. I don't know if this Pulsar solution will
> >>> alleviate the CI actions limitation that we are facing.
> >>>
> >>> I think someone from Apache needs to answer these questions Lup raised
> >>> here:
> >>> https://github.com/apache/nuttx/issues/14376#issuecomment-2428107029
> >>> "Why are all ASF Projects subjected to the same quotas? And why can't
> >>> we increase the quota if we happen to have additional funding?"
> >>>
> >>> Many projects are not using it at all and still have the same quota
> >>> as NuttX (the 5th most active project under the Apache umbrella).
> >>>
> >>> I remember Greg said that when moving to Apache we would have all the
> >>> resources we had long been looking for, like: CI, hardware test
> >>> integration, funding for our events, travel assistance, etc.
> >>>
> >>> BR,
> >>>
> >>> Alan
> >>>
> >>> On Tue, Oct 22, 2024 at 10:18 AM Nathan Hartman <[email protected]> wrote:
> >>>
> >>>> Hi folks,
> >>>>
> >>>> The following email was posted to builds@ today and might contain
> >>>> something relevant to reducing our GitHub runners? Forwarded message
> >>>> below...
> >>>>
> >>>> [1]
> >>>> https://lists.apache.org/thread/pnvt9b80dnovlqmrf5n10ylcf9q3pcxq
> >>>>
> >>>> ---------- Forwarded message ---------
> >>>> From: Lari Hotari <[email protected]>
> >>>> Date: Tue, Oct 22, 2024 at 7:08 AM
> >>>> Subject: Sharing Apache Pulsar's CI solution for Docker image sharing
> >>>> with GitHub Actions Artifacts within a single workflow
> >>>> To: <[email protected]>
> >>>>
> >>>>
> >>>> Hi all,
> >>>>
> >>>> Just in case it's useful for someone else, in Apache Pulsar, there's a
> >>>> GitHub Actions-based CI workflow that creates a Docker image and runs
> >>>> integration tests and system tests with it. In Pulsar, we have an
> >>>> extremely large Docker image for system tests; it's over 1.7GiB when
> >>>> compressed with zstd. Building this image takes over 20 minutes, so we
> >>>> want to share the image within a single build workflow. GitHub Artifacts
> >>>> are the recommended way to share files between jobs in a single workflow,
> >>>> as explained in the GitHub Actions documentation:
> >>>>
> >>>> https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-and-sharing-data-from-a-workflow
> >>>>
> >>>> To share the Docker image within a single build workflow, we use GitHub
> >>>> Artifacts upload/download with a custom CLI tool that uses the
> >>>> GitHub-provided JavaScript libraries for interacting with the GitHub
> >>>> Artifacts backend API. The benefit of the CLI tool for GitHub Actions
> >>>> Artifacts is that it can upload from stdin and download to stdout.
> >>>> Sharing the Docker images in the GitHub Actions workflow is simply done
> >>>> with the CLI tool and standard "docker load" and "docker save" commands.
> >>>>
> >>>> These are the shell script functions that Apache Pulsar uses:
> >>>>
> >>>> https://github.com/apache/pulsar/blob/1344167328c31ea39054ec2a6019f003fb8bab50/build/pulsar_ci_tool.sh#L82-L101
> >>>>
> >>>> In Pulsar CI, the command for saving the image is:
> >>>>
> >>>> docker save ${image} | zstd | pv -ft -i 5 | pv -Wbaf -i 5 | timeout 20m gh-actions-artifact-client.js upload --retentionDays=$ARTIFACT_RETENTION_DAYS "${artifactname}"
> >>>>
> >>>> For restoring, the command used is:
> >>>>
> >>>> timeout 20m gh-actions-artifact-client.js download "${artifactname}" | pv -batf -i 5 | unzstd | docker load
> >>>> The throughput is very impressive. Transfer speed can exceed 180MiB/s
> >>>> when uploading the Docker image, and downloads are commonly over
> >>>> 100MiB/s in apache/pulsar builds. It's notable that the transfer
> >>>> includes the execution of "docker load" and "docker save" since it's
> >>>> directly operating on stdin and stdout.
> >>>>
> >>>> Examples:
> >>>> upload:
> >>>> https://github.com/apache/pulsar/actions/runs/11454093832/job/31880154863#step:15:26
> >>>> download:
> >>>> https://github.com/apache/pulsar/actions/runs/11454093832/job/31880164467#step:9:20
> >>>> Since GitHub Artifacts doesn't provide an official CLI tool, I have
> >>>> written a GitHub Action for that purpose. It's available at
> >>>> https://github.com/lhotari/gh-actions-artifact-client.
> >>>> When you use the action, it will install the CLI tool available as
> >>>> "gh-actions-artifact-client.js" in the PATH of the runner so that it's
> >>>> available in subsequent build steps. In Apache Pulsar, we fork external
> >>>> actions to our own repository, so we use the version forked to
> >>>> https://github.com/apache/pulsar-test-infra.
> >>>>
> >>>> In Pulsar, we have been using this solution successfully for several
> >>>> years. I recently upgraded the action to support the GitHub Actions
> >>>> Artifacts API v4, as earlier API versions will be removed after
> >>>> November 30th.
> >>>>
> >>>> I hope this helps other projects that face similar CI challenges as
> >>>> Pulsar has. Please let me know if you need any help in using a similar
> >>>> solution for your Apache project's CI.
> >>>>
> >>>> -Lari
> >>>>
> >>>> (end of forwarded message)
> >>>>
> >>>> WDYT? Relevant to us?
> >>>>
> >>>> Cheers,
> >>>> Nathan
> >>>>
> >>>> On Thu, Oct 17, 2024 at 2:10 AM Lee, Lup Yuen <[email protected]> wrote:
> >>>>> Hi All: We have an ultimatum to reduce (drastically) our usage of
> >>>>> GitHub Actions. Or our Continuous Integration will halt totally in Two
> >>>>> Weeks. Here's what I'll implement within 24 hours for `nuttx` and
> >>>>> `nuttx-apps` repos:
> >>>>>
> >>>>> (1) When we submit or update a Complex PR that affects All
> >>>>> Architectures (Arm, RISC-V, Xtensa, etc): CI Workflow shall run only
> >>>>> half the jobs. Previously CI Workflow ran `arm-01` to `arm-14`, now we
> >>>>> will run only `arm-01` to `arm-07`. (This will reduce GitHub Cost by 32%)
> >>>>>
> >>>>> (2) When the Complex PR is Merged: CI Workflow will still run all jobs
> >>>>> `arm-01` to `arm-14`
> >>>>>
> >>>>> (3) For NuttX Admins: We shall have only Four Scheduled Merge Jobs per
> >>>>> day. Which means I shall quickly cancel any Merge Jobs that appear. Then
> >>>>> at 00:00 / 06:00 / 12:00 / 18:00 UTC: I shall restart the Latest Merge
> >>>>> Job that I cancelled. (This will reduce GitHub Cost by 17%)
> >>>>>
> >>>>> (4) macOS and Windows Jobs (msys2 / msvc): They shall be totally
> >>>>> disabled until we find a way to manage their costs. (GitHub charges 10x
> >>>>> premium for macOS runners, 2x premium for Windows runners!)
> >>>>>
> >>>>> We have done an Analysis of CI Jobs over the past 24 hours:
> >>>>>
> >>>>> - Many CI Jobs are Incomplete: We waste GitHub Runners on jobs that
> >>>>> eventually get superseded and cancelled
> >>>>>
> >>>>> - When we Halve the CI Jobs: We reduce the wastage of GitHub Runners
> >>>>>
> >>>>> - Scheduled Merge Jobs will also reduce wastage of GitHub Runners,
> >>>>> since most Merge Jobs don't complete (only 1 completed yesterday)
> >>>>>
> >>>>> Please check out the analysis below. And let's discuss further in this
> >>>>> NuttX Issue. Thanks!
> >>>>>
> >>>>> https://github.com/apache/nuttx/issues/14376
> >>>>>
> >>>>> Lup
> >>>>>
> >>>>>
> >>>>>>> ---------- Forwarded message ---------
> >>>>>>> From: Daniel Gruno <[email protected]>
> >>>>>>> Date: Wed, Oct 16, 2024 at 12:08 PM
> >>>>>>> Subject: [WARNING] All NuttX builds to be turned off by October 30th
> >>>>>>> UNLESS...
> >>>>>>> To: <[email protected]>
> >>>>>>> Cc: ASF Infrastructure <[email protected]>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hello again, NuttX folks.
> >>>>>>> This is a formal notice that your CI builds are far exceeding the
> >>>>>>> maximum resource use set out by our CI policies[1]. As you are
> >>>>>>> currently exceeding your limits by more than 300%[2] and have not
> >>>>>>> shown any signs of decreasing, we will be disabling GitHub Actions
> >>>>>>> for your project on October 30th unless you manage to get the usage
> >>>>>>> under control and below the established limits of 25 full-time
> >>>>>>> runners in a single week.
> >>>>>>>
> >>>>>>> If you have any further questions, feel free to reach out to us at
> >>>>>>> [email protected]
> >>>>>>>
> >>>>>>> With regards,
> >>>>>>> Daniel on behalf of ASF Infra.
> >>>>>>>
> >>>>>>>
> >>>>>>> [1] https://infra.apache.org/github-actions-policy.html
> >>>>>>> [2] https://infra-reports.apache.org/#ghactions&project=nuttx
> >>>>>>>
>
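
For reference, the limit quoted in the ASF Infra notice above ("25 full-time
runners in a single week") can be turned into runner-hours with simple
arithmetic. The sketch below is only an illustration: the function names and
the sample usage figures are hypothetical, and the per-OS weights are taken
from Lup's remark about the 10x macOS and 2x Windows premiums. Whether ASF
Infra applies such weights when measuring the quota is not stated anywhere in
this thread, so treat that part as an assumption.

```python
# Illustrative arithmetic only. Function names and sample figures are
# hypothetical; they do not come from ASF Infra or from the NuttX CI scripts.

HOURS_PER_WEEK = 24 * 7
FULL_TIME_RUNNERS_LIMIT = 25  # figure quoted in the ASF Infra notice

# Relative cost per runner type, per Lup's message (Linux taken as 1x).
# ASSUMPTION: the thread does not say that the quota is weighted this way.
COST_MULTIPLIER = {"linux": 1, "windows": 2, "macos": 10}


def weekly_budget_hours(runners=FULL_TIME_RUNNERS_LIMIT):
    """Runner-hours available per week if `runners` run around the clock."""
    return runners * HOURS_PER_WEEK


def weighted_hours(hours_by_os):
    """Sum of raw runner-hours, each scaled by its cost multiplier."""
    return sum(COST_MULTIPLIER[name] * hours
               for name, hours in hours_by_os.items())


def usage_percent(used_hours, budget_hours=None):
    """Usage expressed as a percentage of the weekly budget."""
    if budget_hours is None:
        budget_hours = weekly_budget_hours()
    return 100.0 * used_hours / budget_hours


if __name__ == "__main__":
    print(weekly_budget_hours())  # 25 * 168 = 4200 runner-hours per week

    # Hypothetical week of usage, purely to show how the helpers combine.
    sample = {"linux": 100, "windows": 10, "macos": 5}
    print(weighted_hours(sample))
    print(usage_percent(weighted_hours(sample)))
```

Under that weighting assumption, one hour on a macOS runner counts the same
as ten hours on a Linux runner, which is consistent with Lup's decision to
disable the macOS and Windows jobs first.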
