CI automation is VERY important for a project of this scale, with incoming changes of varying quality. The stream of new contributions proves the project is growing fast, and new hardware platforms that we want to support show up rapidly. Limited CI resources will impact project quality, because CI really helps to catch most of the common problems across various build hosts and various configurations.
Also, helping with PR reviews is most welcome. From what I can see, various aspects of the project can be discovered that way, which leads to better understanding and improved reviews later on. If I don't understand something, I either do not review it or I ask someone who may know the details best :-)

Some alternative would be nice to have though, like distributed test farms with runtime tests on real boards. We talked about that; I have a basic setup in progress, but I can probably bring more focus to it in 2025Q1, as I am currently overloaded with work in several areas (as we all probably are).

Long story short: I do not think that shrinking CI is a good idea (except for what we are forced to shrink), and we should not slow down the project because of limited resources. We should focus on better-quality releases, just as we do with 12.7.0: we delay the release because of important fixes and features that will land there, and this is what matters most for current and new end-users. Nowadays (and for me personally) it is crucial that trusted things work out of the box, even if they lag a bit behind the so-called "bleeding edge", and I agree that things tend to change too fast to build anything sensible out of them :-)

Have a good day folks :-)

Tomek

On Wed, Oct 23, 2024 at 11:51 AM raiden00pl <raiden0...@gmail.com> wrote:
>
> > If compilation errors increase, this just means one thing: Some
> > developers send untested code changes that were not verified properly
> > before submission by said developers.
>
> I agree, but is verification of every aspect of the OS possible for a
> single contributor? I don't think so. Just compiling all possible
> configurations affected by a change seems unrealistic, and verifying all
> affected boards on hardware is even less feasible. A hardware test farm
> would therefore be a great help, but I doubt it will ever happen without
> support from the companies relying on NuttX.
>
> > The pace of contributions is too high.
> > Instead it should match what reviewers and maintainers can maintain.
> >
> > My solution is human: Develop slower. Aiming for careless growth is not
> > a good thing in general.
>
> This makes sense, but comes with a huge risk: the project may die.
> Slow development -> companies less interested in the project -> fewer
> contributors/reviewers -> even slower development.
> NuttX is not copyleft, so the incentive to contribute changes is lower
> than in, e.g., Linux.
>
> I don't think we even have enough reviewers to cover every part of the
> OS. Consider ports of chips added by a committer who appears here a few
> times a year: code that no one else knows and has no way to test.
>
> > If a shared open source project cannot follow the path of the most
> > active developers, then these developers should work on their own fork,
> > and only submit proper contributions to upstream.
>
> I think that's exactly how it works now. But we're back to the problem
> I'm talking about: NuttX is too complicated to test and verify all
> possible implications of a given change without automation. Contributor
> changes are most likely being tested in some way, but testing all
> possible cases is not physically possible.
>
> > That's how Linux works. If you sent non-functional pull requests to
> > Linus Torvalds, you would be flamed for sending garbage.
>
> Yeah, we are too nice here :)
>
> On Wed, 23 Oct 2024 at 10:59, Sebastien Lorquet <sebast...@lorquet.fr>
> wrote:
>
> > Hi,
> >
> > This is a complex topic and I do not think it can be solved by tech
> > only.
> >
> > If compilation errors increase, this just means one thing: Some
> > developers send untested code changes that were not verified properly
> > before submission by said developers.
> >
> > Developers sending bad code sounds unacceptable to me.
> >
> > The pace of contributions is too high. Instead it should match what
> > reviewers and maintainers can maintain.
> >
> > My solution is human: Develop slower.
> > Aiming for careless growth is not a good thing in general.
> >
> > If a shared open source project cannot follow the path of the most
> > active developers, then these developers should work on their own fork,
> > and only submit proper contributions to upstream.
> >
> > That's how Linux works. If you sent non-functional pull requests to
> > Linus Torvalds, you would be flamed for sending garbage.
> >
> > That's how it should be done here, imho.
> >
> > The solution is not more resources (you will never get them), it's less
> > depletion of available resources.
> >
> > Sebastien
> >
> > On 23/10/2024 10:35, raiden00pl wrote:
> > > Sebastien, the practice of recent days shows something completely
> > > different. Without CI coverage, compilation errors become common.
> > > Building all the configurations locally to verify a change would take
> > > ages on most machines, and building for different host OSes is often
> > > not possible for users.
> > >
> > > With such a complex project as NuttX, with many Kconfig options and
> > > dependencies, even such a trivial thing as breaking the compilation
> > > is a HUGE problem. Take all the NuttX features, multiply them across
> > > all the architectures and boards, and you have a project that is
> > > impossible to track without automation and with such a small team.
> > >
> > > If you could propose a better solution (and implement it), everyone
> > > would be happy. Until then, we have what we have, and it doesn't look
> > > like it will get any better. That said, verification of simple
> > > changes has been greatly improved recently thanks to Lup, so one-line
> > > PRs affecting certain parts of the OS (like boards and archs) should
> > > be much faster to verify.
> > > On Wed, 23 Oct 2024 at 10:06, Sebastien Lorquet <sebast...@lorquet.fr>
> > > wrote:
> >
> >> Hi,
> >>
> >> Maybe I'm not the only one thinking that more than 6 hours of build
> >> checks for a one-liner pull request is excessive?
> >>
> >> More so when said hours of work test nothing of the actual effect of
> >> these changes.
> >>
> >> :):):)
> >>
> >> Sebastien
> >>
> >> On 22/10/2024 15:49, Alan C. Assis wrote:
> >>> Hi Nathan,
> >>>
> >>> Thank you for the link. I don't know if this Pulsar solution will
> >>> alleviate the CI actions limitation that we are facing.
> >>>
> >>> I think someone from Apache needs to answer the questions Lup raised
> >>> here:
> >>> https://github.com/apache/nuttx/issues/14376#issuecomment-2428107029
> >>> "Why are all ASF Projects subjected to the same quotas? And why
> >>> can't we increase the quota if we happen to have additional funding?"
> >>>
> >>> Many projects are not using it at all and still have the same quota
> >>> as NuttX (the 5th most active project under the Apache umbrella).
> >>>
> >>> I remember Greg said that when moving to Apache we would have all
> >>> the resources we had been looking for for a long time, like: CI,
> >>> hardware test integration, funding for our events, travel
> >>> assistance, etc.
> >>>
> >>> BR,
> >>>
> >>> Alan
> >>>
> >>> On Tue, Oct 22, 2024 at 10:18 AM Nathan Hartman
> >>> <hartman.nat...@gmail.com> wrote:
> >>>
> >>>> Hi folks,
> >>>>
> >>>> The following email was posted to builds@ today and might contain
> >>>> something relevant to reducing our GitHub runners? Forwarded
> >>>> message below...
> > >>>> > > >>>> [1] > > >>>> https://lists.apache.org/thread/pnvt9b80dnovlqmrf5n10ylcf9q3pcxq > > >>>> > > >>>> ---------- Forwarded message --------- > > >>>> From: Lari Hotari <lhot...@apache.org> > > >>>> Date: Tue, Oct 22, 2024 at 7:08 AM > > >>>> Subject: Sharing Apache Pulsar's CI solution for Docker image sharing > > >> with > > >>>> GitHub Actions Artifacts within a single workflow > > >>>> To: <bui...@apache.org> > > >>>> > > >>>> > > >>>> Hi all, > > >>>> > > >>>> Just in case it's useful for someone else, in Apache Pulsar, there's a > > >>>> GitHub Actions-based CI workflow that creates a Docker image and runs > > >>>> integration tests and system tests with it. In Pulsar, we have an > > >> extremely > > >>>> large Docker image for system tests; it's over 1.7GiB when compressed > > >> with > > >>>> zstd. Building this image takes over 20 minutes, so we want to share > > the > > >>>> image within a single build workflow. GitHub Artifacts are the > > >> recommended > > >>>> way to share files between jobs in a single workflow, as explained in > > >> the > > >>>> GitHub Actions documentation: > > >>>> > > >>>> > > >> > > https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-and-sharing-data-from-a-workflow > > >>>> . > > >>>> > > >>>> To share the Docker image within a single build workflow, we use > > GitHub > > >>>> Artifacts upload/download with a custom CLI tool that uses the > > >>>> GitHub-provided JavaScript libraries for interacting with the GitHub > > >>>> Artifacts backend API. The benefit of the CLI tool for GitHub Actions > > >>>> Artifacts is that it can upload from stdin and download to stdout. > > >> Sharing > > >>>> the Docker images in the GitHub Actions workflow is simply done with > > the > > >>>> CLI tool and standard "docker load" and "docker save" commands. 
> >>>> These are the shell script functions that Apache Pulsar uses:
> >>>> https://github.com/apache/pulsar/blob/1344167328c31ea39054ec2a6019f003fb8bab50/build/pulsar_ci_tool.sh#L82-L101
> >>>>
> >>>> In Pulsar CI, the command for saving the image is:
> >>>> docker save ${image} | zstd | pv -ft -i 5 | pv -Wbaf -i 5 | timeout 20m gh-actions-artifact-client.js upload --retentionDays=$ARTIFACT_RETENTION_DAYS "${artifactname}"
> >>>>
> >>>> For restoring, the command used is:
> >>>> timeout 20m gh-actions-artifact-client.js download "${artifactname}" | pv -batf -i 5 | unzstd | docker load
> >>>>
> >>>> The throughput is very impressive. Transfer speed can exceed
> >>>> 180MiB/s when uploading the Docker image, and downloads are commonly
> >>>> over 100MiB/s in apache/pulsar builds. It's notable that the
> >>>> transfer includes the execution of "docker load" and "docker save",
> >>>> since it operates directly on stdin and stdout.
> >>>>
> >>>> Examples:
> >>>> upload:
> >>>> https://github.com/apache/pulsar/actions/runs/11454093832/job/31880154863#step:15:26
> >>>> download:
> >>>> https://github.com/apache/pulsar/actions/runs/11454093832/job/31880164467#step:9:20
> >>>>
> >>>> Since GitHub Artifacts doesn't provide an official CLI tool, I have
> >>>> written a GitHub Action for that purpose. It's available at
> >>>> https://github.com/lhotari/gh-actions-artifact-client.
> >>>> When you use the action, it installs the CLI tool as
> >>>> "gh-actions-artifact-client.js" in the PATH of the runner so that
> >>>> it's available in subsequent build steps. In Apache Pulsar, we fork
> >>>> external actions to our own repository, so we use the version forked
> >>>> to https://github.com/apache/pulsar-test-infra.
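[The streaming save/restore pattern Lari describes can be sketched in plain shell. This is only an illustrative stand-in, not Pulsar's actual tooling: `tar` plays the role of `docker save`/`docker load`, `gzip` stands in for `zstd`, and a local file stands in for the `gh-actions-artifact-client.js` upload/download. The point it demonstrates is that the image is streamed through the pipe, with no temporary archive written between the steps.]

```shell
# Illustrative stand-in for Pulsar's streaming pattern (hypothetical names):
#   tar  ~ docker save / docker load
#   gzip ~ zstd
#   a local file ~ gh-actions-artifact-client.js upload / download
set -e
workdir=$(mktemp -d)
mkdir -p "$workdir/image"
echo "layer-data" > "$workdir/image/layer.bin"

# "save" side: archive and compress in one pipeline, no intermediate tarball
tar -C "$workdir" -cf - image | gzip > "$workdir/artifact.gz"

# "load" side: decompress and unpack as a stream
mkdir "$workdir/restore"
gzip -dc "$workdir/artifact.gz" | tar -C "$workdir/restore" -xf -

restored=$(cat "$workdir/restore/image/layer.bin")
echo "$restored"   # prints "layer-data"
```

[In the real workflow, the middle of the pipe is the artifact client writing to or reading from the GitHub Artifacts backend, which is what lets "docker save"/"docker load" run concurrently with the transfer instead of waiting for a file.]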
> >>>> In Pulsar, we have been using this solution successfully for
> >>>> several years. I recently upgraded the action to support the GitHub
> >>>> Actions Artifacts API v4, as earlier API versions will be removed
> >>>> after November 30th.
> >>>>
> >>>> I hope this helps other projects that face CI challenges similar to
> >>>> Pulsar's. Please let me know if you need any help in using a
> >>>> similar solution for your Apache project's CI.
> >>>>
> >>>> -Lari
> >>>>
> >>>> (end of forwarded message)
> >>>>
> >>>> WDYT? Relevant to us?
> >>>>
> >>>> Cheers,
> >>>> Nathan
> >>>>
> >>>> On Thu, Oct 17, 2024 at 2:10 AM Lee, Lup Yuen <lu...@appkaki.com>
> >>>> wrote:
> >>>>> Hi All: We have an ultimatum to reduce (drastically) our usage of
> >>>>> GitHub Actions, or our Continuous Integration will halt totally in
> >>>>> Two Weeks. Here's what I'll implement within 24 hours for `nuttx`
> >>>>> and `nuttx-apps` repos:
> >>>>>
> >>>>> (1) When we submit or update a Complex PR that affects All
> >>>>> Architectures (Arm, RISC-V, Xtensa, etc): the CI Workflow shall run
> >>>>> only half the jobs. Previously the CI Workflow would run `arm-01`
> >>>>> to `arm-14`; now we will run only `arm-01` to `arm-07`. (This will
> >>>>> reduce GitHub Cost by 32%)
> >>>>>
> >>>>> (2) When the Complex PR is Merged: the CI Workflow will still run
> >>>>> all jobs `arm-01` to `arm-14`.
> >>>>>
> >>>>> (3) For NuttX Admins: We shall have only Four Scheduled Merge Jobs
> >>>>> per day. This means I shall quickly cancel any Merge Jobs that
> >>>>> appear. Then at 00:00 / 06:00 / 12:00 / 18:00 UTC I shall restart
> >>>>> the Latest Merge Job that I cancelled. (This will reduce GitHub
> >>>>> Cost by 17%)
> >>>>>
> >>>>> (4) macOS and Windows Jobs (msys2 / msvc): They shall be totally
> >>>>> disabled until we find a way to manage their costs.
> >>>>> (GitHub charges a 10x premium for macOS runners and a 2x premium
> >>>>> for Windows runners!)
> >>>>>
> >>>>> We have done an Analysis of CI Jobs over the past 24 hours:
> >>>>>
> >>>>> - Many CI Jobs are Incomplete: We waste GitHub Runners on jobs
> >>>>>   that eventually get superseded and cancelled
> >>>>>
> >>>>> - When we Halve the CI Jobs: We reduce the wastage of GitHub
> >>>>>   Runners
> >>>>>
> >>>>> - Scheduled Merge Jobs will also reduce wastage of GitHub Runners,
> >>>>>   since most Merge Jobs don't complete (only 1 completed yesterday)
> >>>>>
> >>>>> Please check out the analysis below, and let's discuss further in
> >>>>> this NuttX Issue. Thanks!
> >>>>>
> >>>>> https://github.com/apache/nuttx/issues/14376
> >>>>>
> >>>>> Lup
> >>>>>
> >>>>>>> ---------- Forwarded message ---------
> >>>>>>> From: Daniel Gruno <humbed...@apache.org>
> >>>>>>> Date: Wed, Oct 16, 2024 at 12:08 PM
> >>>>>>> Subject: [WARNING] All NuttX builds to be turned off by October
> >>>>>>> 30th UNLESS...
> >>>>>>> To: <priv...@nuttx.apache.org>
> >>>>>>> Cc: ASF Infrastructure <priv...@infra.apache.org>
> >>>>>>>
> >>>>>>> Hello again, NuttX folks.
> >>>>>>> This is a formal notice that your CI builds are far exceeding
> >>>>>>> the maximum resource use set out by our CI policies [1]. As you
> >>>>>>> are currently exceeding your limits by more than 300% [2] and
> >>>>>>> have not shown any signs of decreasing, we will be disabling
> >>>>>>> GitHub Actions for your project on October 30th unless you
> >>>>>>> manage to get the usage under control and below the established
> >>>>>>> limit of 25 full-time runners in a single week.
> >>>>>>>
> >>>>>>> If you have any further questions, feel free to reach out to us
> >>>>>>> at priv...@infra.apache.org
> >>>>>>>
> >>>>>>> With regards,
> >>>>>>> Daniel on behalf of ASF Infra.
> > >>>>>>> > > >>>>>>> > > >>>>>>> [1] https://infra.apache.org/github-actions-policy.html > > >>>>>>> [2] https://infra-reports.apache.org/#ghactions&project=nuttx > > >>>>>>> > > -- CeDeROM, SQ7MHZ, http://www.tomek.cedro.info