lhotari opened a new pull request #14819:
URL: https://github.com/apache/pulsar/pull/14819
### Motivation
Improve Pulsar CI:
- Reduce GitHub Action Runner resource consumption of Pulsar PR builds
- Currently, Pulsar GitHub Actions workflows are consuming the majority of
the shared pool of resources allocated for github.com/apache projects
- Running the GitHub Actions workflows for a single PR to Pulsar consumes
about 18-20 hours of GitHub Actions Runner VM time. This is too much.
- Reduce lead times for Pull Request feedback by speeding up builds
- Speeds up Pulsar development
- Improves developer productivity since waiting times are reduced
- Since PR feedback is faster, developers can be comfortable submitting
more granular pull requests.
- When the development cycle is faster, it is easier to keep the pull request
queue short. When PRs are handled quickly, there are fewer chances for pull
requests to diverge from the master branch, which also reduces merge conflicts
and the time wasted resolving them.
- Better usability and access to test reports
- Less time is spent in looking for the reason why a build failed
### Modifications
- The design goal has been to keep the build content the same as before
the refactoring. The same tests are run, but more efficiently. This
refactoring doesn't change how test retries are handled.
- Combine most of the Pulsar CI workflows into a single workflow called
"Pulsar CI"
- Only the workflows that benefit from aggregation have been combined.
- The modifications reuse binary artifacts within the workflow, which
reduces resource consumption.
- Pulsar core modules jar files are built once and reused.
- Pulsar docker images are built once and reused
- GitHub Actions cache is used to share the files. The capacity of
GitHub Actions cache is 10GB which is scoped to the developer who opens the
pull request. This means that there's plenty of disk space for PR builds (10GB
for each developer).
- Integration tests are categorized into "integration tests" and "system
tests"
- A slimmer docker image `apachepulsar/java-test-image:latest` is used to
run the integration tests that don't depend on Pulsar Python client, Tiered
storage drivers, Pulsar SQL or Pulsar Connectors.
- The previous `apachepulsar/pulsar-test-latest-version:latest` image is
used to run the integration tests that are categorized as "system tests".
- The benefit of this split is that the java-test-image builds in about 6
minutes and can start the downstream integration test jobs after this. This
results in faster developer feedback.
- For debugging builds, there's configuration for exposing ssh shell access
to each Build VM to the user who triggered the build ("github actor"). The ssh
access is authenticated with the SSH key that the user has registered in
GitHub.
- ssh access is only active in a developer's own fork. It is not enabled in
`apache/pulsar` because of security concerns.
- A developer can open a PR to their own fork (for example with a single
command with GH cli `gh pr create --repo=githubusername/pulsar --base master
--head "$(git branch --show-current)" -f`) to run the build with ssh access
enabled.
- ssh access is active for the duration of the build. If the build fails,
the build waits 5 minutes for a developer to connect to investigate the
problem. (this behavior is not enabled in `apache/pulsar`)
The SSH shell access feature makes it easier to debug CI issues that
can't be resolved with the information in the GitHub Actions UI. This is an
important capability to have available whenever there are problems. As
described above, activating the feature requires running the build in a
developer's personal fork of the pulsar repository.
- Fix the configuration in `.github/actions/tune-runner-vm/action.yml`,
which was broken by PR #13252.
- This makes the Linux kernel's vm swappiness setting effectively `1` for all
cgroups.
- Helps prevent swapping when the VM is running low on memory.
- Improve test reporting by using
https://github.com/dorny/test-reporter , which renders the JUnit XML files in
the GitHub Actions UI. Because of a GitHub Actions limitation, the test reports
get attached to the wrong workflow, which reduces usability since the reports
are harder to find.
- Improve test reporting by adding warning annotations about the test
statistics.
- not really warnings, but GitHub Actions doesn't seem to allow info
annotations from shell scripts.
- Use the built-in GitHub Actions feature to cancel duplicate build jobs:
```
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```
- a new push to a PR will trigger a new job and this feature will be used
to cancel the previous build which is obsolete
- this solution might be more effective than the current solution to
cancel duplicate jobs
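The warning annotations for test statistics mentioned above can be produced from a shell step with the GitHub Actions `::warning::` workflow command. The following is a minimal sketch and not the script used in this PR: the function name `emit_test_summary`, its arguments, and the message wording are hypothetical, and the counts are assumed to have been extracted from the test reports beforehand.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not taken from the PR): prints one GitHub Actions
# warning annotation that summarizes test counts.
# Arguments: total tests, failures, errors, skipped.
emit_test_summary() {
  local total="$1" failures="$2" errors="$3" skipped="$4"
  # A line starting with "::warning::" on stdout is interpreted by the
  # GitHub Actions runner as a warning annotation for the job.
  echo "::warning::Tests run: ${total}, failures: ${failures}, errors: ${errors}, skipped: ${skipped}"
}

# Example invocation with made-up numbers.
emit_test_summary 120 2 0 5
```

Outside of GitHub Actions the function prints an ordinary line of text, so it can be tried locally before being wired into a workflow step.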
### Additional Context
The work in this PR was mainly done last year while working on a
proof-of-concept of the GitHub Actions refactoring.
There's a Google document [[Discuss] PIP Changes to GitHub Actions based
Pulsar
CI](https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#heading=h.f53rkcu20sry)
which describes details about some technical solutions. There's also an [email
thread on the dev mailing
list](https://lists.apache.org/thread/ra2fcf7b973448bb51e00aceaeed06433e8d886270b0f0db0c80d4e0c@%3Cdev.pulsar.apache.org%3E).
The showstopper a year ago was the inability to re-run a single
failed job in a larger workflow.
GitHub has since delivered this feature, so no showstoppers remain.