Akshat-Jain opened a new pull request, #17700:
URL: https://github.com/apache/druid/pull/17700
### Description
This PR has 4 parts:
1. Creation of a new workflow
2. Moving unit tests (phase 1) to the new workflow
3. Moving unit tests (phase 2) to a weekly cron workflow
4. New reporting mechanism
I'll describe the above 4 parts separately below.
## Creation of a new workflow
This PR adds 2 new workflow yml files:
- `ci.yml`
- `worker.yml`
Eventually, `ci.yml` is expected to be the new starting point for our entire
CI workflow. However, right now, since we're migrating only the unit tests
(phase 1) to use ci.yml, it's not the starting point for now (during this
transition of individual parts one-by-one to the new workflow).
`ci.yml` is responsible for orchestrating the workflows with the help of a
`worker.yml`.
`worker.yml` is responsible for doing the actual work, as per requested by
`ci.yml` via the input arguments.
Motivation behind this change:
- Currently, our workflows are spread across a bunch of yml files, making it
difficult to understand the flows.
- Having a central file `ci.yml` will make it much easier to understand the
dependency graph between all the different workflows.
- `worker.yml` takes a `script` as one of the input args. The idea behind
this design is to make it easy for folks to be able to run the same scripts
locally, whenever they require to run the workflows locally (during development
or debugging). This becomes difficult to figure out when we have a bunch of yml
files and we're passing around input args through them, hence we wanted to keep
it as simple as possible to understand and reproduce locally.
## Moving unit tests (phase 1) to the new workflow
The unit tests (phase 1) that was being run via `unit-tests.yml`,
`reusable-unit-tests.yml` and `unit_tests_script.sh` - has been updated to run
via `ci.yml`, `worker.yml` and `run-unit-tests.sh`. As a result,
`unit-tests.yml`, `reusable-unit-tests.yml` and `unit_tests_script.sh` have
also been removed since they're unused now (the same has been done for unit
tests phase 2, as can be read in the next section).
Currently, we were running unit tests for 4 different set of modules:
`indexing`, `processing`, `server`, and "everything else". This resulted in
longer duration of the job, with processing modules taking 50+ minutes, for
example. Also, the design with "everything else" wouldn't scale in the long
term, since it's everything that isn't covered in the other 3 modules.
This design has been updated to run using regex patterns, on the starting
character of the test class. We have the following in `ci.yml`:
```
matrix:
Dtest: [ "A*,B*", "C*", "D*,E*,F*", "G*,H*", "I*,J*", "K*,L*", "M*,N*",
"O*,P*", "Q*", "R*", "S*", "T*,U*", "V*,W*,X*,Y*,Z*" ]
```
Hence, all unit test classes starting with `A` and `B` gets run as part of a
single job, across all modules. Similar thing happens for `C`, similar for
`D`,`E`,`F`, and so on.
The character groupings have been done to try and avoid skewness across
them. This also allows scaling in the long term, as well as the flexibility to
tune these groupings, if and when needed.
With this change, most groups finish in around 15-18 minutes, and a few of
them finishing in around 25 minutes.
The reporting mechanism for this design of running unit tests is being
covered in the `New reporting mechanism` section below.
## Moving unit tests (phase 2) to a weekly cron workflow
For context, the unit tests phase 1 are being run against JDK 17, whereas
the unit tests phase 2 are being run against JDKs 11 and 21.
It's very rare for a test to have different result in JDK 17 versus JDK
11/21, hence it is redundant to run unit tests phase 2 on every commit on every
PR.
Hence, the unit tests phase 2 have been removed from the regular PR
workflow, and instead have been added as a weekly cron workflow. This ensures
that we are still running them once a week, and allows us to track failures, if
any (which we ideally don't expect to have).
Unit tests phase 2 have also been updated to use the new workflow files and
the new regex pattern approach, as done for unit tests phase 1.
## New reporting mechanism
There are a few aspects to this:
1. The unit test jobs are called as success when they were able to run all
the tests as per the input parameters, even if some of the tests fail. This can
be seen in `ci.yml` where we are passing `-Dmaven.test.failure.ignore=true`
while running unit tests.
2. All unit test jobs upload the surefire reports to individual artifact,
with the same prefix.
3. We have a `reporting-unit-test-failures` job, which downloads all the
artifacts uploaded in point (2) - hence gathering all the individual surefire
reports in the same place/hierarchy, and then runs a reporter
(`mikepenz/action-junit-report@v5`) against those gathered surefire reports.
I'll add some sample run links below for how the report looks like.
4. We have a `reporting-jacoco-coverage-failures` job, which does something
similar, and then uses `create-jacoco-coverage-report.sh` to create a jacoco
report. This new script is mostly derived (with some simplifications) from the
existing `unit_tests_script.sh` we had, but the overall logic and the way of
reporting (that is, the final output) is the same as what we had currently.
Some examples for the unit test report:
1. The failed tests can be seen in the `Summary` section of the workflow
run:
https://github.com/Akshat-Jain/druid/actions/runs/13152426940/attempts/1#summary-36703177495
2. The failed tests can also be seen in the `report-unit-test-failures` job
run itself:
https://github.com/Akshat-Jain/druid/actions/runs/13152426940/job/36703177495
(you need to expand the `Publish results` dropdown).
Attaching sample images for both of the above examples below, for quicker
reference:
<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/4f7d173d-6499-4da4-8bd1-2c0d6533b745"
/>
<p align="center"><b>Image for (1)</b></p>
<img width="1716" alt="image"
src="https://github.com/user-attachments/assets/258a6521-df46-4efc-ac27-d5de397caad2"
/>
<p align="center"><b>Image for (2)</b></p>
<hr>
This PR has:
- [x] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]