Akshat-Jain opened a new pull request, #17907:
URL: https://github.com/apache/druid/pull/17907

   ### Description
   
   Lately, the `standard-its / 
(Compile=openjdk17, Run=openjdk17, Cluster Build On K8s` standard integration 
test (which triggers `ITNestedQueryPushDownTest`) has frequently gotten stuck 
and then failed after 6 hours. Sample run: 
https://github.com/apache/druid/actions/runs/14400145606/job/40386733833
   
   TL;DR: This PR updates the Maven command for this particular GitHub Actions 
job to exclude the `web-console` module. This is safe because that module is 
unrelated to the test.
   
   Adding detailed investigation notes below:
   
   On investigating, all such failed runs seem to be stuck at the following 
line in `setup_druid_on_k8s.sh`:
   ```
   mvn -B -ff -q \
         install \
         -Pdist,bundle-contrib-exts \
         -Pskip-static-checks,skip-tests \
         -Dmaven.javadoc.skip=true -T1C
   ```
   
   After removing `-q` (quiet mode) from the above command, I found that it 
hung immediately after successfully finishing the `benchmarks` module.
   
   I added some debug steps to the workflow to collect monitoring data from a 
failed run and a successful run.
   
   The failed run had an `npm ci` process (most likely the stuck process), 
whereas the successful one didn't. Also, the reactor build order places the 
`web-console` module immediately after the `benchmarks` module, which is 
consistent with the earlier observation that the command appeared stuck right 
after finishing `benchmarks`.
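
A debug step along these lines could surface a lingering `npm ci` process. This is only a sketch, not the step actually used in the investigation: the helper name `find_npm_ci` and the choice of `ps` columns are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads a process listing on stdin and prints only the
# rows whose command line contains "npm ci". Prints nothing if none match.
find_npm_ci() {
  grep -E '(^|[ /])npm ci( |$)' || true
}

# List PID, elapsed time, and full command line for every process,
# then keep only the `npm ci` rows. A long elapsed time hints at a hang.
ps -eo pid,etime,args | find_npm_ci
```

Running this in both a failed and a successful job makes the difference easy to spot: the listing is empty when no `npm ci` process is alive.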
   
   This PR mainly excludes the `web-console` module from the Maven command by 
adding `-pl '!web-console'`. Apart from this, a couple of other changes have 
been made:
   1. The Maven command has been moved from `setup_druid_on_k8s.sh` to 
`standard-its.yml`. Previously, `standard-its.yml` was calling 
`MAVEN_OPTS='-Xmx2048m' ${MVN} verify -pl integration-tests -P 
int-tests-config-file ${IT_TEST} ${MAVEN_SKIP} -Dpod.name=${POD_NAME} 
-Dpod.namespace=${POD_NAMESPACE} -Dbuild.druid.cluster=${BUILD_DRUID_CLUSTER}`, 
which invoked a chain of bash scripts that in turn ran the other Maven 
command, which is the one that got stuck. Essentially, one Maven command was 
triggering another through a chain of bash scripts. Such nested Maven commands 
could potentially run into other issues, such as competing for a lock on the 
local Maven repo, which are very difficult to debug. Hence, I moved the 
"inner" Maven command to the same level as the "outer" Maven command, so they 
now run sequentially. This also makes the flow easier to reason about.
   2. I have also added `set -x` to enable debug tracing in the bash scripts 
involved in running this IT. It doesn't add much noise, and those scripts are 
only used for this IT, so it should make future failures easier to debug. On a 
similar note, I have removed the `-q` (quiet mode) argument from the Maven 
command.
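
Putting these changes together, the build command might look roughly like the sketch below. It assumes the remaining flags are unchanged from the original command, with `-q` dropped and `-pl '!web-console'` added; the exact form in `standard-its.yml` may differ. The `DRY_RUN` switch is hypothetical and exists only so the sketch can print the command without invoking Maven.

```shell
#!/bin/sh
set -eu
set -x  # trace each command as it runs, as the PR does for the IT scripts

# Flags copied from the original command, without -q (quiet mode) and with
# the web-console module excluded from the reactor.
set -- -B -ff \
  install \
  -pl '!web-console' \
  -Pdist,bundle-contrib-exts \
  -Pskip-static-checks,skip-tests \
  -Dmaven.javadoc.skip=true -T1C

# DRY_RUN is an assumption of this sketch; it defaults to 1 so that the
# script only prints the command. Set DRY_RUN=0 to actually run Maven.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "mvn $*"
else
  mvn "$@"
fi
```

Because this runs as its own step before the `verify` command rather than underneath it, a hang shows up directly in that step's log.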
   
   I ran this test several times with this PR's change, and all runs passed:
   1. 8/8 successful runs 
[here](https://github.com/Akshat-Jain/druid/actions/runs/14429675741/job/40464412441?pr=7)
   2. 5/5 successful runs 
[here](https://github.com/Akshat-Jain/druid/actions/runs/14429950271/job/40464375268?pr=9)
   3. 2/2 successful runs 
[here](https://github.com/Akshat-Jain/druid/actions/runs/14430480632/job/40464926328?pr=8)
   
   <hr>
   
   This PR has:
   
   - [x] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

