Baunsgaard opened a new pull request, #2492:
URL: https://github.com/apache/systemds/pull/2492

   The **.component.c**.** Java test job still intermittently runs until 
the 30-minute GitHub Actions cap with no further output: a surefire fork stalls 
in a way that surefire's own timeouts never catch (a fork wedged around the 
booter handshake, or a starved maven parent), so neither 
forkedProcessTimeoutInSeconds nor forkedProcessExitTimeoutInSeconds fires and 
the job is cancelled with nothing to diagnose. The stall does not reproduce 
locally, so the only place to capture evidence is CI.
   
   Add an outer guard in the docker test entrypoint that watches the test log 
for a stall (no new line for a window kept just above the 600s per-fork 
surefire timeout) and an absolute runtime ceiling below the job cap. On either 
trigger it force-dumps thread stacks from every JVM in the test process tree 
via SIGQUIT (relayed into the job log) plus a jstack file backup, then 
force-kills the tree so the job fails fast WITH stacks instead of being 
cancelled empty-handed. Limits are overridable via SYSDS_TEST_STALL_LIMIT and 
SYSDS_TEST_MAX_RUNTIME.
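
   The guard described above could look roughly like the following bash 
sketch. It is an illustration under stated assumptions, not the code in this 
PR: the function names, POLL_INTERVAL, DUMP_GRACE, DUMP_DIR and both default 
limits (660s and 1680s) are hypothetical, and a stall is approximated as "the 
log's line count has not changed". Relaying the SIGQUIT output into the job 
log is not covered here.

```shell
#!/usr/bin/env bash
# Sketch of an outer stall guard; names and defaults are assumptions.
# Stall window: just above the 600s surefire fork timeout (assumed 660s).
# Runtime ceiling: below the 30-minute job cap (assumed 1680s).
STALL_LIMIT="${SYSDS_TEST_STALL_LIMIT:-660}"
MAX_RUNTIME="${SYSDS_TEST_MAX_RUNTIME:-1680}"
POLL_INTERVAL="${POLL_INTERVAL:-10}"  # hypothetical: seconds between checks
DUMP_GRACE="${DUMP_GRACE:-5}"         # hypothetical: wait for dumps to flush
DUMP_DIR="${DUMP_DIR:-.}"             # hypothetical: where jstack files go

# Print a pid followed by all of its descendants, one per line.
process_tree() {
  local pid="$1" child
  echo "$pid"
  for child in $(pgrep -P "$pid" 2>/dev/null); do
    process_tree "$child"
  done
}

# Dump thread stacks from every JVM in the tree, then force-kill the tree.
dump_and_kill() {
  local pids pid
  pids=$(process_tree "$1")
  for pid in $pids; do
    if [ "$(ps -o comm= -p "$pid" 2>/dev/null)" = "java" ]; then
      # SIGQUIT makes a HotSpot JVM print a thread dump to its stdout.
      kill -QUIT "$pid" 2>/dev/null || true
      # File backup in case the stdout dump never reaches the job log.
      jstack "$pid" > "$DUMP_DIR/jstack-$pid.txt" 2>&1 || true
    fi
  done
  sleep "$DUMP_GRACE"
  for pid in $pids; do
    kill -KILL "$pid" 2>/dev/null || true
  done
}

# Watch process $1 and log file $2. Returns 0 if the process ends on its
# own, 1 if the guard had to dump and kill it.
watch_tests() {
  local root="$1" log="$2"
  local start now lines last_lines=0 last_change
  start=$(date +%s)
  last_change=$start
  while kill -0 "$root" 2>/dev/null; do
    sleep "$POLL_INTERVAL"
    kill -0 "$root" 2>/dev/null || break
    now=$(date +%s)
    lines=$( { wc -l < "$log"; } 2>/dev/null || echo 0 )
    if [ "$lines" -ne "$last_lines" ]; then
      last_lines=$lines
      last_change=$now
    fi
    if [ $(( now - last_change )) -ge "$STALL_LIMIT" ]; then
      echo "guard: no new log line for ${STALL_LIMIT}s" >&2
    elif [ $(( now - start )) -ge "$MAX_RUNTIME" ]; then
      echo "guard: runtime exceeded ${MAX_RUNTIME}s" >&2
    else
      continue
    fi
    dump_and_kill "$root"
    return 1
  done
  return 0
}

# Example wiring (hypothetical command and log path):
#   mvn test > test.log 2>&1 &
#   watch_tests "$!" test.log || exit 1
```

   Collecting the process tree once before signalling keeps the kill list 
stable even if a parent dies first and its children are re-parented.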
   
   Also set surefire runOrder to alphabetical so the hang reproduces at a 
stable class boundary across runs, making the responsible class identifiable 
from the captured dumps.

