Baunsgaard opened a new pull request, #2505: URL: https://github.com/apache/systemds/pull/2505
The `**.functions.federated.monitoring.**,**.functions.federated.multitenant.**` CI shard has been intermittently failing and, when reruns pile up, timing out at the 30-minute job cap (e.g. [run 28041416088](https://github.com/apache/systemds/actions/runs/28041416088/job/83008345704)).

Root cause: the multi-tenant test config pins `sysds.federated.timeout` to **16s**. This value bounds *both* federated instruction execution (`FederationMap`/`FederatedData`) and end-of-run stats collection (`FederatedStatistics.collectFedStats`). For the Spark-backed (`*SP`) variants of the reuse tests, Spark context creation alone is ~14s, so under shared CI load a single federated request regularly exceeds 16s and throws `TimeoutException`:

- `SPARK°rightIndex°… DMLRuntimeException -- java.util.concurrent.TimeoutException`
- `Exception … thrown while getting the federated stats of the federated response`

The spurious timeouts make `FederatedReuseReadTest.testModifiedValLineageSP` and `FederatedSerializationReuseTest.testRowSumsSP` fail; surefire then reruns each failing test, and the accumulated rerun time repeatedly pushes the shard past the 30-minute cap, cancelling the whole job.

This bumps the timeout to **60s**: still a hard bound on a genuinely runaway request (the suite cannot hang silently, which is the reason it was lowered from 128 → 16 in `8f5a42c0`), but enough headroom for the Spark variants to pass on the first attempt, which also removes the expensive reruns and brings the shard comfortably back under the time cap.
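To illustrate the failure mode described above, here is a minimal sketch of a hard timeout on an asynchronous request, using only `java.util.concurrent`. It is not the SystemDS code path: `TimeoutBoundSketch` and `completesWithin` are hypothetical names, the sleep is a stand-in for Spark context creation plus the federated request, and seconds are scaled down to tens of milliseconds so the sketch runs quickly. The 20-unit figure for a request under CI load is an assumption chosen only to land between the two bounds.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutBoundSketch {

    // Hypothetical helper: returns true if a task needing workMillis
    // finishes before the hard bound timeoutMillis, false if the bound is hit.
    static boolean completesWithin(long workMillis, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> response = pool.submit(() -> {
                // Stand-in for Spark context creation plus the actual request.
                Thread.sleep(workMillis);
                return "ok";
            });
            // Waits at most timeoutMillis, then throws TimeoutException.
            response.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            // Cancel the task if it is still running so the JVM can exit.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Scale: 1 "second" of the PR description = 10 ms here.
        long loadedRequest = 200; // assumed: ~14 startup + extra under CI load
        System.out.println("bound 16: " + completesWithin(loadedRequest, 160));
        System.out.println("bound 60: " + completesWithin(loadedRequest, 600));
    }
}
```

With the tighter bound the same request is reported as a timeout even though it would have finished shortly afterwards; with the larger bound it completes, while a request that never returns would still be cut off.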
## Evidence (recurring flake, last ~2 weeks)

Same two tests, same `TimeoutException` signature, across independent runs:

| Run | Date | Failing test(s) |
|---|---|---|
| [27910160773](https://github.com/apache/systemds/actions/runs/27910160773) | Jun 21 | `testModifiedValLineageSP`, `testRowSumsSP` |
| [27645002589](https://github.com/apache/systemds/actions/runs/27645002589) | Jun 16 | `testModifiedValLineageSP` |
| [27542041413](https://github.com/apache/systemds/actions/runs/27542041413) | Jun 15 | `testModifiedValLineageSP`, `testPlusScalarCP`, `testRowSumsSP` |
| [27363185073](https://github.com/apache/systemds/actions/runs/27363185073) | Jun 11 | `testModifiedValLineageSP`, `testRow…` |
