Baunsgaard opened a new pull request, #2505:
URL: https://github.com/apache/systemds/pull/2505

   The 
`**.functions.federated.monitoring.**,**.functions.federated.multitenant.**` CI 
shard has been intermittently failing and, when reruns pile up, timing out at 
the 30-minute job cap (e.g. [run 
28041416088](https://github.com/apache/systemds/actions/runs/28041416088/job/83008345704)).
   
   Root cause: the multi-tenant test config pins `sysds.federated.timeout` to 
**16s**. This value bounds *both* federated instruction execution 
(`FederationMap`/`FederatedData`) and end-of-run stats collection 
(`FederatedStatistics.collectFedStats`). For the Spark-backed (`*SP`) variants 
of the reuse tests, Spark context creation alone is ~14s, so under shared CI 
load a single federated request regularly exceeds 16s and throws 
`TimeoutException`:
   
   - `SPARK°rightIndex°…  DMLRuntimeException -- 
java.util.concurrent.TimeoutException`
   - `Exception … thrown while getting the federated stats of the federated 
response`
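
The failure mode above can be reproduced in isolation. The following is a minimal sketch, not SystemDS code: it assumes the federated request behaves like a `Future` awaited with a bounded `get`, and the class name `FederatedTimeoutSketch`, the helper `completesWithin`, and the 30-unit request duration are hypothetical. Durations are scaled to 10 ms per second so the example finishes quickly.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class FederatedTimeoutSketch {

    /**
     * Runs a simulated request that takes workMillis and reports whether it
     * finished within timeoutMillis. Hypothetical helper, not a SystemDS API.
     */
    public static boolean completesWithin(long workMillis, long timeoutMillis)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> response = pool.submit(() -> {
                // Stand-in for Spark context creation plus the instruction itself.
                Thread.sleep(workMillis);
                return "ok";
            });
            try {
                // The bounded wait: the same bound applies however slow the worker is.
                response.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                // Give up on the request, as a timed-out federated call would.
                response.cancel(true);
                return false;
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        long slowRequest = 300; // hypothetical slow request under CI load (scaled)
        System.out.println("16s-equivalent bound ok? " + completesWithin(slowRequest, 160));
        System.out.println("60s-equivalent bound ok? " + completesWithin(slowRequest, 600));
    }
}
```

In this sketch the shorter bound gives up on a request that would have succeeded with more time, while the longer bound still caps how long a truly stuck request can block.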
   
   The spurious timeouts make `FederatedReuseReadTest.testModifiedValLineageSP` 
and `FederatedSerializationReuseTest.testRowSumsSP` fail; surefire then reruns 
each failing test, and the accumulated rerun time repeatedly pushes the shard 
past the 30-minute cap, cancelling the whole job.
   
   This PR raises the timeout to **60s**. That is still a hard bound on a 
genuinely runaway request, so the suite cannot hang silently (the reason the 
value was lowered from 128 to 16 in `8f5a42c0`), but it leaves enough headroom 
for the Spark variants to pass on the first attempt. Passing on the first 
attempt also removes the expensive reruns and brings the shard comfortably back 
under the 30-minute cap.
   
   ## Evidence (recurring flake, last ~2 weeks)
   
   The same two tests fail with the same `TimeoutException` signature across independent runs (one run also lists `testPlusScalarCP`):
   
   | Run | Date | Failing test(s) |
   |---|---|---|
   | [27910160773](https://github.com/apache/systemds/actions/runs/27910160773) | Jun 21 | `testModifiedValLineageSP`, `testRowSumsSP` |
   | [27645002589](https://github.com/apache/systemds/actions/runs/27645002589) | Jun 16 | `testModifiedValLineageSP` |
   | [27542041413](https://github.com/apache/systemds/actions/runs/27542041413) | Jun 15 | `testModifiedValLineageSP`, `testPlusScalarCP`, `testRowSumsSP` |
   | [27363185073](https://github.com/apache/systemds/actions/runs/27363185073) | Jun 11 | `testModifiedValLineageSP`, `testRow

