aglinxinyuan opened a new issue, #5614:
URL: https://github.com/apache/texera/issues/5614

## Summary

`build / amber (ubuntu-22.04, 17)` keeps failing intermittently with exactly one failed test in `org.apache.texera.amber.engine.e2e.DataProcessingSpec` (920/921 pass). Re-running the job passes. It is hitting pushes to `main`, PR builds, and merge-queue runs.

## Recent occurrences (all first attempts)

| When (UTC) | Trigger | Failing job |
|---|---|---|
| 2026-06-10 21:02 | push to `main` | [run 27305466218](https://github.com/apache/texera/actions/runs/27305466218/job/80661998786) |
| 2026-06-10 12:20 | merge queue (pr-5417) | [run 27273904225](https://github.com/apache/texera/actions/runs/27273904225/job/80554243749) |
| 2026-06-10 07:58 | PR `ui-parameter-backend-python` | [run 27260606211](https://github.com/apache/texera/actions/runs/27260606211/job/80507060550) |
| 2026-06-10 03:46 | merge queue (pr-5317) | [run 27251321786](https://github.com/apache/texera/actions/runs/27251321786/job/80476278198) |
| 2026-06-07 02:06 | merge queue (pr-5402) | [run 27079691742](https://github.com/apache/texera/actions/runs/27079691742/job/79923261299) |

All five logs end the same way:

```
[info] *** 1 TEST FAILED ***
[error] Failed tests:
[error]   org.apache.texera.amber.engine.e2e.DataProcessingSpec
[error] (WorkflowExecutionService / Test / test) sbt.TestsFailedException: Tests unsuccessful
```

## What the log shows (run 27305466218)

The console does not print which test inside the suite failed, but the `DataProcessingSpec` activity timeline reconstructs it:

```
20:56:53  suite starts; early tests pass normally
20:56:58  EndHandler throws IllegalStateException ("worker still has unprocessed messages") on an aggregate worker -> controller logs "Failed to terminate region 0"
20:57:03  workers for the next aggregate workflow register ... then nothing
   |
   |      (zero DataProcessingSpec log lines for a full minute)
   v
20:58:02  one lone Iceberg commit retry on a KeywordSearch result table
   |
   |      (silent for another full minute)
   v
20:59:03  join-test workflows run and finish in <1s
20:59:04  last DataProcessingSpec activity; suite later reports 1 test failed
```

The two consecutive ~60-second silent windows match the 1-minute `Await` in [`executeWorkflow`](https://github.com/apache/texera/blob/main/amber/src/test/scala/org/apache/texera/amber/engine/e2e/DataProcessingSpec.scala#L149) plus the one automatic retry the suite already performs ([`withFixture`/`withRetry`](https://github.com/apache/texera/blob/main/amber/src/test/scala/org/apache/texera/amber/engine/e2e/DataProcessingSpec.scala#L65-L71)): the test timed out waiting for `COMPLETED`, the retry timed out again, and the suite failed.

### Signal 1: region termination fails

```
[ERROR] [WF1-AggregateOpDesc-514b1b-globalAgg-0] [AsyncRPCServer] - Exception occurred
java.lang.IllegalStateException: worker still has unprocessed messages
    at ...worker.promisehandlers.EndHandler.endWorker(EndHandler.scala:51)
    ...
[WARN] [CONTROLLER] [RegionExecutionCoordinator] - Failed to terminate region 0
```

A region that never terminates means the workflow never reaches `COMPLETED`, which is exactly what the 1-minute `Await` then times out on.

### Signal 2: Iceberg commit conflicts on operator-port-result tables

The same run has 18 `CommitFailedException` retry warnings ("metadata location ... has changed") on `operator-port-result.wid_1_eid_1_...` tables, coming from the output-port storage writers:

```
org.apache.iceberg.exceptions.CommitFailedException: Cannot commit operator-port-result.wid_1_eid_1_... : metadata location .../metadata/00000-....metadata.json has changed from .../metadata/00001-....metadata.json
    at org.apache.iceberg.jdbc.JdbcTableOperations.updateTable(JdbcTableOperations.java:166)
    ...
    at ...storage.result.iceberg.IcebergTableWriter.flushBuffer(IcebergTableWriter.scala:131)
    at ...storage.result.iceberg.IcebergTableWriter.close(IcebergTableWriter.scala:142)
    at ...worker.managers.OutputPortStorageWriterThread.run(OutputPortStorageWriterThread.scala:63)
```

Multiple workers of the same operator commit to the same port-result table concurrently, so the optimistic JDBC-catalog commit conflicts and retries. Unclear yet whether this is the cause of the timeout or just noise that slows port completion down.

## Notes

- The spec already retries each test once specifically because "in the CI environment, there is a chance that executeWorkflow does not receive COMPLETED status"; the current failure mode survives that retry.
- 4 of the 5 confirmed occurrences are within a single day (2026-06-10), so the failure rate seems to have increased recently.
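
A minimal sketch of why a workflow that never reaches `COMPLETED` would show up as two back-to-back silent windows, assuming the suite behaves as described above (a bounded wait, then exactly one retry). It is not the spec's actual code: `awaitCompleted` and `withOneRetry` are hypothetical names, `scala.concurrent.Await` stands in for whatever `Await` the spec uses, and the timeout is scaled down from 1 minute to 200 ms.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._
import scala.util.Try

object RetryTimeoutSketch {
  // Hypothetical stand-in for the wait inside executeWorkflow: block until the
  // workflow signals COMPLETED (the promise is fulfilled) or the timeout elapses.
  def awaitCompleted(completed: Promise[Unit], timeout: FiniteDuration): Unit =
    Await.result(completed.future, timeout)

  // Hypothetical stand-in for the suite's single retry: run the attempt once and,
  // only if it throws, run it exactly one more time.
  def withOneRetry(attempt: () => Unit): Try[Unit] =
    Try(attempt()).orElse(Try(attempt()))

  def main(args: Array[String]): Unit = {
    val timeout = 200.millis             // scaled down from the 1-minute Await
    val neverCompleted = Promise[Unit]() // models a region that never terminates

    val start = System.nanoTime()
    val outcome = withOneRetry(() => awaitCompleted(neverCompleted, timeout))
    val elapsedMs = (System.nanoTime() - start) / 1000000L

    // Both attempts time out, so the total wait is roughly two timeout windows.
    val timedOut = outcome.failed.toOption.exists(_.isInstanceOf[TimeoutException])
    println(s"timed out after retry: $timedOut, waited about $elapsedMs ms")
  }
}
```

With the real 1-minute timeout, the same shape would give roughly two minutes of waiting before the test is reported as failed, which is consistent with the gap between 20:57:03 and 20:59:03 in the timeline. A retry only helps when the second attempt can behave differently from the first; if whatever blocks `COMPLETED` persists across attempts, the retry just doubles the wait.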
