[I] Flaky DataProcessingSpec intermittently fails the amber CI job [texera]

via GitHub Wed, 10 Jun 2026 17:35:58 -0700


aglinxinyuan opened a new issue, #5614:
URL: https://github.com/apache/texera/issues/5614


   ## Summary
   
   `build / amber (ubuntu-22.04, 17)` keeps failing intermittently with exactly 
one failed test in `org.apache.texera.amber.engine.e2e.DataProcessingSpec` 
(920/921 pass). Re-running the job passes. It is hitting pushes to `main`, PR 
builds, and merge-queue runs.
   
   ## Recent occurrences (all first attempts)
   
   | When (UTC) | Trigger | Failing job |
   |---|---|---|
   | 2026-06-10 21:02 | push to `main` | [run 27305466218](https://github.com/apache/texera/actions/runs/27305466218/job/80661998786) |
   | 2026-06-10 12:20 | merge queue (pr-5417) | [run 27273904225](https://github.com/apache/texera/actions/runs/27273904225/job/80554243749) |
   | 2026-06-10 07:58 | PR `ui-parameter-backend-python` | [run 27260606211](https://github.com/apache/texera/actions/runs/27260606211/job/80507060550) |
   | 2026-06-10 03:46 | merge queue (pr-5317) | [run 27251321786](https://github.com/apache/texera/actions/runs/27251321786/job/80476278198) |
   | 2026-06-07 02:06 | merge queue (pr-5402) | [run 27079691742](https://github.com/apache/texera/actions/runs/27079691742/job/79923261299) |
   
   All five logs end the same way:
   
   ```
   [info] *** 1 TEST FAILED ***
   [error] Failed tests:
   [error]   org.apache.texera.amber.engine.e2e.DataProcessingSpec
   [error] (WorkflowExecutionService / Test / test) sbt.TestsFailedException: Tests unsuccessful
   ```
   
   ## What the log shows (run 27305466218)
   
   The console does not print which test inside the suite failed, but the 
`DataProcessingSpec` activity timeline reconstructs it:
   
   ```
   20:56:53  suite starts; early tests pass normally
   20:56:58  EndHandler throws IllegalStateException
             ("worker still has unprocessed messages") on an aggregate worker
             -> controller logs "Failed to terminate region 0"
   20:57:03  workers for the next aggregate workflow register ... then nothing
      |
      |      (zero DataProcessingSpec log lines for a full minute)
      v
   20:58:02  one lone Iceberg commit retry on a KeywordSearch result table
      |
      |      (silent for another full minute)
      v
   20:59:03  join-test workflows run and finish in <1s
   20:59:04  last DataProcessingSpec activity; suite later reports 1 test failed
   ```
   
   The two consecutive ~60-second silent windows match the 1-minute `Await` in [`executeWorkflow`](https://github.com/apache/texera/blob/main/amber/src/test/scala/org/apache/texera/amber/engine/e2e/DataProcessingSpec.scala#L149) plus the one automatic retry the suite already performs ([`withFixture`/`withRetry`](https://github.com/apache/texera/blob/main/amber/src/test/scala/org/apache/texera/amber/engine/e2e/DataProcessingSpec.scala#L65-L71)). In other words, the test timed out waiting for `COMPLETED`, the retry timed out again, and the suite failed.
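
The gap arithmetic can be checked mechanically. The following is a minimal sketch, assuming the timestamps are copied by hand from the timeline in this issue; `find_silent_gaps` and the 55-second threshold are illustrative choices, not part of the Texera code base.

```python
from datetime import datetime

# Timestamps copied from the timeline in this issue (run 27305466218).
TIMELINE = ["20:56:53", "20:56:58", "20:57:03", "20:58:02", "20:59:03", "20:59:04"]


def find_silent_gaps(stamps, min_gap_seconds=55):
    """Return (start, end, seconds) for each pair of consecutive stamps
    separated by at least min_gap_seconds.

    Assumes all stamps fall on the same day (no midnight wrap-around).
    """
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    gaps = []
    for prev, cur, a, b in zip(times, times[1:], stamps, stamps[1:]):
        delta = int((cur - prev).total_seconds())
        if delta >= min_gap_seconds:
            gaps.append((a, b, delta))
    return gaps


gaps = find_silent_gaps(TIMELINE)
for start, end, seconds in gaps:
    print(f"{start} -> {end}: {seconds}s silent")
```

On this timeline the sketch reports two windows, 59 s and 61 s, which is consistent with one 1-minute timeout followed by one retry that also times out.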
   
   ### Signal 1 — region termination fails
   
   ```
   [ERROR] [WF1-AggregateOpDesc-514b1b-globalAgg-0] [AsyncRPCServer] - Exception occurred
   java.lang.IllegalStateException: worker still has unprocessed messages
       at ...worker.promisehandlers.EndHandler.endWorker(EndHandler.scala:51)
       ...
   [WARN] [CONTROLLER] [RegionExecutionCoordinator] - Failed to terminate region 0
   ```
   
   A region that never terminates means the workflow never reaches `COMPLETED`, 
which is exactly what the 1-minute `Await` then times out on.
   
   ### Signal 2 — Iceberg commit conflicts on operator-port-result tables
   
   The same run has 18 `CommitFailedException` retry warnings ("metadata 
location ... has changed") on `operator-port-result.wid_1_eid_1_...` tables, 
coming from the output-port storage writers:
   
   ```
   org.apache.iceberg.exceptions.CommitFailedException: Cannot commit operator-port-result.wid_1_eid_1_... :
   metadata location .../metadata/00000-....metadata.json has changed from .../metadata/00001-....metadata.json
       at org.apache.iceberg.jdbc.JdbcTableOperations.updateTable(JdbcTableOperations.java:166)
       ...
       at ...storage.result.iceberg.IcebergTableWriter.flushBuffer(IcebergTableWriter.scala:131)
       at ...storage.result.iceberg.IcebergTableWriter.close(IcebergTableWriter.scala:142)
       at ...worker.managers.OutputPortStorageWriterThread.run(OutputPortStorageWriterThread.scala:63)
   ```
   
   Multiple workers of the same operator commit to the same port-result table concurrently, so the optimistic JDBC-catalog commit conflicts and retries. It is not yet clear whether this causes the timeout or is just noise that slows port completion down.
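
To illustrate why concurrent writers produce these retries, here is a toy model of optimistic (compare-and-swap) commits. It is a sketch under stated assumptions, not Iceberg's `JdbcTableOperations`: `ToyCatalog` and `simulate` are hypothetical names, and the model assumes the worst-case interleaving in which every pending writer reads the table version before any of them commits.

```python
class ToyCatalog:
    """Hypothetical stand-in for a catalog: one metadata version per table."""

    def __init__(self):
        self.version = 0

    def commit(self, expected_version):
        # Compare-and-swap: succeed only if nobody committed since the read.
        if self.version != expected_version:
            return False
        self.version += 1
        return True


def simulate(num_writers):
    """Return the number of failed commit attempts (i.e. retries) when
    num_writers each need to land exactly one commit on the same table."""
    catalog = ToyCatalog()
    pending = num_writers
    conflicts = 0
    while pending > 0:
        base = catalog.version  # every pending writer reads the same base
        winners = 0
        for _ in range(pending):
            if catalog.commit(base):
                winners += 1
            else:
                conflicts += 1  # stale base: this writer must retry
        pending -= winners
    return conflicts


for n in (1, 2, 4):
    print(n, "writers ->", simulate(n), "conflicting commits")
```

In this model only one writer wins per round, so conflicts grow quadratically with the number of writers sharing a table (0, 1, and 6 for 1, 2, and 4 writers). Whether the real writers interleave this badly in CI is an open question; the model only shows that the retry warnings are an expected consequence of several workers sharing one port-result table.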
   
   ## Notes
   
   - The spec already retries each test once specifically because "in the CI 
environment, there is a chance that executeWorkflow does not receive COMPLETED 
status" — the current failure mode survives that retry.
   - 4 of the 5 confirmed occurrences are within a single day (2026-06-10), so 
the failure rate seems to have increased recently.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
