Yicong-Huang opened a new issue, #4679:
URL: https://github.com/apache/texera/issues/4679
### Task Summary
Two e2e specs — `DataProcessingSpec` and `ReconfigurationSpec` — sleep
1000ms after the workflow reaches `COMPLETED` and before reading the result
document. The sleep is annotated with a TODO that calls it a workaround for
"the issue of reporting `completed` status too early":
```scala
.map(terminalOpId => {
//TODO: remove the delay after fixing the issue of reporting "completed"
status too early.
Thread.sleep(1000)
val uri = getResultUriByLogicalPortId(...)
...
})
```
The sleep no longer matches the engine's behavior. Tracing the path that
emits `COMPLETED`:
1. `DataProcessor.outputOneTuple` handles a `FinalizePort(portId,
input=false)` by calling
`outputManager.closeOutputStorageWriterIfNeeded(portId)`.
2. `closeOutputStorageWriterIfNeeded` puts a terminate signal on the writer
thread's queue **and `join()`s the thread**.
3. The writer thread, on terminate, calls `IcebergTableWriter.close()` →
`flushBuffer()` → `table.newAppend().appendFile(dataFile).commit()`
(synchronous).
4. Only after the join returns does `DataProcessor` fire `portCompleted` to
the controller.
5. After all ports complete and the worker fires `workerExecutionCompleted`,
the controller checks workflow-terminal and only then sends
`ExecutionStateUpdate(COMPLETED)` to the client.
6. On the read side, `IcebergDocument.get()` calls `seekToUsableFile()`,
which `table.refresh()`es before scanning.
Every step is synchronous or blocking on the writer; there is no observable
race for the test reader to hit. The sleep is leftover from an earlier engine
version where commit was not synchronous.
### Local validation
Removed the sleep from both files and ran the two specs together 5×:
| run | tests | wall-clock |
|---|---|---|
| 1 | 21/21 ✅ | 39.2 s |
| 2 | 21/21 ✅ | 39.5 s |
| 3 | 21/21 ✅ | 43.2 s |
| 4 | 21/21 ✅ | 37.3 s |
| 5 | 21/21 ✅ | 36.9 s |
Compared with the same two specs on the same machine before the change:
| spec | before | after | delta |
|---|---|---|---|
| DataProcessingSpec | 36.5 s / 16 tests | 16.9 s / 16 tests | -19.6 s |
| ReconfigurationSpec | 27.2 s / 5 tests | 18.7 s / 5 tests | -8.5 s |
The 16-test DataProcessingSpec saves more than 16 × 1 s because the sleep
was inside a per-terminal-port `.map(...)` block — workflows with multiple
terminal ports paid the cost more than once per test.
### Out of scope
- The deeper "premature COMPLETED" claim from the TODO is not reproduced by
today's engine code path. If a real flake recurs, this issue can be re-opened
with a stack trace; in the meantime, the sleep is dead workaround code.
- A separate cleanup target is the `withRetry { super.withFixture(test) }`
pattern these same two specs (and `PauseSpec`) carry without any test being
tagged `Retryable` — currently a no-op.
### Task Type
- [x] Refactor / Cleanup
- [x] Testing / QA
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]