aglinxinyuan commented on code in PR #5781:
URL: https://github.com/apache/texera/pull/5781#discussion_r3449140851


##########
amber/src/main/scala/org/apache/texera/web/service/WorkflowService.scala:
##########
@@ -292,11 +292,23 @@ class WorkflowService(
       executionService.onNext(execution)
       execution.executeWorkflow()
     } catch {
-      case e: Throwable => errorHandler(e)
+      case e: Throwable =>
+        errorHandler(e)
+        reportFatalErrorsToSubscribers(executionStateStore)

Review Comment:
   Good question — let me split it into the two parts.
   
   **Why the existing state-store path can't cover this:** it *does* work for a 
running execution, but not for the case this PR targets — an exception thrown 
inside `new WorkflowExecutionService(...)`. The `fatalErrors → 
WorkflowErrorEvent` diff handler is registered *inside* that constructor 
(`WorkflowExecutionService.scala:72-86`), and the only thing that forwards an 
execution's store to the socket is `connectToExecution`, which subscribes after 
`executionService.onNext(execution)` (`WorkflowService.scala:292`). If the 
constructor throws we jump straight to the `catch` and never reach line 292 — 
so `errorHandler` writes FAILED + fatalErrors into a metadata store that has 
*neither an emitter nor a subscriber*, and it's silently dropped. 
(`connect()`'s own state-store subscription is over `WorkflowStateStore`, which 
only holds the result store, so it can't carry a `WorkflowErrorEvent` either.)
   
   **On refresh:** you're right it isn't durable across a cold refresh — but 
that's already true of the state-store path, not something this change 
introduces. `fatalErrors` are never persisted (`updateWorkflowState` writes 
only a status byte), and `getWebsocketEventObservable` doesn't replay past 
diffs (it's `buffer(2,1)` over the live state subject), so a cold refresh loses 
the error on *either* path today. `errorSubject` is a `BehaviorSubject`, so a 
*within-instance* refresh does replay the last error via `connect()`; a refresh 
after the cleanup timeout (which evicts the instance) loses it — same as the 
state store would.
   
   **What I changed in response:** I agree it should go through the state store 
wherever it can, so I gated the `errorSubject` push to the **pre-publish window 
only**. Once the execution is published, the state-store diff handler already 
surfaces the error — pushing here too was actually double-emitting on the 
`executeWorkflow()` failure path. So `errorSubject` is now just the narrow 
fallback for the constructor-failure window, where the state store genuinely 
can't reach a subscriber yet. (pushed in `547e558`)
   
   For real cross-refresh durability we'd need to persist `fatalErrors` to 
`workflow_executions` and rehydrate in `getOrCreate` — happy to file that as a 
follow-up, since it helps both paths and is bigger than this PR. WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to