carloea2 opened a new issue, #5140: URL: https://github.com/apache/texera/issues/5140
### What happened?

`WorkflowResource.deleteWorkflow` opens a JDBC transaction and CASCADE-deletes the workflow row without first stopping any in-flight executions that target the same workflow. While the `ComputingUnitWorker` keeps writing to FK-child tables (`workflow_view_count`, `workflow_executions`, `workflow_user_likes`, …), the CASCADE check blocks on a row-level lock and never returns. Every subsequent `createWorkflow` / `deleteWorkflow` / view-count POST piles up behind it on the same lock.

From the user's perspective the Workflows page becomes fully unresponsive: uploads hang with no error, deletes hang, and the webpack-dev-server proxy eventually emits `ECONNRESET` then `ECONNREFUSED`. Recovery requires restarting the JVMs.

Problematic code at [`WorkflowResource.scala:631`](https://github.com/apache/texera/blob/main/amber/src/main/scala/org/apache/texera/web/resource/dashboard/user/workflow/WorkflowResource.scala#L631):

```scala
context.transaction { _ =>
  for (wid <- workflowIDs.wids) {
    if (workflowOfUserExists(wid, user.getUid)) {
      workflowDao.deleteById(wid)
    } else {
      throw new BadRequestException("The workflow does not exist.")
    }
  }
}
```

No active-execution check, no `lock_timeout` / `statement_timeout`, no error path: the request thread sits in `executeQuery` indefinitely.

### Suggested fixes (in order of preference)

1. **Cancel running executions before deleting.** In `deleteWorkflow`, look up active executions of the workflow via `ExecutionResultService` / `WorkflowExecutionsResource` and abort them before opening the delete transaction. Deleting a workflow should imply "stop everything that depends on it".
2. **Bound the delete transaction.** `SET LOCAL lock_timeout = '10s'; SET LOCAL statement_timeout = '30s';` at the start of the transaction so a hung child-table lock surfaces as a 5xx instead of freezing the entire workflow API.
3. **Independently, harden `HubResource.postView`.** It blindly upserts into `workflow_view_count` for whatever wid the dashboard sends; if that wid was just deleted in another tab, the FK violation throws as a 500 and stale tabs keep retrying, exacerbating the contention. An existence check (`context.fetchExists(BaseEntityTable(entityType).table, idColumn.eq(entityID))`) before the upsert turns those into a no-op `return 0`.

### Workaround

Kill the Texera JVMs (`TexeraWebApplication`, `ComputingUnitWorker`, `ComputingUnitMaster`), restart them, then reload the Workflows page to clear any cached stale wids being POSTed for view-count.

### How to reproduce?

1. Open a workflow and start an execution that keeps the worker busy for >10 s (e.g. an iris ML pipeline).
2. While the execution is still running, navigate to `/dashboard/user/workflow` and delete that workflow from the row's delete action.
3. Try to upload another workflow (or delete a second one) from the same page.

**Expected:** upload completes; delete completes once the execution is canceled or finishes.

**Observed:** delete hangs forever, upload hangs forever, every subsequent workflow-table write piles up behind the same lock. After enough pileup the JVM closes connections under socket pressure and the dev-server proxy starts emitting `ECONNRESET → ECONNREFUSED`.

### Branch

main

### Commit Hash (Optional)

_No response_

### What browsers are you seeing the problem on?

Not browser-specific: reproduces on any client; the freeze is server-side.

### Relevant log output

```shell
# Thread dump of TexeraWebApplication while the API is frozen
# Problem: one open delete transaction holding the row lock,
# every other workflow-table write queued behind it.
"dw-NN" #N daemon (waiting on Postgres response)
    at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:137)
    at org.jooq.tools.jdbc.DefaultPreparedStatement.executeQuery(DefaultPreparedStatement.java:104)
    at org.jooq.impl.AbstractDMLQuery.executeReturningQuery(AbstractDMLQuery.java:1249)
    at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:428)
    at org.jooq.impl.AbstractDMLQuery.execute(AbstractDMLQuery.java:961)
    at org.jooq.impl.DAOImpl.deleteById(DAOImpl.java:284)
    at org.apache.texera.web.resource.dashboard.user.workflow.WorkflowResource.$anonfun$deleteWorkflow$3(WorkflowResource.scala:634)
    at org.jooq.impl.DefaultDSLContext.lambda$transaction$5(DefaultDSLContext.java:612)
    at org.jooq.impl.DefaultDSLContext.transaction(DefaultDSLContext.java:611)
    at org.apache.texera.web.resource.dashboard.user.workflow.WorkflowResource.deleteWorkflow(WorkflowResource.scala:631)

"dw-MM" / "dw-OO" / "dw-PP" ... (queued behind the open transaction)
    at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:137)
    at org.jooq.impl.AbstractDMLQuery.execute(AbstractDMLQuery.java:1074)
    at org.jooq.impl.TableRecordImpl.storeInsert0(TableRecordImpl.java:193)
    at org.jooq.impl.TableRecordImpl.insert(TableRecordImpl.java:140)
    at org.jooq.impl.DAOImpl.insert(DAOImpl.java:156)
    at org.apache.texera.web.resource.dashboard.user.workflow.WorkflowResource$.insertWorkflow(WorkflowResource.scala:89)
    at org.apache.texera.web.resource.dashboard.user.workflow.WorkflowResource.createWorkflow(WorkflowResource.scala:573)

# Frontend webpack-dev-server proxy view of the same incident:
[HPM] Error occurred while proxying request localhost:4200/api/workflow/create to http://localhost:8080/ [ECONNRESET]
[HPM] Error occurred while proxying request localhost:4200/api/workflow/delete to http://localhost:8080/ [ECONNRESET]
... (many lines later, after enough socket exhaustion)
[HPM] Error occurred while proxying request localhost:4200/api/workflow/create to http://localhost:8080/ [ECONNREFUSED]

# Foreign-key violation that compounds the contention while the lock is held,
# fired by stale dashboard tabs POSTing /api/hub/view for the deleted wid:
org.jooq.exception.DataAccessException: SQL [insert into "texera_db"."workflow_view_count" ("wid", "view_count") values (?, ?) on conflict ("wid") do update set "view_count" = ("texera_db"."workflow_view_count"."view_count" + ?) returning "texera_db"."workflow_view_count"."view_count"]; ERROR: insert or update on table "workflow_view_count" violates foreign key constraint "workflow_view_count_wid_fkey"
  Detail: Key (wid)=(173) is not present in table "workflow".
    at org.apache.texera.web.resource.dashboard.hub.HubResource.postView(HubResource.scala:401)
```
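For discussion, here is a rough sketch of how suggested fixes 1 and 2 might be ordered inside `deleteWorkflow`. It is not a patch against the real code: `ExecutionControl`, `WorkflowStore`, `ownedBy`, `inTransaction`, and `SafeDelete` are hypothetical stand-ins for the execution service and the jOOQ-backed DAO, and `IllegalArgumentException` stands in for `BadRequestException` so that the snippet has no dependencies. Only the timeout values come from the issue text.

```scala
// Hypothetical interface: stands in for whatever service tracks running executions.
trait ExecutionControl {
  def activeExecutions(wid: Int): Seq[Long] // ids of executions still running for this workflow
  def abort(eid: Long): Unit
}

// Hypothetical interface: stands in for the workflow DAO and transaction handling.
trait WorkflowStore {
  def ownedBy(wid: Int, uid: Int): Boolean
  // Runs `body` inside one transaction after issuing each statement in `settings`.
  def inTransaction[A](settings: Seq[String])(body: => A): A
  def deleteById(wid: Int): Unit
}

object SafeDelete {
  // Values from the issue; SET LOCAL scopes them to the current transaction only.
  val Timeouts: Seq[String] = Seq(
    "SET LOCAL lock_timeout = '10s'",
    "SET LOCAL statement_timeout = '30s'"
  )

  def deleteWorkflows(
      wids: Seq[Int],
      uid: Int,
      exec: ExecutionControl,
      store: WorkflowStore
  ): Unit = {
    // Assumption: validate up front so that a rejected request aborts nothing.
    wids.foreach { wid =>
      if (!store.ownedBy(wid, uid))
        throw new IllegalArgumentException("The workflow does not exist.")
    }

    // Fix 1: stop in-flight executions before any row lock is taken.
    for (wid <- wids; eid <- exec.activeExecutions(wid)) exec.abort(eid)

    // Fix 2: bound the delete so a stuck lock fails instead of hanging.
    store.inTransaction(Timeouts) {
      wids.foreach(store.deleteById)
    }
  }
}
```

Validating before aborting is a design choice made for this sketch; the existing code performs the ownership check inside the transaction, and a real fix would likely need to re-check there as well, since a workflow can change between the two steps.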
