carloea2 opened a new issue, #5140:
URL: https://github.com/apache/texera/issues/5140

   ### What happened?
   
   `WorkflowResource.deleteWorkflow` opens a JDBC transaction and 
CASCADE-deletes the workflow row without first stopping any in-flight 
executions that target the same workflow. While the `ComputingUnitWorker` keeps 
writing to FK-child tables (`workflow_view_count`, `workflow_executions`, 
`workflow_user_likes`, …), the CASCADE check blocks on a row-level lock and 
never returns. Every subsequent `createWorkflow` / `deleteWorkflow` / 
view-count POST piles up behind it on the same lock.
   
   From the user's perspective the Workflows page becomes fully unresponsive: 
uploads hang with no error, deletes hang, and the webpack-dev-server proxy 
eventually emits `ECONNRESET` then `ECONNREFUSED`. Recovery requires restarting 
the JVMs.
   
   Problematic code at 
[`WorkflowResource.scala:631`](https://github.com/apache/texera/blob/main/amber/src/main/scala/org/apache/texera/web/resource/dashboard/user/workflow/WorkflowResource.scala#L631):
   
   ```scala
   context.transaction { _ =>
     for (wid <- workflowIDs.wids) {
       if (workflowOfUserExists(wid, user.getUid)) {
         workflowDao.deleteById(wid)
       } else {
         throw new BadRequestException("The workflow does not exist.")
       }
     }
   }
   ```
   
   There is no active-execution check, no `lock_timeout` or `statement_timeout`, 
and no error path, so the request thread sits in `executeQuery` indefinitely.
   
   ### Suggested fixes (in order of preference)
   
   1. **Cancel running executions before deleting.** In `deleteWorkflow`, look 
up active executions of the workflow via `ExecutionResultService` / 
`WorkflowExecutionsResource` and abort them before opening the delete 
transaction. Deleting a workflow should imply "stop everything that depends on 
it".
   2. **Bound the delete transaction.** `SET LOCAL lock_timeout = '10s'; SET 
LOCAL statement_timeout = '30s';` at the start of the transaction so a hung 
child-table lock surfaces as a 5xx instead of freezing the entire workflow API.
   3. **Independently, harden `HubResource.postView`.** It blindly upserts into 
`workflow_view_count` for whatever wid the dashboard sends; if that wid was 
just deleted in another tab, the FK violation surfaces as a 500 and stale tabs 
keep retrying, which worsens the contention. An existence check 
(`context.fetchExists(BaseEntityTable(entityType).table, 
idColumn.eq(entityID))`) before the upsert turns those requests into a no-op 
`return 0`.
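
Fix 2 can be illustrated without any Texera or jOOQ dependency. The following is a minimal sketch in plain JDBC, assuming a PostgreSQL connection; `BoundedTx`, `timeoutPrelude`, `inBoundedTx`, and `isTimeout` are hypothetical names, not existing Texera code. The default timeouts are the values suggested above. PostgreSQL reports a lock timeout as SQLSTATE `55P03` and a statement timeout as `57014`, which a resource method could map to a 5xx response.

```scala
import java.sql.{Connection, SQLException}

// Hypothetical helper, not part of Texera.
object BoundedTx {
  // Statements that cap lock waits and statement runtime for the current
  // transaction only (SET LOCAL is reset at commit or rollback).
  def timeoutPrelude(lockTimeoutMs: Int, statementTimeoutMs: Int): Seq[String] =
    Seq(
      s"SET LOCAL lock_timeout = '${lockTimeoutMs}ms'",
      s"SET LOCAL statement_timeout = '${statementTimeoutMs}ms'"
    )

  // True for PostgreSQL lock_not_available (55P03) and query_canceled (57014).
  def isTimeout(e: SQLException): Boolean =
    Set("55P03", "57014").contains(e.getSQLState)

  // Runs `body` in one transaction with both timeouts applied.
  // Commits on success, rolls back and rethrows on any failure.
  def inBoundedTx[A](
      conn: Connection,
      lockTimeoutMs: Int = 10000,
      statementTimeoutMs: Int = 30000
  )(body: Connection => A): A = {
    val previousAutoCommit = conn.getAutoCommit
    conn.setAutoCommit(false) // SET LOCAL only takes effect inside a transaction
    try {
      val st = conn.createStatement()
      try {
        timeoutPrelude(lockTimeoutMs, statementTimeoutMs).foreach(sql => st.execute(sql))
      } finally {
        st.close()
      }
      val result = body(conn)
      conn.commit()
      result
    } catch {
      case e: Throwable =>
        conn.rollback()
        throw e
    } finally {
      conn.setAutoCommit(previousAutoCommit)
    }
  }
}
```

With jOOQ, the same two statements could be issued at the top of the existing `context.transaction` block, for example through `DSL.using(configuration).execute(...)`, so that the delete runs under the same limits. That assumes the DAO call is bound to the transaction's configuration, which the snippet in this issue does not show.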
   
   ### Workaround
   
   Kill the Texera JVMs (`TexeraWebApplication`, `ComputingUnitWorker`, 
`ComputingUnitMaster`), restart them, then reload the Workflows page to clear 
any cached stale wids being POSTed for view-count.
   
   ### How to reproduce?
   
   1. Open a workflow and start an execution that keeps the worker busy for 
more than 10 s (e.g. an iris ML pipeline).
   2. While the execution is still running, navigate to 
`/dashboard/user/workflow` and delete that workflow from the row's delete 
action.
   3. Try to upload another workflow (or delete a second one) from the same 
page.
   
   **Expected:** upload completes; delete completes once the execution is 
canceled or finishes.
   **Observed:** delete hangs forever, upload hangs forever, and every 
subsequent workflow-table write piles up behind the same lock. After enough 
pileup the JVM closes connections under socket pressure and the dev-server 
proxy starts emitting `ECONNRESET`, then `ECONNREFUSED`.
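
The expected ordering (cancel first, delete second) can be sketched as follows. This is only an illustration of suggested fix 1: `ExecutionControl`, `activeExecutionIds`, `abort`, and `cancelThenDelete` are hypothetical names standing in for whatever Texera service tracks executions, and the delete itself is passed in as a function.

```scala
// Hypothetical sketch, not part of Texera.
object DeleteWithCancel {
  // Stand-in for the service that knows about running executions.
  trait ExecutionControl {
    def activeExecutionIds(wid: Int): Seq[Long]
    def abort(eid: Long): Unit
  }

  // Aborts every active execution of `wid`, then runs `delete`.
  // Returns the ids of the executions that were aborted.
  def cancelThenDelete(wid: Int, executions: ExecutionControl)(delete: Int => Unit): Seq[Long] = {
    val active = executions.activeExecutionIds(wid)
    active.foreach(eid => executions.abort(eid))

    // Re-check before deleting: do not open the delete transaction
    // while anything is still writing to the child tables.
    val remaining = executions.activeExecutionIds(wid)
    if (remaining.nonEmpty) {
      throw new IllegalStateException(
        s"workflow $wid still has active executions: ${remaining.mkString(", ")}"
      )
    }

    delete(wid)
    active
  }
}
```

In `deleteWorkflow`, the `delete` argument would be the existing transaction body, so the transaction is only opened once no execution is left running.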
   
   ### Branch
   
   main
   
   ### Commit Hash (Optional)
   
   _No response_
   
   ### What browsers are you seeing the problem on?
   
   Not browser-specific — reproduces on any client; the freeze is server-side.
   
   ### Relevant log output
   
   ```shell
   # Thread dump of TexeraWebApplication while the API is frozen
   # Problem: one open delete transaction holding the row lock,
   # every other workflow-table write queued behind it.
   
   "dw-NN" #N daemon (waiting on Postgres response)
      at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:137)
      at org.jooq.tools.jdbc.DefaultPreparedStatement.executeQuery(DefaultPreparedStatement.java:104)
      at org.jooq.impl.AbstractDMLQuery.executeReturningQuery(AbstractDMLQuery.java:1249)
      at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:428)
      at org.jooq.impl.AbstractDMLQuery.execute(AbstractDMLQuery.java:961)
      at org.jooq.impl.DAOImpl.deleteById(DAOImpl.java:284)
      at org.apache.texera.web.resource.dashboard.user.workflow.WorkflowResource.$anonfun$deleteWorkflow$3(WorkflowResource.scala:634)
      at org.jooq.impl.DefaultDSLContext.lambda$transaction$5(DefaultDSLContext.java:612)
      at org.jooq.impl.DefaultDSLContext.transaction(DefaultDSLContext.java:611)
      at org.apache.texera.web.resource.dashboard.user.workflow.WorkflowResource.deleteWorkflow(WorkflowResource.scala:631)
   
   "dw-MM" / "dw-OO" / "dw-PP" ... (queued behind the open transaction)
      at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:137)
      at org.jooq.impl.AbstractDMLQuery.execute(AbstractDMLQuery.java:1074)
      at org.jooq.impl.TableRecordImpl.storeInsert0(TableRecordImpl.java:193)
      at org.jooq.impl.TableRecordImpl.insert(TableRecordImpl.java:140)
      at org.jooq.impl.DAOImpl.insert(DAOImpl.java:156)
      at org.apache.texera.web.resource.dashboard.user.workflow.WorkflowResource$.insertWorkflow(WorkflowResource.scala:89)
      at org.apache.texera.web.resource.dashboard.user.workflow.WorkflowResource.createWorkflow(WorkflowResource.scala:573)
   
   # Frontend webpack-dev-server proxy view of the same incident:
   [HPM] Error occurred while proxying request localhost:4200/api/workflow/create to http://localhost:8080/ [ECONNRESET]
   [HPM] Error occurred while proxying request localhost:4200/api/workflow/delete to http://localhost:8080/ [ECONNRESET]
   ... (many lines later, after enough socket exhaustion)
   [HPM] Error occurred while proxying request localhost:4200/api/workflow/create to http://localhost:8080/ [ECONNREFUSED]
   
   # Foreign-key violation that compounds the contention while the lock is held,
   # fired by stale dashboard tabs POSTing /api/hub/view for the deleted wid:
   org.jooq.exception.DataAccessException: SQL [insert into "texera_db"."workflow_view_count" ("wid", "view_count") values (?, ?)
     on conflict ("wid") do update set "view_count" = ("texera_db"."workflow_view_count"."view_count" + ?)
     returning "texera_db"."workflow_view_count"."view_count"];
   ERROR: insert or update on table "workflow_view_count" violates foreign key constraint "workflow_view_count_wid_fkey"
   Detail: Key (wid)=(173) is not present in table "workflow".
       at org.apache.texera.web.resource.dashboard.hub.HubResource.postView(HubResource.scala:401)
   ```
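
The foreign-key violation at the end of the log is the case that suggested fix 3 targets. Below is a minimal sketch of an existence check followed by the upsert, written in plain JDBC rather than jOOQ to stay dependency-free. `ViewCount`, `recordView`, and `isForeignKeyViolation` are hypothetical names; the table and column names are taken from the log, and a fixed increment of 1 on an auto-commit connection is an assumption. Because the workflow can still be deleted between the check and the upsert, the sketch also treats SQLSTATE `23503` (foreign-key violation) as a no-op.

```scala
import java.sql.{Connection, SQLException}

// Hypothetical sketch, not part of Texera.
object ViewCount {
  // Names follow the log output in this issue; adjust to the real schema.
  val ExistsSql: String =
    "select exists (select 1 from texera_db.workflow where wid = ?)"

  val UpsertSql: String =
    "insert into texera_db.workflow_view_count (wid, view_count) values (?, 1) " +
      "on conflict (wid) do update set view_count = workflow_view_count.view_count + 1 " +
      "returning view_count"

  // PostgreSQL reports foreign_key_violation as SQLSTATE 23503.
  def isForeignKeyViolation(e: SQLException): Boolean =
    "23503" == e.getSQLState

  // Returns the new view count, or 0 when the workflow no longer exists.
  def recordView(conn: Connection, wid: Int): Int = {
    val exists = {
      val ps = conn.prepareStatement(ExistsSql)
      try {
        ps.setInt(1, wid)
        val rs = ps.executeQuery()
        rs.next() && rs.getBoolean(1)
      } finally {
        ps.close()
      }
    }

    if (!exists) 0
    else {
      val ps = conn.prepareStatement(UpsertSql)
      try {
        ps.setInt(1, wid)
        val rs = ps.executeQuery()
        if (rs.next()) rs.getInt(1) else 0
      } catch {
        // The workflow was deleted between the check and the upsert.
        case e: SQLException if isForeignKeyViolation(e) => 0
      } finally {
        ps.close()
      }
    }
  }
}
```

The check removes most spurious 500s from stale tabs, while the `23503` handler covers the remaining race window.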


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
