kunwp1 opened a new pull request, #5280:
URL: https://github.com/apache/texera/pull/5280
### What changes were proposed in this PR?
Large binaries were stored in the shared `texera-large-binaries` bucket
under flat keys `objects/{timestamp}/{uuid}` with no execution id, and
`clearExecutionResources(eid)` deleted **all** of them via
`LargeBinaryManager.deleteAllObjects()`. Any cleanup for one execution
therefore erased every other execution's (and user's) large binaries — a
tenant-isolation violation and global data loss.
This PR namespaces every large binary by its execution id and scopes
deletion:
- Object keys are now `objects/{eid}/{uuid}` on both the JVM and Python
workers.
- The execution id is carried to workers via a new
`InitializeExecutorRequest.executionId` proto field (compiled for both scalapb
and betterproto), injected by the system at executor init — the user-facing
`largebinary()` / `new LargeBinary()` APIs are unchanged.
- Cleanup uses the new `LargeBinaryManager.deleteByExecution(eid)` (prefix
delete of `objects/{eid}/`). Both engines share the bucket and key shape, so
this single JVM-side delete removes binaries created by both.
- The bucket-wide `deleteAllObjects()` is removed. A second occurrence of
the same global-wipe bug in `WorkflowResource.deleteWorkflow` is also fixed
(now scoped to the affected workflows' executions).
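As an illustration of the key shape and scoped delete described above, here is a minimal in-memory sketch. It is not the Texera implementation: `InMemoryBucket` and its methods are hypothetical stand-ins for the bucket and for `LargeBinaryManager.deleteByExecution(eid)`, and the execution ids are made up.

```python
import uuid


class InMemoryBucket:
    """Toy stand-in for the shared bucket: maps object key -> bytes."""

    def __init__(self):
        self.objects = {}

    def create(self, eid: int, data: bytes) -> str:
        # Key shape from the PR description: objects/{eid}/{uuid}
        key = f"objects/{eid}/{uuid.uuid4()}"
        self.objects[key] = data
        return key

    def delete_by_execution(self, eid: int) -> int:
        # The trailing slash matters: without it, the prefix for
        # execution 1 would also match keys of execution 10.
        prefix = f"objects/{eid}/"
        doomed = [k for k in self.objects if k.startswith(prefix)]
        for k in doomed:
            del self.objects[k]
        return len(doomed)


bucket = InMemoryBucket()
key_a = bucket.create(1, b"from execution 1")
key_b = bucket.create(10, b"from execution 10")

# Cleaning up execution 1 must leave execution 10 untouched.
removed = bucket.delete_by_execution(1)
```

The same check (create under two executions, delete one, assert the other survives) is what the isolation test listed under testing exercises against the real storage backend.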
Out of scope: pre-existing objects under the old `objects/{timestamp}/...`
scheme are left untouched. Note: `S3StorageClient.deleteDirectory` deletes at
most the first 1000 objects under a prefix (a pre-existing limitation), so
executions producing more than 1000 binaries could leave orphans. This is left
for a follow-up.
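One possible shape for that follow-up is to repeat the list-and-delete step until the prefix is empty. The sketch below is only an assumption about how such a loop could look. It does not reproduce `S3StorageClient`: `delete_prefix`, `list_page`, `delete_keys`, and `PAGE_LIMIT` are hypothetical names, and the store is a plain in-memory set.

```python
from typing import Callable, List

PAGE_LIMIT = 1000  # assumed per-call listing cap, matching the limit noted above


def delete_prefix(
    list_page: Callable[[str], List[str]],
    delete_keys: Callable[[List[str]], None],
    prefix: str,
) -> int:
    """Delete every key under prefix, one page at a time.

    Assumes delete_keys really removes the keys it is given;
    otherwise the loop would never see an empty page.
    """
    total = 0
    while True:
        page = list_page(prefix)
        if not page:
            return total
        delete_keys(page)
        total += len(page)


# In-memory stand-in for a bucket: 2500 objects for execution 7, one for 8.
store = {f"objects/7/{i}" for i in range(2500)} | {"objects/8/0"}


def list_page(prefix: str) -> List[str]:
    # Returns at most PAGE_LIMIT matching keys per call.
    return sorted(k for k in store if k.startswith(prefix))[:PAGE_LIMIT]


def delete_keys(keys: List[str]) -> None:
    store.difference_update(keys)


deleted = delete_prefix(list_page, delete_keys, "objects/7/")
```

A single list-and-delete pass would have stopped after the first page and left 1500 objects behind. The loop keeps going until a listing returns nothing.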
### Any related issues, documentation, discussions?
Closes #4123.
### How was this PR tested?
- `LargeBinaryManagerSpec` (workflow-core), incl. a new isolation test:
binaries created under two executions, delete one, assert the other survives —
33/33 locally (needs Docker/MinIO).
- `WorkerSpec` (amber) — 3/3 (worker executor-init path carrying the new
field).
- Python `test_large_binary_manager.py` — `create()` yields
`objects/{eid}/...` and raises with no execution context; passing, ruff clean.
- Full `sbt compile` + test-compile of touched modules.
- Not yet done: a live end-to-end run (File Scan `LARGE_BINARY` + Python UDF
`largebinary()`) to confirm `objects/{eid}/...` keys and scoped cleanup across
both engines.
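The Python-side behaviour checked by `test_large_binary_manager.py` (keys of the form `objects/{eid}/...`, and an error when no execution context is set) can be pictured with the sketch below. The class and method names are hypothetical and are not taken from the Texera code base. In the PR, the execution id is injected by the system at executor init rather than set by user code.

```python
import uuid
from typing import Optional


class ExecutionContext:
    """Hypothetical holder for the execution id injected at executor init."""

    def __init__(self) -> None:
        self.execution_id: Optional[int] = None

    def new_object_key(self) -> str:
        # Refuse to mint a key that is not namespaced by an execution id.
        if self.execution_id is None:
            raise RuntimeError("no execution context: execution id is not set")
        return f"objects/{self.execution_id}/{uuid.uuid4()}"


ctx = ExecutionContext()

# Before initialization, key creation fails instead of falling back
# to an un-namespaced key.
try:
    ctx.new_object_key()
    raised = False
except RuntimeError:
    raised = True

# After the (simulated) executor init, keys carry the execution id.
ctx.execution_id = 42
key = ctx.new_object_key()
```

Failing loudly here is a design choice: an object written without an execution id would not be found by a prefix delete on `objects/{eid}/`.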
### Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), models Claude Opus 4.7 and Claude
Sonnet 4.6