eugenegujing opened a new pull request, #5643:
URL: https://github.com/apache/texera/pull/5643

   ### What changes were proposed in this PR?
   
   Adds a configurable scheduled cleanup job to file-service that reclaims the 
storage held by uploads that were started (or staged) but never committed.
   
   **Problem.** Files uploaded to a dataset sit in LakeFS staging until the 
user commits a version. LakeFS GC only manages committed data, and Texera has 
no background cleanup, so abandoned uploads accumulate forever in MinIO/S3, 
along with their `DATASET_UPLOAD_SESSION` rows. The only existing mitigations 
are the 24h presigned-URL expiry and the lazy re-init cleanup in 
`initMultipartUpload`, neither of which removes staged data for datasets nobody 
touches again.
   
   **Design.**
   
   ```
   FileService.run() ──(if enabled)──> Managed StagedFileCleanupJob
      every interval-minutes, on one daemon thread:
      ├─ path 1: DATASET_UPLOAD_SESSION rows with created_at < now − retention
      │          → abort LakeFS multipart, then delete row (parts cascade)
      ├─ path 2: per dataset repo, staged objects with mtime < now − retention
      │          → branch-reset via existing LakeFS API
      └─ audit log: sessions deleted / objects reset / errors
   ```
   
   Safety properties (each independently enforced):
   
   | Guard | Mechanism |
   |---|---|
   | Committed data is untouchable | the job only calls staging-scoped APIs 
(`diffBranch`, branch reset, multipart abort) — committed objects never appear 
in its inputs |
   | No race with an in-flight upload | staged objects whose path belongs to a 
non-expired session are excluded (exact complement of the deletion cutoff); 
default retention 72h is far above the 24h 
`PHYSICAL_ADDRESS_EXPIRATION_TIME_HRS`, so deleted sessions are already 
non-resumable |
   | Crash-safe ordering | multipart abort happens before the row delete; a 
crash in between leaves the row, which is retried next round (abort 404 = 
already aborted) |
   | Idempotent & isolated | every action is safely repeatable; per-item 
failures are logged, counted in the report, and never abort the batch; staged 
deletions and orphaned repos (dataset row without a live LakeFS repo) are 
skipped instead of erroring every round |
   | Operator control | `storage.cleanup.enabled` / `retention-hours` / 
`interval-minutes`, each overridable via `STORAGE_CLEANUP_*` env vars; the 
whole feature can be disabled |
   
   Also in this PR: file-service test suites now fork one JVM per suite 
(`build.sbt`). Each testcontainers-based suite boots its own 
LakeFS/MinIO/Postgres stack and mutates JVM-wide singletons (`StorageConfig` 
endpoints), so sharing one JVM broke whichever container suite ran second; 
forking isolates the Docker client and singleton state. `LakeFSStorageClient` 
gains one helper (`getStagedObjectMtime`) because the diff API carries no 
timestamps.
   
   Feedback welcome on the default retention (72h ≈ covers a weekend; it is a 
config default, not a hard-coded value).
   
   > **Note:** since this change automatically deletes user-uploaded data, it 
touches a fairly high-risk area. I have been deliberately conservative 
(staging-scoped APIs only, generous retention, kill switch, audit logging), but 
I'm happy to discuss any part of the design — including making the feature 
opt-in (`enabled = false` by default) if that feels safer for a first release.
   
   ### Any related issues, documentation, discussions?
   
   Closes #3681.
   
   ### How was this PR tested?
   
   New `StagedFileCleanupJobSpec` (11 tests, real LakeFS/MinIO/Postgres via 
testcontainers + embedded Postgres for the DAO): expired session deleted with 
part-row cascade; fresh session survives; expired staged object reset while a 
committed object survives; fresh staged object survives; idempotent second run 
reports zeros; active upload and its staged file untouched; mixed-batch exact 
counts; ±1min boundary around the cutoff; out-of-band-aborted multipart treated 
as already-aborted (404 rule); per-item failure counted without aborting the 
batch.
   
   ```
   sbt "FileService/test"
   ```
   
   111 tests, 4 suites, all passed (includes the pre-existing 
`DatasetResourceSpec` 93 tests — no regression). Also ran `sbt scalafixAll` and 
`sbt scalafmtAll` (clean). Requires Docker for the testcontainers suites.
   
   ### Was this PR authored or co-authored using generative AI tooling?
   
   Co-authored by: Claude Code (Claude Fable 5)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to