The GitHub Actions job "Benchmarks PR Comment" on texera.git/main has failed.
Run started by GitHub user SarahAsad23 (triggered by SarahAsad23).

Head commit for run:
227cbd73960afbcaa734b30f3ac108dc669324f3 / Kunwoo (Chris) <[email protected]>
fix(workflow-core): paginate S3 deleteDirectory deletions (#5569)

### What changes were proposed in this PR?

`S3StorageClient.deleteDirectory` listed objects with a single
`listObjectsV2` call and issued one `deleteObjects` batch. Both S3 APIs
cap at 1000 keys per call, so for any prefix holding more than 1000
objects only the first 1000 were deleted and the rest remained, causing
a storage leak. This affects dataset deletion (`DatasetResource`) and
per-execution cleanup (`LargeBinaryManager`), either of which can exceed
1000 objects under one prefix.

This PR:
- Lists via `listObjectsV2Paginator`, which follows the continuation
token across all pages, and deletes in batches of at most 1000 keys.
Keys are streamed so memory stays bounded to a single batch.
- Inspects each `DeleteObjects` response and throws if any key failed.
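
The two bullets above can be illustrated with a minimal sketch. This is not the Scala code from the PR: `InMemoryStore`, `list_pages`, and `delete_batch` are hypothetical stand-ins for the S3 client, written only to show the paginate, batch, and check-for-failures pattern while enforcing the same 1000-key cap.

```python
from itertools import islice
from typing import Iterable, Iterator, List

MAX_KEYS_PER_CALL = 1000  # S3 cap for both listing and batch deletion


class InMemoryStore:
    """Hypothetical stand-in for an S3 bucket; enforces the 1000-key caps."""

    def __init__(self, keys: Iterable[str]) -> None:
        self.keys = set(keys)
        self.delete_calls = 0

    def list_pages(self, prefix: str) -> Iterator[List[str]]:
        """Yield pages of at most 1000 keys, as a paginator would."""
        matching = sorted(k for k in self.keys if k.startswith(prefix))
        for start in range(0, len(matching), MAX_KEYS_PER_CALL):
            yield matching[start:start + MAX_KEYS_PER_CALL]

    def delete_batch(self, batch: List[str]) -> List[str]:
        """Delete one batch and return the keys that failed (none here)."""
        if len(batch) > MAX_KEYS_PER_CALL:
            raise ValueError("batch exceeds 1000 keys")
        self.delete_calls += 1
        self.keys.difference_update(batch)
        return []


def delete_directory(store: InMemoryStore, prefix: str) -> int:
    """Delete every key under prefix, one bounded batch at a time."""
    # Stream keys lazily so only one batch is materialized at once.
    key_stream = (key for page in store.list_pages(prefix) for key in page)
    deleted = 0
    while True:
        batch = list(islice(key_stream, MAX_KEYS_PER_CALL))
        if not batch:
            return deleted
        failed = store.delete_batch(batch)
        if failed:
            # Surface partial failures instead of silently leaking objects.
            raise RuntimeError(f"failed to delete {len(failed)} keys")
        deleted += len(batch)


# 1100 objects under one prefix, plus one object that must survive.
store = InMemoryStore(
    [f"datasets/42/file_{i}.txt" for i in range(1, 1101)] + ["datasets/7/keep.txt"]
)
removed = delete_directory(store, "datasets/42/")
print(removed, store.delete_calls, sorted(store.keys))
# prints: 1100 2 ['datasets/7/keep.txt']
```

With 1100 keys the sketch issues two delete batches (1000 and 100), which is the case a single list-and-delete call would cut short at 1000.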

### Any related issues, documentation, discussions?

Closes #5281

### How was this PR tested?

1. Create more than 1000 files:
`for i in {1..1100}; do printf 'x' > "file_$i.txt"; done`
2. Upload them to a dataset. (The frontend has a memory issue when all
1100 files are uploaded at once, so upload them in batches.)
3. Delete the dataset.
4. Check in the MinIO console that all the files have been removed.
(Before this fix, some files remained.)

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

Report URL: https://github.com/apache/texera/actions/runs/27439506011

With regards,
GitHub Actions via GitBox
