The GitHub Actions job "Benchmarks PR Comment" on texera.git/main has failed. Run started by GitHub user SarahAsad23 (triggered by SarahAsad23).
Head commit for run:
227cbd73960afbcaa734b30f3ac108dc669324f3 / Kunwoo (Chris) <[email protected]>
fix(workflow-core): paginate S3 deleteDirectory deletions (#5569)

### What changes were proposed in this PR?

`S3StorageClient.deleteDirectory` listed objects with a single `listObjectsV2` call and issued one `deleteObjects` batch. Both S3 APIs cap at 1000 keys per call, so for any prefix holding more than 1000 objects only the first 1000 were deleted and the rest were left behind, causing a storage leak. This affects dataset deletion (`DatasetResource`) and per-execution cleanup (`LargeBinaryManager`), either of which can exceed 1000 objects under one prefix.

This PR:
- Lists via `listObjectsV2Paginator`, which follows the continuation token across all pages, and deletes in batches of at most 1000 keys. Keys are streamed so memory stays bounded to a single batch.
- Inspects each `DeleteObjects` response and throws if any key failed.

### Any related issues, documentation, discussions?

Closes #5281

### How was this PR tested?

1. Create more than 1000 files: `for i in {1..1100}; do printf 'x' > "file_$i.txt"; done`
2. Upload them to a dataset. (There is a frontend memory issue when all 1100 files are uploaded at once, so upload them in batches.)
3. Delete the dataset.
4. Check in the MinIO console that all the files have been removed. (Before this fix, some files remained.)

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

Report URL: https://github.com/apache/texera/actions/runs/27439506011

With regards,
GitHub Actions via GitBox
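
For readers who want to see the paginated-delete pattern from the head commit message in isolation, here is a minimal Python sketch. It is not the Texera code and does not use the AWS SDK: `FakeBucket`, `list_page`, `delete_batch`, `iter_keys`, and `delete_directory` are hypothetical names, and the in-memory bucket only imitates the 1000-key cap that the commit message describes for both S3 calls.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Tuple

MAX_KEYS_PER_CALL = 1000  # per-call cap described in the commit message


class FakeBucket:
    """Hypothetical in-memory stand-in for an S3 bucket (assumption, not the AWS SDK)."""

    def __init__(self, keys: Iterable[str]) -> None:
        self.keys = set(keys)

    def list_page(self, prefix: str, token: Optional[str] = None) -> Tuple[List[str], Optional[str]]:
        """Return at most MAX_KEYS_PER_CALL keys under prefix, plus a continuation token."""
        matching = sorted(k for k in self.keys if k.startswith(prefix))
        if token is not None:
            # Resume strictly after the last key of the previous page.
            matching = [k for k in matching if k > token]
        page = matching[:MAX_KEYS_PER_CALL]
        next_token = page[-1] if len(matching) > MAX_KEYS_PER_CALL else None
        return page, next_token

    def delete_batch(self, keys: List[str]) -> List[str]:
        """Delete up to MAX_KEYS_PER_CALL keys and return the keys that failed."""
        if len(keys) > MAX_KEYS_PER_CALL:
            raise ValueError("too many keys in one delete call")
        self.keys.difference_update(keys)
        return []  # this fake never fails; a real response may report per-key errors


def iter_keys(bucket: FakeBucket, prefix: str) -> Iterator[str]:
    """Stream keys page by page, following the continuation token until it runs out."""
    token: Optional[str] = None
    while True:
        page, token = bucket.list_page(prefix, token)
        yield from page
        if token is None:
            return


def delete_directory(bucket: FakeBucket, prefix: str) -> int:
    """Delete every key under prefix in batches; raise if any key fails to delete."""
    keys = iter_keys(bucket, prefix)
    deleted = 0
    while True:
        # Only one batch is held in memory at a time.
        batch = list(islice(keys, MAX_KEYS_PER_CALL))
        if not batch:
            return deleted
        failed = bucket.delete_batch(batch)
        if failed:
            raise RuntimeError(f"failed to delete {len(failed)} keys under {prefix!r}")
        deleted += len(batch)
```

Calling `delete_directory(bucket, "ds/")` on a `FakeBucket` holding 1100 keys under `ds/` removes all of them in two batches, whereas a single `list_page` plus a single `delete_batch` would stop at the first 1000, which is the leak the commit message describes.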
