Hi all,
I'd like to propose an API and corresponding implementation for
(long-running) object store operations [1].
It provides a CPU- and heap-friendly API and implementation for working
against object stores, and it is built so that its functionality is
"pluggable". What I mean is this (Java pseudo code):
---
FileOperations fileOps =
    fileOperationsFactory.createFileOperations(fileIoInstance);
Stream<FileSpec> allIcebergTableFiles =
    fileOps.identifyIcebergTableFiles(metadataLocation);
PurgeStats purged = fileOps.purge(allIcebergTableFiles);
// or simpler:
PurgeStats purged = fileOps.purgeIcebergTable(metadataLocation);
// or similarly for Iceberg views
PurgeStats purged = fileOps.purgeIcebergView(metadataLocation);
// or to purge all files underneath a prefix
PurgeStats purged = fileOps.purge(fileOps.findFiles(prefix));
---
Not mentioned in the pseudo code is the ability to rate-limit the
number of purged files or batch deletions, and to configure the
deletion batch size.
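To give a rough idea of how a batch size and a rate limit can interact
with a lazily consumed stream of files, here is a small self-contained
sketch in plain Java. It is not the code or the API from the PR: the
class BatchedPurgeSketch, the method purgeInBatches and its parameters
are hypothetical names made up for illustration, and plain String paths
stand in for FileSpec. The actual option names are in the README.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Hypothetical illustration only; none of these names come from the PR.
final class BatchedPurgeSketch {

  /**
   * Pulls paths lazily from the stream, hands them to deleteBatch in groups of
   * at most batchSize, and starts at most maxBatchesPerSecond batches per
   * second (a value <= 0 disables throttling). Only one batch is held on the
   * heap at a time. Returns the number of paths handed to deleteBatch.
   */
  static long purgeInBatches(
      Stream<String> paths,
      int batchSize,
      double maxBatchesPerSecond,
      Consumer<List<String>> deleteBatch)
      throws InterruptedException {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    long minNanosBetweenBatches =
        maxBatchesPerSecond > 0 ? (long) (1_000_000_000L / maxBatchesPerSecond) : 0L;
    long nextAllowed = System.nanoTime();
    long total = 0;
    List<String> batch = new ArrayList<>(batchSize);
    try (Stream<String> s = paths) {
      Iterator<String> it = s.iterator();
      while (it.hasNext()) {
        batch.add(it.next());
        // Flush when the batch is full or the stream is exhausted.
        if (batch.size() == batchSize || !it.hasNext()) {
          long wait = nextAllowed - System.nanoTime();
          if (wait > 0) {
            TimeUnit.NANOSECONDS.sleep(wait); // simple fixed-interval throttle
          }
          deleteBatch.accept(List.copyOf(batch));
          total += batch.size();
          batch.clear();
          nextAllowed = System.nanoTime() + minNanosBetweenBatches;
        }
      }
    }
    return total;
  }

  public static void main(String[] args) throws InterruptedException {
    // Stand-in for a real bulk delete call: just print each batch.
    long purged =
        purgeInBatches(
            Stream.of("a", "b", "c", "d", "e"), 2, 100.0, b -> System.out.println("delete " + b));
    System.out.println("purged " + purged + " files");
  }
}
```

A real implementation would issue a bulk delete against the object
store in place of the Consumer and would also have to report failures;
the sketch only shows the batching and pacing.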
The PR already contains tests against an on-heap object store mock and
integration tests against S3/GCS/Azure emulators.
More details can be found in the README [2] included in the PR and of
course in the code in the PR.
Robert
[1] https://github.com/apache/polaris/pull/3256
[2] https://github.com/snazy/polaris/blob/obj-store-ops/storage/files/README.md