I believe I now understand how to leverage the metadata tables to remove orphan files.
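At its core, the metadata-table approach boils down to a set difference: everything referenced by the table's metadata (data files, delete files, metadata files) is live, and anything else under the table location is an orphan candidate. A minimal sketch with stubbed-in inputs; in PyIceberg the referenced paths would come from the metadata tables and the listed paths from storage, and `find_orphans` is a hypothetical helper, not an existing API:

```python
# Orphan detection as a set difference. All inputs here are stubbed lists;
# real code would populate them from the ALL_FILES / DELETE_FILES metadata
# tables and from a storage listing of the table location.

def find_orphans(listed_paths, data_files, delete_files, metadata_files):
    """Return paths present in storage but unreferenced by table metadata."""
    referenced = set(data_files) | set(delete_files) | set(metadata_files)
    return sorted(set(listed_paths) - referenced)

# Stubbed example: c.parquet is not referenced by any metadata table.
listed = [
    "s3://bucket/tbl/data/a.parquet",
    "s3://bucket/tbl/data/b.parquet",
    "s3://bucket/tbl/data/c.parquet",
    "s3://bucket/tbl/data/del-1.parquet",
]
orphans = find_orphans(
    listed,
    data_files=["s3://bucket/tbl/data/a.parquet",
                "s3://bucket/tbl/data/b.parquet"],
    delete_files=["s3://bucket/tbl/data/del-1.parquet"],
    metadata_files=[],
)
```

The hard part in practice is not this comparison but producing `listed` efficiently, which is what the rest of this thread is about.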
I didn't know that the DELETE_FILES metadata table existed, so I believe this is what Fokko meant. Fokko, was your idea to use the DELETE_FILES and ALL_FILES metadata tables? Do you know why these metadata tables are not used in the Spark implementation?

Javadoc reference: https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/MetadataTableType.html

André Anastácio

On Tuesday, August 13th, 2024 at 7:34 AM, Steve Loughran <ste...@cloudera.com.INVALID> wrote:

> On Tue, 13 Aug 2024 at 03:50, Xuanwo <xua...@apache.org> wrote:
>
>> Hi, André
>>
>> Thanks a lot for starting this thread.
>>
>> List operations on storage services are expensive and slow. That's why Iceberg is designed to store metadata in files and avoid using list operations in FileIO. However, `orphan file removal` or `garbage cleanup` are special tasks that do require scanning the entire storage location and comparing it with our existing metadata files.
>
> Not quite.
>
> Listing via treewalking is awful because it has high latency on "pure" object stores with client-side mimicked directories, and you pay per LIST call.
>
> In S3, as implemented by S3FileIO and HadoopFileIO, the SupportsPrefixOperations.listPrefix() operation is independent of directory structure and instead just O(files). Results come back in pages of about 1000 entries. If you have versioned buckets you'll get fewer entries per page when there are many overwritten/tombstoned objects. To compensate for this you should really schedule processing of the results in threads separate from the one doing the listing. This is also beneficial for classic tree walks on high-latency stores with "real" directories, including Azure ADLS Gen2.
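The overlap Steve describes (processing one page of results while the next LIST call is in flight) can be sketched with a bounded queue and a worker thread. `fake_list_prefix` is a stand-in for the real paged LIST API, not an actual FileIO call:

```python
import queue
import threading

PAGE_SIZE = 1000  # S3 ListObjectsV2 returns at most 1000 keys per page

def fake_list_prefix(prefix, total=2500):
    """Stand-in for a paged LIST call: yields pages of up to PAGE_SIZE keys."""
    keys = [f"{prefix}/file-{i}" for i in range(total)]
    for start in range(0, total, PAGE_SIZE):
        yield keys[start:start + PAGE_SIZE]

def list_and_process(prefix, process_page):
    """Fetch pages on this thread while a worker consumes earlier pages."""
    pages = queue.Queue(maxsize=2)  # bounded so listing can't run far ahead

    def worker():
        while True:
            page = pages.get()
            if page is None:  # sentinel: no more pages
                break
            process_page(page)

    t = threading.Thread(target=worker)
    t.start()
    for page in fake_list_prefix(prefix):
        pages.put(page)  # blocks if the worker falls behind
    pages.put(None)
    t.join()

seen = []
list_and_process("s3://bucket/tbl", seen.extend)
```

The bounded queue matters: it keeps memory flat even over millions of keys, while still letting the listing thread issue the next LIST while the previous page is being compared against metadata.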
>> I believe that if there is a way to ensure all engines use List operations correctly (don't abuse list!), it would be beneficial for us to introduce list files in FileIO.
>
> Given SupportsPrefixOperations exists: use listPrefix(). Similarly, use SupportsBulkOperations.deleteFiles() for bulk deletion.
>
> You should actually be able to wire them up, either directly, deleteFiles(listPrefix(path)), or, more interestingly, with a filter in between. This would integrate paged LIST results with paged single/bulk delete calls.
>
> The S3FileIO.deleteFiles() and the hadoop 3.4.1 variant (which will be ready for review once we ship that) https://github.com/apache/iceberg/pull/10233 can both do the bulk delete in aggregate calls, which S3FileIO will actually do asynchronously. Each key in the batch counts as one write operation, so it is trivial to trigger throttling; if the AWS SDK is doing the retries you wouldn't even notice it directly, but all clients writing to that S3 shard will be delayed. Being aggressive here is a bit antisocial for any background vacuuming task.
>
> Anyway: use listPrefix(), but know that even if deleteFiles() is optimised for cloud storage it can be slow and impact every other application writing to the same store. And someone should update org.apache.iceberg.aws.util.RetryDetector to count throttle events the way we do in the s3a codebase.
>
>> I prefer to have this in FileIO and eventually exposed in pyiceberg/iceberg-rust's public API instead of letting users use opendal directly. The public API could be a metadata table or something similar; I haven't given it much thought yet.
>>
>> FileIO is now a widely shared design across different language implementations, and we have built a mature mechanism to allow users to implement and provide their own FileIO. By adding a new API in FileIO, we can ensure that we are not favoring any specific FileIO implementation.
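The deleteFiles(filter(listPrefix(path))) wiring Steve suggests can be sketched as a streaming pipeline that never materialises the full listing; `list_prefix` and `delete_files` are injected stubs standing in for the real FileIO operations, and the batch size is an illustrative value, not a library constant:

```python
from itertools import islice

BULK_DELETE_BATCH = 250  # illustrative: each key in a batch is one write op

def batched(iterable, n):
    """Yield successive lists of up to n items from iterable."""
    it = iter(iterable)
    while batch := list(islice(it, n)):
        yield batch

def delete_matching(list_prefix, delete_files, prefix, keep):
    """deleteFiles(filter(listPrefix(prefix))) as a streaming pipeline.

    keep(path) -> True means the file is referenced and must survive.
    Returns the number of paths handed to delete_files.
    """
    candidates = (p for page in list_prefix(prefix)
                  for p in page if not keep(p))
    deleted = 0
    for batch in batched(candidates, BULK_DELETE_BATCH):
        delete_files(batch)  # one aggregate delete call per batch
        deleted += len(batch)
    return deleted

# Stubbed usage: two LIST pages, keep every path ending in "0".
def stub_list(prefix):
    yield [f"{prefix}/f{i}" for i in range(300)]
    yield [f"{prefix}/f{i}" for i in range(300, 400)]

delete_calls = []
n = delete_matching(stub_list, delete_calls.append, "s3://b/t",
                    keep=lambda p: p.endswith("0"))
```

Capping the batch size (and, in real code, pacing the delete calls) is one way to stay on the polite side of the throttling behaviour described above.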
> Given SupportsPrefixOperations is there, just use that if the FileIO instance supports it.
>
>> On Tue, Aug 13, 2024, at 07:01, André Luis Anastácio wrote:
>>
>>> Thank you, Fokko, for the context! This blog post helped me a lot!
>>>
>>> I understand that in the Iceberg Java implementation the maintenance procedures are just [interfaces](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/actions/DeleteOrphanFiles.java#L34), and the implementation is done on the [engine side](https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L103). What do you think about this for PyIceberg?
>>>
>>>> I was hoping to leverage the metadata tables for that.
>>>
>>> I'm not sure if I understand correctly. Do you mean that the idea would be to access the metadata using the metadata tables through the table's public API instead of reading the metadata files directly?
>>>
>>> If I understood correctly, and following what was done in the Java implementation, what are your thoughts on having the procedures module use only the PyIceberg public API and OpenDAL to handle the filesystem? With that, we would have something that is not coupled to the PyIceberg internals.
>>>
>>> André Anastácio
>>>
>>> On Monday, August 12th, 2024 at 5:03 PM, Fokko Driesprong <fo...@apache.org> wrote:
>>>
>>>> Hi André,
>>>>
>>>> First of all, thanks for raising this. Maintenance routines are a long-awaited functionality in PyIceberg.
>>>> The [FileIO concept](https://iceberg.apache.org/fileio/) is not limited to PyIceberg, but is [also present in Java](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/io/FileIO.java) and [Iceberg-Rust](https://github.com/apache/iceberg-rust/blob/bbbea9751439dea6afb85f5acf0f3689357cf3de/crates/iceberg/src/io/file_io.rs#L40). The main focus of FileIO is to provide object-store-native operations to the Iceberg client (an excellent blog can be found [here](https://tabular.io/blog/iceberg-fileio-cloud-native-tables/)). Based on this, I don't think we want to create a first-class citizen for FileSystem-like operations, because Iceberg is designed to work with object-store-native operations.
>>>>
>>>> That said, in PyIceberg the abstraction between the engine and the FileIO is not as clear as in other implementations. This is mostly because the [ArrowFileIO](https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L328) returns Arrow buffers, and therefore we ended up with a more tightly coupled implementation than desired. It would be good to see if we can untangle that, and I'm sure that once we get OpenDAL or Iceberg-Rust in there, there will be a strong need to do so.
>>>>
>>>> Orphan file removal is quite a resource-intensive operation, since it requires listing all the files under the location and comparing this with all the files in the metadata (I was hoping to leverage the metadata tables for that).
>>>>
>>>> Hope this helps!
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>> On Mon, 12 Aug 2024 at 14:38, André Luis Anastácio <ndrl...@proton.me.invalid> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I've been studying the Java implementation of orphan file removal to replicate it in PyIceberg.
>>>>> During this process, I noticed a key difference: in Java, we use the Hadoop Filesystem[1], while in PyIceberg, we use the Filesystem provided by FileIO[2][3].
>>>>>
>>>>> Currently, we support two FileIO implementations: Fsspec and PyArrow. However, there is a hard requirement to use PyArrow for the reading process, and when we instantiate the FileSystem, we wrap Fsspec with the PyArrow interface[4][5]. Thus, we can say that the default filesystem interface is the PyArrow one.
>>>>>
>>>>> In the future, we aim to use the FileIO from iceberg-rust, which leverages OpenDAL, a tool that doesn't have wrappers for the Fsspec or Arrow interfaces.
>>>>>
>>>>> For the FileIO context (write/read/delete operations), I believe we are in good shape. The challenge arises when we need to access the Filesystem object to handle tasks like listing files.
>>>>>
>>>>> With this in mind, I want to open a discussion about how we should standardize an interface for file listing. What should be our default interface for listing files?
>>>>>
>>>>> - Create our own definition (e.g., extend FileIO or create a new Filesystem interface)
>>>>> - Use Fsspec
>>>>> - Use Arrow
>>>>> - Use OpenDAL
>>>>> - Other?
>>>>>
>>>>> Could we move the implementation for retrieving and wrapping the Filesystem[4][5] to another location, so it can be reused elsewhere?
>>>>>
>>>>> Any other suggestions?
>>>>>
>>>>> [1] https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L356
>>>>> [2] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/fsspec.py#L350-L354
>>>>> [3] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L346-L401
>>>>> [4] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1335-L1349
>>>>> [5] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1429-L1443
>>>>>
>>>>> André Anastácio
>>
>> Xuanwo
>>
>> https://xuanwo.io/
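The capability check Steve mentions ("use SupportsPrefixOperations if the FileIO instance supports it") could translate to Python as a runtime-checkable protocol. Every name below is illustrative, sketching the pattern rather than the actual PyIceberg API:

```python
from typing import Iterator, Protocol, runtime_checkable

@runtime_checkable
class SupportsPrefixOperations(Protocol):
    """Illustrative Python analogue of Java's SupportsPrefixOperations."""
    def list_prefix(self, prefix: str) -> Iterator[str]: ...

class PrefixListingFileIO:
    """Hypothetical FileIO that can list files under a prefix."""
    def __init__(self, files):
        self._files = files
    def list_prefix(self, prefix):
        return iter(p for p in self._files if p.startswith(prefix))

class BasicFileIO:
    """Hypothetical FileIO with no listing support at all."""

def list_files(io, prefix):
    # Structural check: any object exposing list_prefix() qualifies.
    if isinstance(io, SupportsPrefixOperations):
        return list(io.list_prefix(prefix))
    raise NotImplementedError("this FileIO cannot list files")

io = PrefixListingFileIO(["s3://b/t/a", "s3://b/t/b", "s3://b/x"])
files = list_files(io, "s3://b/t")
```

This keeps listing an optional capability of FileIO, as in Java, instead of forcing every implementation (Fsspec, PyArrow, a future OpenDAL-backed one) to expose a full Filesystem object.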