Hi André,

First of all, thanks for raising this. Maintenance routines are a long-awaited feature in PyIceberg.

The FileIO concept <https://iceberg.apache.org/fileio/> is not limited to PyIceberg; it is also present in Java <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/io/FileIO.java> and Iceberg-Rust <https://github.com/apache/iceberg-rust/blob/bbbea9751439dea6afb85f5acf0f3689357cf3de/crates/iceberg/src/io/file_io.rs#L40>. The main focus of FileIO is to provide object-store-native operations to the Iceberg client (an excellent blog post can be found here <https://tabular.io/blog/iceberg-fileio-cloud-native-tables/>). Based on this, I don't think we want to make FileSystem-like operations a first-class citizen, because Iceberg is designed to work with object-store-native operations.

That said, in PyIceberg the abstraction between the engine and the FileIO is not as clear as in other implementations. This is mostly because the ArrowFileIO <https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L328> returns Arrow buffers, and therefore we ended up with a more tightly coupled implementation than desired. It would be good to see if we can untangle that, and I'm sure that once we get OpenDAL or Iceberg-Rust in there, there will be a strong need to do that.

Removing orphan files is quite a resource-intensive operation, since it requires listing all the files under the table location and comparing them with all the files referenced in the metadata (I was hoping to leverage the metadata tables for that).

Hope this helps!

Kind regards,
Fokko

On Mon, 12 Aug 2024 at 14:38, André Luis Anastácio <ndrl...@proton.me.invalid> wrote:

> Hello everyone,
>
> I’ve been studying the Java implementation of orphan file removal to
> replicate it in PyIceberg. During this process, I noticed a key difference:
> in Java, we use the Hadoop Filesystem[1], while in PyIceberg, we use the
> Filesystem provided by FileIO[2][3].
>
> Currently, we support two FileIO implementations: Fsspec and PyArrow.
> However, there is a hard requirement to use PyArrow for the reading
> process, and when we instantiate the FileSystem, we wrap Fsspec with the
> PyArrow interface[4][5].
>
> Thus, we can say that the default filesystem interface is the PyArrow one.
>
> In the future, we aim to use the FileIO from rust-iceberg, which leverages
> OpenDAL, a tool that doesn’t have wrappers for the Fsspec or Arrow
> interfaces.
>
> For the FileIO context (write/read/delete operations), I believe we are in
> good shape. The challenge arises when we need to access the Filesystem
> object to handle tasks like listing files.
>
> With this in mind, I want to open a discussion about how we should
> standardize an interface for file listing.
>
> What should be our default interface for listing files?
>
> - Create our own definition (e.g., extend FileIO or create a new
> Filesystem interface)
> - Use Fsspec
> - Use Arrow
> - Use OpenDAL
> - Other?
>
> Could we move the implementation for retrieving and wrapping the
> Filesystem[4][5] to another location, so it can be reused elsewhere?
>
> Any other suggestions?
>
> [1]
> https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L356
> [2]
> https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/fsspec.py#L350-L354
> [3]
> https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L346-L401
> [4]
> https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1335-L1349
> [5]
> https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1429-L1443
>
> André Anastácio
>
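
To make the orphan-file comparison discussed in this thread concrete: it boils down to a set difference between the files found under the table location and the files referenced by the metadata. Below is a minimal sketch of that idea, not PyIceberg code. Both `list_files` and `find_orphan_files` are hypothetical names; the listing walks a local directory as a stand-in for whatever listing interface ends up being chosen, and the referenced paths are assumed to have been collected from the table metadata already.

```python
from pathlib import Path
from typing import Iterable, Set


def list_files(location: Path) -> Set[str]:
    """Hypothetical stand-in for a filesystem listing call.

    Recursively collects every regular file under `location`. A real
    implementation would list an object-store prefix instead.
    """
    return {p.as_posix() for p in location.rglob("*") if p.is_file()}


def find_orphan_files(listed: Iterable[str], referenced: Iterable[str]) -> Set[str]:
    """Return the paths that were listed but are not referenced by the metadata.

    `referenced` is assumed to hold the paths of all files the table
    metadata still points to, in the same path format as `listed`.
    """
    return set(listed) - set(referenced)
```

Calling `find_orphan_files(list_files(table_root), referenced_paths)` yields the candidates for deletion. The listing is the expensive part, which is why the choice of listing interface matters.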