Hello everyone,

I’ve been studying the Java implementation of orphan file removal to replicate it in PyIceberg. During this process, I noticed a key difference: in Java, we use the Hadoop Filesystem[1], while in PyIceberg, we use the Filesystem provided by FileIO[2][3].
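
To make the requirement concrete, here is a minimal sketch of what orphan file removal needs from a filesystem: a recursive listing that can be compared against the set of files the table metadata references. The names `SupportsListing`, `LocalListing`, and `find_orphans` are hypothetical and exist only for illustration; they are not part of PyIceberg, and the local-disk implementation simply stands in for whichever backend (Fsspec, Arrow, OpenDAL) we end up choosing.

```python
import os
import tempfile
from typing import Iterator, Protocol, Set


class SupportsListing(Protocol):
    """Hypothetical listing interface; not part of PyIceberg's FileIO."""

    def list_files(self, location: str) -> Iterator[str]:
        """Yield the path of every file found recursively under `location`."""
        ...


class LocalListing:
    """Illustrative local-disk implementation built on os.walk."""

    def list_files(self, location: str) -> Iterator[str]:
        for root, _dirs, files in os.walk(location):
            for name in files:
                yield os.path.join(root, name)


def find_orphans(fs: SupportsListing, location: str, referenced: Set[str]) -> Set[str]:
    """Return files under `location` that are absent from `referenced`."""
    return set(fs.list_files(location)) - referenced


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as table_dir:
        data_dir = os.path.join(table_dir, "data")
        os.makedirs(data_dir)
        kept = os.path.join(data_dir, "a.parquet")
        stray = os.path.join(data_dir, "b.parquet")
        for path in (kept, stray):
            with open(path, "w") as handle:
                handle.write("")
        # Only `kept` is referenced, so `stray` is reported as an orphan.
        orphans = find_orphans(LocalListing(), table_dir, {kept})
        print(sorted(os.path.basename(p) for p in orphans))
```

The point of the sketch is only that the caller depends on a single `list_files` method; anything beyond that (path normalization across schemes, filtering by file age, and so on) is left out here.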

Currently, we support two FileIO implementations: Fsspec and PyArrow. However, there is a hard requirement to use PyArrow for the reading process, and when we instantiate the FileSystem, we wrap Fsspec with the PyArrow interface[4][5]. Thus, we can say that the default filesystem interface is the PyArrow one.

In the future, we aim to use the FileIO from rust-iceberg, which leverages OpenDAL, a tool that doesn’t have wrappers for the Fsspec or Arrow interfaces.

For the FileIO context (write/read/delete operations), I believe we are in good shape. The challenge arises when we need to access the Filesystem object to handle tasks like listing files.

With this in mind, I want to open a discussion about how we should standardize an interface for file listing.

What should be our default interface for listing files?

- Create our own definition (e.g., extend FileIO or create a new Filesystem interface)
- Use Fsspec
- Use Arrow
- Use OpenDAL
- Other?

Could we move the implementation for retrieving and wrapping the Filesystem[4][5] to another location, so it can be reused elsewhere?

Any other suggestions?

[1] https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L356
[2] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/fsspec.py#L350-L354
[3] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L346-L401
[4] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1335-L1349
[5] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1429-L1443

André Anastácio