Hello everyone,

I’ve been studying the Java implementation of orphan file removal to replicate 
it in PyIceberg. During this process, I noticed a key difference: in Java, we 
use the Hadoop Filesystem[1], while in PyIceberg, we use the Filesystem 
provided by FileIO[2][3].

Currently, we support two FileIO implementations: Fsspec and PyArrow. However, 
there is a hard requirement to use PyArrow for the reading process, and when we 
instantiate the FileSystem, we wrap Fsspec with the PyArrow interface[4][5].

In practice, this makes the PyArrow FileSystem our default filesystem interface.

In the future, we aim to use the FileIO from rust-iceberg, which is built on 
OpenDAL, a library that has no wrappers for the Fsspec or Arrow interfaces.

For the FileIO context (write/read/delete operations), I believe we are in good 
shape. The challenge arises when we need to access the Filesystem object to 
handle tasks like listing files.

With this in mind, I want to open a discussion about how we should standardize 
an interface for file listing.

What should be our default interface for listing files?

- Create our own definition (e.g., extend FileIO or create a new Filesystem 
interface)
- Use Fsspec
- Use Arrow
- Use OpenDAL
- Other?
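
As a starting point for the first option, here is a rough sketch of what our 
own listing interface could look like. Every name in it (FileEntry, 
SupportsListing, list_prefix, LocalListing, find_orphans) is hypothetical and 
does not exist in PyIceberg today; the local implementation uses only the 
standard library, and a PyArrow, Fsspec or OpenDAL backend would implement 
the same method.

```python
from __future__ import annotations

import os
import tempfile
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class FileEntry:
    """Minimal metadata that orphan-file removal needs (hypothetical)."""
    location: str
    size: int
    modified_at_ms: int


class SupportsListing(Protocol):
    """Hypothetical listing interface that a FileIO could choose to implement."""

    def list_prefix(self, prefix: str) -> Iterator[FileEntry]: ...


class LocalListing:
    """Local-disk implementation, used here only to exercise the interface."""

    def list_prefix(self, prefix: str) -> Iterator[FileEntry]:
        for root, _dirs, names in os.walk(prefix):
            for name in names:
                path = os.path.join(root, name)
                stat = os.stat(path)
                yield FileEntry(path, stat.st_size, int(stat.st_mtime * 1000))


def find_orphans(lister: SupportsListing, location: str, referenced: set[str]) -> list[str]:
    """Return files under location that table metadata does not reference."""
    return sorted(
        entry.location
        for entry in lister.list_prefix(location)
        if entry.location not in referenced
    )


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as table_dir:
    os.makedirs(os.path.join(table_dir, "data"))
    kept = os.path.join(table_dir, "data", "kept.parquet")
    orphan = os.path.join(table_dir, "data", "orphan.parquet")
    for path in (kept, orphan):
        with open(path, "wb") as f:
            f.write(b"x")

    orphans = find_orphans(LocalListing(), table_dir, referenced={kept})
    orphan_names = [os.path.basename(p) for p in orphans]

print(orphan_names)
```

With an interface of this shape, orphan file removal would depend only on 
list_prefix, so the choice of backend stays an implementation detail of each 
FileIO.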

Could we move the implementation for retrieving and wrapping the 
Filesystem[4][5] to another location, so it can be reused elsewhere?
Any other suggestions?

[1] https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L356
[2] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/fsspec.py#L350-L354
[3] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L346-L401
[4] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1335-L1349
[5] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1429-L1443

André Anastácio
