Hi André,

First of all, thanks for raising this. Maintenance routines are a
long-awaited feature in PyIceberg.

The FileIO concept <https://iceberg.apache.org/fileio/> is not limited to
PyIceberg, but is also present in Java
<https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/io/FileIO.java>
and Iceberg-Rust
<https://github.com/apache/iceberg-rust/blob/bbbea9751439dea6afb85f5acf0f3689357cf3de/crates/iceberg/src/io/file_io.rs#L40>.
The main focus of FileIO is to provide object-store-native operations to
the Iceberg client (an excellent blog post can be found here
<https://tabular.io/blog/iceberg-fileio-cloud-native-tables/>). Based on
this, I don't think we want to make FileSystem-like operations a
first-class citizen, because Iceberg is designed to work with
object-store-native operations.

That said, in PyIceberg the abstraction between the engine and the FileIO
is not as clear as in other implementations. This is mostly because the
PyArrowFileIO
<https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L328>
returns Arrow buffers, and therefore we ended up with a more tightly
coupled implementation than desired. It would be good to see if we can
untangle that, and I'm sure that once we get OpenDAL or Iceberg-Rust in
there, there will be a strong need to do that.

Removing orphan files is quite a resource-intensive operation, since it
requires listing all the files under the table location and comparing them
with all the files referenced in the metadata (I was hoping to leverage the
metadata tables for that).
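
To make the cost concrete, here is a minimal sketch of that comparison. It
is not PyIceberg code: the function name and the example paths are
hypothetical, and it assumes both path collections have already been
produced elsewhere (one by listing the table location, one from the table
metadata) and use the same URI form.

```python
from typing import Iterable, Set


def find_orphan_files(
    listed_paths: Iterable[str], referenced_paths: Iterable[str]
) -> Set[str]:
    """Return paths that exist in storage but are not referenced by metadata.

    Hypothetical helper for illustration only. Both inputs are fully
    materialized as sets, which is where the memory cost comes from.
    """
    return set(listed_paths) - set(referenced_paths)


# Made-up example inputs: the first list stands in for a listing of the
# table location, the second for the files referenced in the metadata.
listed = [
    "s3://bucket/tbl/data/a.parquet",
    "s3://bucket/tbl/data/b.parquet",
    "s3://bucket/tbl/data/stale.parquet",
]
referenced = [
    "s3://bucket/tbl/data/a.parquet",
    "s3://bucket/tbl/data/b.parquet",
]

orphans = find_orphan_files(listed, referenced)
```

A real routine would also need to normalize paths (scheme, trailing
slashes) and skip recently written files that may belong to an in-flight
commit; both are left out of this sketch.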

Hope this helps!

Kind regards,
Fokko

On Mon, 12 Aug 2024 at 14:38, André Luis Anastácio
<ndrl...@proton.me.invalid> wrote:

> Hello everyone,
>
> I’ve been studying the Java implementation of orphan file removal to
> replicate it in PyIceberg. During this process, I noticed a key difference:
> in Java, we use the Hadoop Filesystem[1], while in PyIceberg, we use the
> Filesystem provided by FileIO[2][3].
>
> Currently, we support two FileIO implementations: Fsspec and PyArrow.
> However, there is a hard requirement to use PyArrow for the reading
> process, and when we instantiate the FileSystem, we wrap Fsspec with the
> PyArrow interface[4][5].
>
> Thus, we can say that the default filesystem interface is the PyArrow one.
>
> In the future, we aim to use the FileIO from rust-iceberg, which leverages
> OpenDAL—a tool that doesn’t have wrappers for the Fsspec or Arrow
> interfaces.
>
> For the FileIO context (write/read/delete operations), I believe we are in
> good shape. The challenge arises when we need to access the Filesystem
> object to handle tasks like listing files.
>
> With this in mind, I want to open a discussion about how we should
> standardize an interface for file listing.
>
> What should be our default interface for listing files?
>
> - Create our own definition (e.g., extend FileIO or create a new
> Filesystem interface)
> - Use Fsspec
> - Use Arrow
> - Use OpenDAL
> - Other?
>
> Could we move the implementation for retrieving and wrapping the
> Filesystem[4][5] to another location, so it can be reused elsewhere?
>
> Any other suggestions?
>
> [1]
> https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L356
> [2]
> https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/fsspec.py#L350-L354
> [3]
> https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L346-L401
> [4]
> https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1335-L1349
> [5]
> https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1429-L1443
>
> André Anastácio
>
