Re: [DISCUSS] Filesystem in PyIceberg

André Luis Anastácio Mon, 12 Aug 2024 16:02:15 -0700

Thank you, Fokko, for the context! The blog post helped me a lot!

I understand that in the Iceberg Java implementation the maintenance procedures 
are just 
[interfaces](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/actions/DeleteOrphanFiles.java#L34),
 and the implementation is done on the [engine 
side](https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L103).
 What do you think about this for PyIceberg?
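
To make the question concrete, here is a minimal sketch of what that split could look like in Python. Everything in it is hypothetical: the names `DeleteOrphanFiles`, `DryRunDeleteOrphanFiles`, and `run_maintenance` are my own placeholders and not part of PyIceberg today. The idea is only that PyIceberg would define the contract and each backend would provide its own implementation.

```python
from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class DeleteOrphanFiles(Protocol):
    """Hypothetical contract, loosely modelled on the Java action interface."""

    def execute(self) -> list[str]:
        """Run the action and return the orphan file locations it found."""
        ...


class DryRunDeleteOrphanFiles:
    """Hypothetical backend that only reports candidates and deletes nothing."""

    def __init__(self, candidates: Iterable[str]) -> None:
        self._candidates = list(candidates)

    def execute(self) -> list[str]:
        # A real backend would list and delete files here.
        return sorted(self._candidates)


def run_maintenance(action: DeleteOrphanFiles) -> list[str]:
    # Callers depend only on the contract, never on a specific backend.
    return action.execute()
```

With this shape, a Spark-like engine, a PyArrow-based backend, or an OpenDAL-based one could each satisfy the same protocol without the caller changing.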


> I was hoping to leverage the metadata tables for that.

I’m not sure if I understand correctly. Do you mean that the idea would be to 
access the metadata using the metadata tables through the table public API 
instead of reading the metadata files directly?

If I understood correctly, and following what was done in the Java 
implementation, what are your thoughts on having the procedures module use 
only the PyIceberg public API and OpenDAL to handle filesystem access? With that, 
we would have something that is not coupled to the PyIceberg internals.
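
As a rough illustration of that decoupling, the core of the procedure could be a plain set difference between the files listed under the table location and the files referenced by the metadata. This is only a sketch under assumptions: `FileLister`, `find_orphan_files`, and `fake_lister` are hypothetical names, the referenced paths are assumed to come from the metadata tables through the public API, and the lister stands in for whatever OpenDAL (or another backend) would provide.

```python
from typing import Callable, Iterable

# Hypothetical signature: given a table location, yield every file path under it.
FileLister = Callable[[str], Iterable[str]]


def find_orphan_files(
    location: str,
    referenced: Iterable[str],
    list_files: FileLister,
) -> list[str]:
    """Return the listed paths that the table metadata does not reference."""
    referenced_set = set(referenced)
    return sorted(
        path for path in list_files(location) if path not in referenced_set
    )


def fake_lister(location: str) -> list[str]:
    # Stand-in for a real listing backend; the paths are made up.
    return [
        f"{location}/data/a.parquet",
        f"{location}/data/b.parquet",
        f"{location}/data/stale.parquet",
    ]
```

A real procedure would also have to normalize paths (for example, differing URI schemes for the same object) and avoid deleting files that belong to in-progress writes; the sketch ignores both.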

André Anastácio

On Monday, August 12th, 2024 at 5:03 PM, Fokko Driesprong <[email protected]> 
wrote:

> Hi André,
>
> First of all, thanks for raising this. Maintenance routines are a 
> long-awaited functionality in PyIceberg.
>
> The [FileIO concept](https://iceberg.apache.org/fileio/) is not limited to 
> PyIceberg, but is [also present in 
> Java](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/io/FileIO.java)
>  and 
> [Iceberg-Rust](https://github.com/apache/iceberg-rust/blob/bbbea9751439dea6afb85f5acf0f3689357cf3de/crates/iceberg/src/io/file_io.rs#L40).
>  The main focus of FileIO is to provide object-store native operations to the 
> Iceberg client (an excellent blog can be found 
> [here](https://tabular.io/blog/iceberg-fileio-cloud-native-tables/)). Based 
> on this, I don't think we want to make FileSystem-like operations a 
> first-class citizen, because Iceberg is designed to work with native 
> object-store operations.
>
> That said, in PyIceberg the abstraction between the engine and the FileIO is 
> not as clear as in other implementations. This is mostly because the 
> [ArrowFileIO](https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L328)
>  returns Arrow buffers, and therefore we ended up with a more tightly coupled 
> implementation than desired. It would be good to see if we can untangle that, 
> and I'm sure that once we get OpenDAL or Iceberg-Rust in there, there will be 
> a strong need to do that.
>
> Orphan file removal is quite a resource-intensive operation since it requires 
> listing all the files under the location, and comparing this with all the 
> files in the metadata (I was hoping to leverage the metadata tables for that).
>
> Hope this helps!
>
> Kind regards,
> Fokko
>
> On Mon, 12 Aug 2024 at 14:38, André Luis Anastácio 
> <[email protected]> wrote:
>
>> Hello everyone,
>>
>> I’ve been studying the Java implementation of orphan file removal to 
>> replicate it in PyIceberg. During this process, I noticed a key difference: 
>> in Java, we use the Hadoop Filesystem[1], while in PyIceberg, we use the 
>> Filesystem provided by FileIO[2][3].
>>
>> Currently, we support two FileIO implementations: Fsspec and PyArrow. 
>> However, there is a hard requirement to use PyArrow for the reading process, 
>> and when we instantiate the FileSystem, we wrap Fsspec with the PyArrow 
>> interface[4][5].
>>
>> Thus, we can say that the default filesystem interface is the PyArrow one.
>>
>> In the future, we aim to use the FileIO from rust-iceberg, which leverages 
>> OpenDAL—a tool that doesn’t have wrappers for the Fsspec or Arrow interfaces.
>>
>> For the FileIO context (write/read/delete operations), I believe we are in 
>> good shape. The challenge arises when we need to access the Filesystem 
>> object to handle tasks like listing files.
>>
>> With this in mind, I want to open a discussion about how we should 
>> standardize an interface for file listing.
>>
>> What should be our default interface for listing files?
>>
>> - Create our own definition (e.g., extend FileIO or create a new Filesystem 
>> interface)
>> - Use Fsspec
>> - Use Arrow
>> - Use OpenDAL
>> - Other?
>>
>> Could we move the implementation for retrieving and wrapping the 
>> Filesystem[4][5] to another location, so it can be reused elsewhere?
>> Any other suggestions?
>>
>> [1] 
>> https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L356
>> [2] 
>> https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/fsspec.py#L350-L354
>> [3] 
>> https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L346-L401
>> [4] 
>> https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1335-L1349
>> [5] 
>> https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1429-L1443
>>
>> André Anastácio
