I believe I now understand how to leverage the metadata tables to remove orphan files.
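At its core, the metadata-table approach boils down to a set difference: everything referenced by the table's metadata (data files, delete files, metadata files) is live, and anything else under the table location is an orphan candidate. A minimal sketch with stubbed-in inputs; in PyIceberg the referenced paths would come from the metadata tables and the listed paths from storage, and `find_orphans` is a hypothetical helper, not an existing API:

```python
# Orphan detection as a set difference. All inputs here are stubbed lists;
# real code would populate them from the ALL_FILES / DELETE_FILES metadata
# tables and from a storage listing of the table location.

def find_orphans(listed_paths, data_files, delete_files, metadata_files):
    """Return paths present in storage but unreferenced by table metadata."""
    referenced = set(data_files) | set(delete_files) | set(metadata_files)
    return sorted(set(listed_paths) - referenced)

# Stubbed example: c.parquet is not referenced by any metadata table.
listed = [
    "s3://bucket/tbl/data/a.parquet",
    "s3://bucket/tbl/data/b.parquet",
    "s3://bucket/tbl/data/c.parquet",
    "s3://bucket/tbl/data/del-1.parquet",
]
orphans = find_orphans(
    listed,
    data_files=["s3://bucket/tbl/data/a.parquet",
                "s3://bucket/tbl/data/b.parquet"],
    delete_files=["s3://bucket/tbl/data/del-1.parquet"],
    metadata_files=[],
)
```

The hard part in practice is not this comparison but producing `listed` efficiently, which is what the rest of this thread is about.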
I didn't know that the DELETE_FILES metadata table existed, so I believe this is what Fokko meant. Fokko, was your idea to use the DELETE_FILES and ALL_FILES metadata tables? Do you know why these metadata tables are not used in the Spark implementation?

Javadoc reference: https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/MetadataTableType.html

André Anastácio

On Tuesday, August 13th, 2024 at 7:34 AM, Steve Loughran <ste...@cloudera.com.INVALID> wrote:

> On Tue, 13 Aug 2024 at 03:50, Xuanwo <xua...@apache.org> wrote:
>
>> Hi, André
>>
>> Thanks a lot for starting this thread.
>>
>> List operations on storage services are expensive and slow. That's why Iceberg is designed to store metadata in files and avoid using list operations in FileIO. However, `orphan file removal` or `garbage cleanup` are special tasks that do require scanning the entire storage location and comparing it with our existing metadata files.
>
> Not quite.
>
> Listing via treewalking is awful because it has high latency on "pure" object stores with client-side mimicked directories, and you pay per LIST call.
>
> In S3, as implemented by S3FileIO and HadoopFileIO, the SupportsPrefixOperations.listPrefix() operation is independent of directory structure and instead just O(files). Results come back in pages of about 1000 entries. If you have versioned buckets you'll get fewer entries per page when there are many overwritten/tombstoned objects. To compensate for this you should really schedule processing of the results in threads separate from the one doing the listing. This is also beneficial for classic tree walks on high-latency stores with "real" directories, including Azure ADLS Gen2.
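The overlap Steve describes (processing one page of results while the next LIST call is in flight) can be sketched with a bounded queue and a worker thread. `fake_list_prefix` is a stand-in for the real paged LIST API, not an actual FileIO call:

```python
import queue
import threading

PAGE_SIZE = 1000  # S3 ListObjectsV2 returns at most 1000 keys per page

def fake_list_prefix(prefix, total=2500):
    """Stand-in for a paged LIST call: yields pages of up to PAGE_SIZE keys."""
    keys = [f"{prefix}/file-{i}" for i in range(total)]
    for start in range(0, total, PAGE_SIZE):
        yield keys[start:start + PAGE_SIZE]

def list_and_process(prefix, process_page):
    """Fetch pages on this thread while a worker consumes earlier pages."""
    pages = queue.Queue(maxsize=2)  # bounded so listing can't run far ahead

    def worker():
        while True:
            page = pages.get()
            if page is None:  # sentinel: no more pages
                break
            process_page(page)

    t = threading.Thread(target=worker)
    t.start()
    for page in fake_list_prefix(prefix):
        pages.put(page)  # blocks if the worker falls behind
    pages.put(None)
    t.join()

seen = []
list_and_process("s3://bucket/tbl", seen.extend)
```

The bounded queue matters: it keeps memory flat even over millions of keys, while still letting the listing thread issue the next LIST while the previous page is being compared against metadata.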
>> I believe that if there is a way to ensure all engines use List operations correctly (don't abuse list!), it would be beneficial for us to introduce list files in FileIO.
>
> Given SupportsPrefixOperations exists: use listPrefix(). Similarly, use SupportsBulkOperations.deleteFiles() for bulk deletion.
>
> You should actually be able to wire them up, either directly, deleteFiles(listPrefix(path)), or, more interestingly, with a filter in between. This would integrate paged LIST results with paged single/bulk delete calls.
>
> The S3FileIO.deleteFiles() and the hadoop 3.4.1 variant (which will be ready for review once we ship that) https://github.com/apache/iceberg/pull/10233 can both do the bulk delete in aggregate calls, which S3FileIO will actually do asynchronously. Each key in the batch counts as one write operation, so it is trivial to trigger throttling; if the AWS SDK is doing the retries you wouldn't even notice it directly, but all clients writing to that S3 shard will be delayed. Being aggressive here is a bit antisocial for any background vacuuming task.
>
> Anyway: use listPrefix(), but know that even if deleteFiles() is optimised for cloud storage it can be slow and impact every other application writing to the same store. And someone should update org.apache.iceberg.aws.util.RetryDetector to count throttle events the way we do in the s3a codebase.
>
>> I prefer to have this in FileIO and eventually exposed in pyiceberg/iceberg-rust's public API instead of letting users use opendal directly. The public API could be a metadata table or something similar; I haven't given it much thought yet.
>>
>> FileIO is now a widely shared design across different language implementations, and we have built a mature mechanism to allow users to implement and provide their own FileIO. By adding a new API in FileIO, we can ensure that we are not favoring any specific FileIO implementation.
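The deleteFiles(filter(listPrefix(path))) wiring Steve suggests can be sketched as a streaming pipeline that never materialises the full listing; `list_prefix` and `delete_files` are injected stubs standing in for the real FileIO operations, and the batch size is an illustrative value, not a library constant:

```python
from itertools import islice

BULK_DELETE_BATCH = 250  # illustrative: each key in a batch is one write op

def batched(iterable, n):
    """Yield successive lists of up to n items from iterable."""
    it = iter(iterable)
    while batch := list(islice(it, n)):
        yield batch

def delete_matching(list_prefix, delete_files, prefix, keep):
    """deleteFiles(filter(listPrefix(prefix))) as a streaming pipeline.

    keep(path) -> True means the file is referenced and must survive.
    Returns the number of paths handed to delete_files.
    """
    candidates = (p for page in list_prefix(prefix)
                  for p in page if not keep(p))
    deleted = 0
    for batch in batched(candidates, BULK_DELETE_BATCH):
        delete_files(batch)  # one aggregate delete call per batch
        deleted += len(batch)
    return deleted

# Stubbed usage: two LIST pages, keep every path ending in "0".
def stub_list(prefix):
    yield [f"{prefix}/f{i}" for i in range(300)]
    yield [f"{prefix}/f{i}" for i in range(300, 400)]

delete_calls = []
n = delete_matching(stub_list, delete_calls.append, "s3://b/t",
                    keep=lambda p: p.endswith("0"))
```

Capping the batch size (and, in real code, pacing the delete calls) is one way to stay on the polite side of the throttling behaviour described above.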
> Given SupportsPrefixOperations is there, just use that if the FileIO instance supports it.
>
>> On Tue, Aug 13, 2024, at 07:01, André Luis Anastácio wrote:
>>
>>> Thank you, Fokko, for the context! This blog post helped me a lot!
>>>
>>> I understand that in the Iceberg Java implementation the maintenance procedures are just [interfaces](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/actions/DeleteOrphanFiles.java#L34), and the implementation is done on the [engine side](https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L103). What do you think about this for PyIceberg?
>>>
>>>> I was hoping to leverage the metadata tables for that.
>>>
>>> I'm not sure if I understand correctly. Do you mean that the idea would be to access the metadata using the metadata tables through the table's public API instead of reading the metadata files directly?
>>>
>>> If I understood correctly, and following what was done in the Java implementation, what are your thoughts on having the procedures module use only the PyIceberg public API and OpenDAL to handle the filesystem? With that, we would have something that is not coupled to the PyIceberg internals.
>>>
>>> André Anastácio
>>>
>>> On Monday, August 12th, 2024 at 5:03 PM, Fokko Driesprong <fo...@apache.org> wrote:
>>>
>>>> Hi André,
>>>>
>>>> First of all, thanks for raising this. Maintenance routines are a long-awaited functionality in PyIceberg.
>>>> The [FileIO concept](https://iceberg.apache.org/fileio/) is not limited to PyIceberg, but is [also present in Java](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/io/FileIO.java) and [Iceberg-Rust](https://github.com/apache/iceberg-rust/blob/bbbea9751439dea6afb85f5acf0f3689357cf3de/crates/iceberg/src/io/file_io.rs#L40). The main focus of FileIO is to provide object-store-native operations to the Iceberg client (an excellent blog can be found [here](https://tabular.io/blog/iceberg-fileio-cloud-native-tables/)). Based on this, I don't think we want to create a first-class citizen for FileSystem-like operations, because Iceberg is designed to work with object-store-native operations.
>>>>
>>>> That said, in PyIceberg the abstraction between the engine and the FileIO is not as clear as in other implementations. This is mostly because the [ArrowFileIO](https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L328) returns Arrow buffers, and therefore we ended up with a more tightly coupled implementation than desired. It would be good to see if we can untangle that, and I'm sure that once we get OpenDAL or Iceberg-Rust in there, there will be a strong need to do so.
>>>>
>>>> Orphan file removal is quite a resource-intensive operation, since it requires listing all the files under the location and comparing this with all the files in the metadata (I was hoping to leverage the metadata tables for that).
>>>>
>>>> Hope this helps!
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>> On Mon, 12 Aug 2024 at 14:38, André Luis Anastácio <ndrl...@proton.me.invalid> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I've been studying the Java implementation of orphan file removal to replicate it in PyIceberg.
>>>>> During this process, I noticed a key difference: in Java, we use the Hadoop Filesystem[1], while in PyIceberg, we use the Filesystem provided by FileIO[2][3].
>>>>>
>>>>> Currently, we support two FileIO implementations: Fsspec and PyArrow. However, there is a hard requirement to use PyArrow for the reading process, and when we instantiate the FileSystem, we wrap Fsspec with the PyArrow interface[4][5]. Thus, we can say that the default filesystem interface is the PyArrow one.
>>>>>
>>>>> In the future, we aim to use the FileIO from iceberg-rust, which leverages OpenDAL, a tool that doesn't have wrappers for the Fsspec or Arrow interfaces.
>>>>>
>>>>> For the FileIO context (write/read/delete operations), I believe we are in good shape. The challenge arises when we need to access the Filesystem object to handle tasks like listing files.
>>>>>
>>>>> With this in mind, I want to open a discussion about how we should standardize an interface for file listing. What should be our default interface for listing files?
>>>>>
>>>>> - Create our own definition (e.g., extend FileIO or create a new Filesystem interface)
>>>>> - Use Fsspec
>>>>> - Use Arrow
>>>>> - Use OpenDAL
>>>>> - Other?
>>>>>
>>>>> Could we move the implementation for retrieving and wrapping the Filesystem[4][5] to another location, so it can be reused elsewhere?
>>>>>
>>>>> Any other suggestions?
>>>>>
>>>>> [1] https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L356
>>>>> [2] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/fsspec.py#L350-L354
>>>>> [3] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L346-L401
>>>>> [4] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1335-L1349
>>>>> [5] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1429-L1443
>>>>>
>>>>> André Anastácio
>>
>> Xuanwo
>>
>> https://xuanwo.io/
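The capability check Steve mentions ("use SupportsPrefixOperations if the FileIO instance supports it") could translate to Python as a runtime-checkable protocol. Every name below is illustrative, sketching the pattern rather than the actual PyIceberg API:

```python
from typing import Iterator, Protocol, runtime_checkable

@runtime_checkable
class SupportsPrefixOperations(Protocol):
    """Illustrative Python analogue of Java's SupportsPrefixOperations."""
    def list_prefix(self, prefix: str) -> Iterator[str]: ...

class PrefixListingFileIO:
    """Hypothetical FileIO that can list files under a prefix."""
    def __init__(self, files):
        self._files = files
    def list_prefix(self, prefix):
        return iter(p for p in self._files if p.startswith(prefix))

class BasicFileIO:
    """Hypothetical FileIO with no listing support at all."""

def list_files(io, prefix):
    # Structural check: any object exposing list_prefix() qualifies.
    if isinstance(io, SupportsPrefixOperations):
        return list(io.list_prefix(prefix))
    raise NotImplementedError("this FileIO cannot list files")

io = PrefixListingFileIO(["s3://b/t/a", "s3://b/t/b", "s3://b/x"])
files = list_files(io, "s3://b/t")
```

This keeps listing an optional capability of FileIO, as in Java, instead of forcing every implementation (Fsspec, PyArrow, a future OpenDAL-backed one) to expose a full Filesystem object.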