[ 
https://issues.apache.org/jira/browse/ARROW-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485561#comment-17485561
 ] 

Will Jones commented on ARROW-15135:
------------------------------------

I was reading over the Iceberg spec, and it occurred to me that a dataset 
factory wouldn't be sufficient for Iceberg tables. Two main complications:

 # Columns in data files are identified by field IDs rather than by name. Once 
read, they are supposed to be mapped to the table's current column names. [See 
docs here|https://iceberg.apache.org/#spec/#column-projection].
 # V2 of the spec allows for "row-level deletes", where delete files store 
either the positions of specific rows to delete or an equality predicate that 
identifies deleted rows. The matching rows must be filtered out on read, and 
the rules for which delete files apply to which data files are fairly 
involved. [See deletes docs|https://iceberg.apache.org/#spec/#delete-formats].

The [Scan Planning|https://iceberg.apache.org/#spec/#scan-planning] section of 
the doc gives a good overview. 

I could see #1 (column mapping) being something we could add as a feature to 
existing datasets. 
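
To make the column mapping concrete, here is a rough sketch in plain Python. 
It assumes the reader already knows each file column's Iceberg field ID and 
the table's current schema as (field ID, name) pairs. The function name and 
data layout are hypothetical, not an existing Arrow API, and treating a 
missing field ID as nulls is a simplification of the spec's projection rules.

```python
def project_by_field_id(file_columns, file_field_ids, table_schema):
    """Rename file columns to the table's current names via field IDs.

    file_columns:   dict of name-in-file -> list of values (hypothetical layout)
    file_field_ids: dict of name-in-file -> Iceberg field ID
    table_schema:   list of (field_id, current_name) pairs
    """
    # Index the file's columns by field ID, ignoring the names stored in the file.
    by_id = {file_field_ids[name]: values for name, values in file_columns.items()}
    num_rows = len(next(iter(file_columns.values()), []))

    projected = {}
    for field_id, current_name in table_schema:
        # Simplification: a field ID absent from the file (for example a column
        # added after the file was written) is read as all nulls.
        projected[current_name] = by_id.get(field_id, [None] * num_rows)
    return projected


result = project_by_field_id(
    {"old_name": [1, 2], "b": ["x", "y"]},
    {"old_name": 1, "b": 2},
    [(1, "id"), (2, "b"), (3, "c")],
)
print(result)
```

Because the lookup is keyed on field ID, a column renamed in the table after 
the file was written still resolves to the right data.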

But for #2, it seems to me that we would need to implement an IcebergDataset 
and IcebergFragment, as Weston suggested.
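
As an illustration of why deletes need fragment-level logic, here is a minimal 
sketch in plain Python of filtering one data file's rows. The function name 
and tuple layouts are hypothetical. The sequence-number comparisons reflect my 
reading of the spec (position deletes apply when the data file's sequence 
number is less than or equal to the delete's, equality deletes only when it is 
strictly less), and partition scoping is left out entirely.

```python
def apply_deletes(rows, data_path, data_seq, position_deletes, equality_deletes):
    """Return the rows of one data file that survive row-level deletes.

    rows:             list of dicts, in file order
    position_deletes: list of (delete_seq, file_path, pos) tuples
    equality_deletes: list of (delete_seq, {column: value}) tuples
    """
    # Position deletes target a specific file and row ordinal.
    deleted_positions = {
        pos
        for seq, path, pos in position_deletes
        if path == data_path and data_seq <= seq
    }
    # Equality deletes only affect data written before the delete.
    active_equality = [match for seq, match in equality_deletes if data_seq < seq]

    kept = []
    for pos, row in enumerate(rows):
        if pos in deleted_positions:
            continue
        if any(
            all(row.get(col) == val for col, val in match.items())
            for match in active_equality
        ):
            continue
        kept.append(row)
    return kept


rows = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
survivors = apply_deletes(
    rows,
    "a.parquet",
    5,
    [(5, "a.parquet", 0), (6, "b.parquet", 1)],
    [(5, {"id": 2}), (6, {"id": 3})],
)
print(survivors)
```

The filter depends on per-file metadata (path and sequence number), which is 
why it fits more naturally in a fragment than in a dataset factory.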

> [C++][R][Python] Support reading from Apache Iceberg tables
> -----------------------------------------------------------
>
>                 Key: ARROW-15135
>                 URL: https://issues.apache.org/jira/browse/ARROW-15135
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Will Jones
>            Priority: Major
>
> This is an umbrella issue for supporting the [Apache Iceberg table 
> format|https://iceberg.apache.org/].
> Dremio has a good overview of the format here: 
> https://www.dremio.com/apache-iceberg-an-architectural-look-under-the-covers/



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
