[
https://issues.apache.org/jira/browse/ARROW-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485561#comment-17485561
]
Will Jones commented on ARROW-15135:
------------------------------------
I was reading over the Iceberg spec, and it occurred to me that a dataset
factory wouldn't be sufficient for Iceberg tables. Two main complications:
# Columns are identified in data files by unique field IDs rather than by
name. Once read, they are supposed to be mapped to their real column names.
[See docs here|https://iceberg.apache.org/#spec/#column-projection].
# V2 of the spec allows for "row-level deletes", where separate delete files
may store either the positions of specific rows to delete or a delete
predicate; the matching rows must be filtered out on read. The rules for how
these are supposed to be applied are somewhat complex. [See deletes
docs|https://iceberg.apache.org/#spec/#delete-formats].
The [Scan Planning|https://iceberg.apache.org/#spec/#scan-planning] section of
the doc gives a good overview.
I could see #1 (column mapping) being something we could add as a feature to
existing datasets.
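To make the column mapping concrete, here is a minimal plain-Python sketch of the idea. It is not Arrow or Iceberg API: the function name, the dict-based "file" and "schema" shapes, and the sample values are all hypothetical, and filling absent columns with None is a simplification of what the spec's column projection rules describe.

```python
# Hypothetical sketch: columns read from a data file are keyed by the
# identifier stored in the file (the Iceberg spec calls these field IDs),
# and the table schema tells us which name each identifier currently has.

def project_by_field_id(file_columns, table_schema):
    """Rename file columns to their current table names.

    file_columns: dict mapping field ID -> list of values read from the file
    table_schema: dict mapping field ID -> current column name
    Columns in the schema that the file does not contain are filled with
    None (a simplification for illustration).
    """
    num_rows = len(next(iter(file_columns.values()), []))
    return {
        name: file_columns.get(field_id, [None] * num_rows)
        for field_id, name in table_schema.items()
    }


# A file written before the third column was added to the table.
file_columns = {1: [10, 20, 30], 2: ["a", "b", "c"]}
table_schema = {1: "id", 2: "label", 3: "added_later"}

projected = project_by_field_id(file_columns, table_schema)
```

Because the lookup is by identifier rather than by name, a column renamed in the table schema would still resolve to the same data in older files, which is why this seems addable to existing datasets as a projection step.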
But for #2, it seems to me that we would need to implement an IcebergDataset
and IcebergFragment like Weston suggested.
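As a rough illustration of why #2 needs per-fragment logic, here is a minimal plain-Python sketch of filtering one data file's rows against both kinds of deletes. Everything here is hypothetical (the function name, the tuple and dict shapes, the file paths), and it deliberately ignores the spec's rules for deciding which delete files apply to which data files, which is where most of the complexity lies.

```python
# Hypothetical sketch: apply row-level deletes while reading one data file.
# Position deletes name specific rows as (file path, row position) pairs;
# equality deletes are modeled here as dicts of column -> value, and a row
# is deleted when it matches every column in one of those dicts.

def apply_deletes(rows, file_path, position_deletes, equality_deletes):
    """Return the rows of one data file that survive the given deletes."""
    deleted_positions = {
        pos for path, pos in position_deletes if path == file_path
    }
    kept = []
    for pos, row in enumerate(rows):
        if pos in deleted_positions:
            continue  # removed by a position delete
        if any(
            all(row.get(col) == val for col, val in predicate.items())
            for predicate in equality_deletes
        ):
            continue  # removed by an equality delete
        kept.append(row)
    return kept


rows = [
    {"id": 1, "label": "a"},
    {"id": 2, "label": "b"},
    {"id": 3, "label": "c"},
    {"id": 4, "label": "b"},
]
position_deletes = [("data/file-a.parquet", 0), ("data/file-b.parquet", 1)]
equality_deletes = [{"label": "b"}]

kept = apply_deletes(
    rows, "data/file-a.parquet", position_deletes, equality_deletes
)
```

The filter depends on the identity of the file being read and on delete data stored elsewhere, so it cannot be expressed as a plain list of files handed to a dataset factory.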
> [C++][R][Python] Support reading from Apache Iceberg tables
> -----------------------------------------------------------
>
> Key: ARROW-15135
> URL: https://issues.apache.org/jira/browse/ARROW-15135
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Will Jones
> Priority: Major
>
> This is an umbrella issue for supporting the [Apache Iceberg table
> format|https://iceberg.apache.org/].
> Dremio has a good overview of the format here:
> https://www.dremio.com/apache-iceberg-an-architectural-look-under-the-covers/