Hi Arrow Team,

Arrow is fantastic for reading and writing data in the Parquet file
format.

There is an increasing desire to be able to update, delete, and
insert the rows stored in Parquet files without rewriting the Parquet
files in their entirety.  It is not uncommon to have gigabytes or even
petabytes of data stored in Parquet, so rewriting all of it for every
update is a non-trivial cost.

The following projects advertise that they bring update/delete/insert
support to Parquet:

* Apache Hudi - https://hudi.apache.org/
* Apache Iceberg - https://iceberg.apache.org/

These projects combine a base Parquet file with one or more "update"
files stored using ORC or Avro.  Clients that want to read the rows
merge the data in the base Parquet file with the "update" files to
determine which rows currently exist (a rough sketch of this
merge-on-read pattern is below).  Periodically, these formats
"compact" the accumulated updates and rewrite a new "optimized"
Parquet file.
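
As a minimal illustration of the read side, here is roughly how that
merge might look with pyarrow.  The file names and the single "id" key
column are hypothetical; real Hudi/Iceberg layouts are considerably
more involved:

    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # Read the immutable base file and a companion "delete" file that
    # lists the keys of deleted rows (both file names are made up).
    base = pq.read_table("base.parquet")
    deletes = pq.read_table("deletes.parquet")

    # Merge-on-read: keep only the base rows whose key is absent from
    # the delete file.  Upserts would similarly be merged in by key.
    deleted = pc.is_in(base["id"], value_set=deletes["id"])
    live_rows = base.filter(pc.invert(deleted))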

Both projects currently require Apache Spark in order to write data.

I'd like to be able to use these formats in any language that Arrow
supports, and I'd like to avoid the complexity of operating a Spark cluster.

Since Arrow already supports tabular datasets and Parquet, is there
anything on the roadmap for Arrow to support these formats?
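
For comparison, the existing dataset API already makes a directory of
plain Parquet files a one-liner to scan; something equally simple for
Hudi/Iceberg tables would be ideal (the path and filter below are
hypothetical):

    import pyarrow.dataset as ds

    # Scanning a directory of plain Parquet files works today; a
    # Hudi/Iceberg-aware dataset could expose the same interface.
    dataset = ds.dataset("/data/events", format="parquet")
    table = dataset.to_table(filter=ds.field("year") == 2021)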

These formats seem likely to become increasingly popular across a
variety of industries.

Rusty
