Hi Arrow Team,

Arrow is fantastic for working with the Parquet file format.
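For example, with the Python bindings, scanning a directory of Parquet files into a table takes only a few lines (a minimal sketch; the path and column names are made up):

    import pyarrow.dataset as ds

    # Treat every Parquet file under the directory as one logical dataset
    dataset = ds.dataset("/data/events", format="parquet")

    # Materialize a column projection as an in-memory Arrow table
    table = dataset.to_table(columns=["user_id", "event_time"])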
There is an increasing desire to be able to update, delete, and insert rows stored in Parquet files without rewriting the files in their entirety. It is not uncommon to have gigabytes to petabytes of data stored in Parquet files, so having to rewrite all of it for a single update is non-trivial. The following projects promote that they can bring update/delete/insert support to Parquet:

* Apache Hudi - https://hudi.apache.org/
* Apache Iceberg - https://iceberg.apache.org/

These projects combine a base Parquet file with one or more "update" files stored using ORC or Avro. Clients that want to read the rows merge the data in the Parquet file with the "update" files to determine which rows currently exist (a rough sketch of that merge is in the postscript below). Periodically, the formats "compact" the accumulated updates and rewrite a new, "optimized" Parquet file.

Both projects currently require Apache Spark to write data. I'd like to be able to use these formats from any language that Arrow supports, and I'd like to avoid the complexity of operating a Spark cluster. Since Arrow already supports tabular datasets and Parquet, is there anything on the roadmap for Arrow to support these formats? They will most likely become increasingly popular across various industries.

Rusty
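P.S. To make the read path concrete, here is a rough Python sketch of the merge a reader has to perform. The file names and the single position-delete file are made up for illustration (loosely modeled on Iceberg's position deletes); the real formats track updates and deletes in more elaborate ways:

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # Hypothetical layout: one base file plus one "deletes" file listing the
    # row positions that were logically deleted after the base was written.
    base = pq.read_table("base.parquet")
    deleted = pq.read_table("deletes.parquet").column("pos")

    # Keep only the rows whose position does NOT appear in the delete list.
    positions = pa.array(range(base.num_rows), type=pa.int64())
    mask = pc.invert(pc.is_in(positions, value_set=deleted))
    live_rows = base.filter(mask)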