I wanted to chime in: a long-term goal I am currently working towards is a Golang Iceberg implementation that will also integrate with the Golang Arrow modules.
I'm not sure how much desire there is for it, but I do know at least two
consumers that would greatly benefit from it. But, at least for the Go
implementation, it will be a while before I can implement it fully.

--Matt

On Mon, Oct 3, 2022, 10:03 AM Will Jones <will.jones...@gmail.com> wrote:

> Hi Rusty,
>
> Note we discussed Iceberg a while ago [1]. I don't think we've discussed
> Hudi in any depth.
>
> As I see it, we are waiting on three things:
>
> 1. Someone willing to move forward the Iceberg / Hudi integration.
> 2. The Iceberg and Hudi projects need native libraries that we can use.
> The base implementations are all Java, which isn't practical to integrate
> with our C++ implementation (and the Python/R/Ruby bindings). But I think
> these formats are complex enough that it's best to develop the core
> implementation within the respective community, rather than within the
> Arrow repo. There was a discussion to start a C++/Rust implementation for
> Iceberg [2], but I haven't seen any progress so far. I haven't been
> watching Hudi.
> 3. We need a model for extending Arrow C++ datasets in separate packages,
> or else we contribute to the package-size problem you mentioned in your
> other thread [3].
>
> As a personal project, I've been working on integrating the Delta Lake
> Rust implementation [4] with PyArrow. The community in that repo is
> pretty invested in Arrow and has others working on integration with the
> Rust query engines (such as Polars and DataFusion). Early next year I
> hope to extend those bindings to C++ and R, hopefully paving a path for
> solving issue (3) for the other table formats.
>
> Best,
>
> Will Jones
>
> [1] https://issues.apache.org/jira/browse/ARROW-15135
> [2] https://lists.apache.org/thread/lf8gw4yk9c6l580o6k7mobg2y91rpjvp
> [3] https://lists.apache.org/thread/mdr05pjzlq01dwwcwz21sz6ol3dkkylz
> [4] https://github.com/delta-io/delta-rs
>
> On Mon, Oct 3, 2022 at 5:25 AM Rusty Conover <ru...@conover.me.invalid>
> wrote:
>
> > Hi Arrow Team,
> >
> > Arrow is fantastic for manipulating the Parquet file format.
> >
> > There is an increasing desire to update, delete, and insert rows
> > stored in Parquet files without rewriting the Parquet files in their
> > entirety. It is not uncommon to have gigabytes or petabytes of data
> > stored in Parquet files, so having to rewrite all of it for an update
> > is non-trivial.
> >
> > The following projects advertise update/delete/insert support on top
> > of Parquet:
> >
> > * Apache Hudi - https://hudi.apache.org/
> > * Apache Iceberg - https://iceberg.apache.org/
> >
> > These projects combine a Parquet file with one or more "update" files
> > stored using ORC or Avro. Clients that want to read the rows combine
> > the data stored in the Parquet file with the "update" files to
> > determine which rows exist. Occasionally, the formats "compact" the
> > accumulated updates and rewrite a new "optimized" Parquet file.
> >
> > Both projects require Apache Spark to write data.
> >
> > I'd like to be able to use these formats in any language that Arrow
> > supports, and I'd like to avoid the complexity of operating a Spark
> > cluster.
> >
> > Since Arrow supports tabular datasets and Parquet, is there anything
> > on the roadmap for Arrow to support these formats?
> >
> > These formats will most likely become increasingly popular in various
> > industries.
> >
> > Rusty
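
P.S. For anyone who wants a feel for the "merge-on-read" pattern Rusty
describes, here is a rough PyArrow sketch. It's a deliberate
simplification rather than either format's actual spec: the file paths
and the delete file's "pos" column are hypothetical, and it only models
positional deletes, not the full update semantics these formats define.

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # Base data file plus a companion "delete" file; both paths are
    # hypothetical stand-ins for what a table format's metadata would
    # point to.
    base = pq.read_table("data/base.parquet")
    deletes = pq.read_table("data/deletes.parquet")

    # Assume the delete file records the row positions to drop in a
    # "pos" column. Number the base rows, then keep only the survivors.
    positions = pa.array(range(base.num_rows), type=pa.int64())
    deleted = pc.is_in(positions, value_set=deletes.column("pos"))
    live = base.filter(pc.invert(deleted))

A reader has to repeat this merge on every scan until a compaction
rewrites the base file, which is exactly why native, Arrow-friendly
implementations of these formats would be so useful.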
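
And for Delta Lake specifically, Will's delta-rs work is already usable
from Python today. Assuming the deltalake package from that repo (the
table path here is made up), reading a table into Arrow looks roughly
like:

    from deltalake import DeltaTable

    # Point at an existing Delta table; this path is hypothetical.
    dt = DeltaTable("path/to/delta_table")
    table = dt.to_pyarrow_table()  # current snapshot as a pyarrow.Table

That kind of one-call handoff into Arrow is what I'd hope an Iceberg or
Hudi integration could eventually offer as well.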