I wanted to chime in: a long-term goal I am currently working towards is a Golang Iceberg implementation that will also integrate with the Golang Arrow modules.
I'm not sure how much desire there is for it, but I do know at least two
consumers that would greatly benefit from it. But, at least for the Go
implementation, it will be a while before I can implement it fully.

--Matt

On Mon, Oct 3, 2022, 10:03 AM Will Jones <will.jones...@gmail.com> wrote:

> Hi Rusty,
>
> Note we discussed Iceberg a while ago [1]. I don't think we've discussed
> Hudi in any depth.
>
> As I see it, we are waiting on three things:
>
> 1. Someone willing to move forward the Iceberg / Hudi integration.
> 2. The Iceberg and Hudi projects need native libraries that we can use.
> The base implementations are all Java, which isn't practical to integrate
> with our C++ implementation (and the Python/R/Ruby bindings). But I think
> these formats are complex enough that it's best to develop the core
> implementation within the respective community, rather than within the
> Arrow repo. There was a discussion to start a C++/Rust implementation for
> Iceberg [2], but I haven't seen any progress so far. I haven't been
> watching Hudi.
> 3. We need a model for extending Arrow C++ datasets in separate packages,
> or else we contribute to the package-size problem you mentioned in your
> other thread [3].
>
> As a personal project, I've been working on integrating the Delta Lake
> Rust implementation [4] with PyArrow. The community in that repo is
> pretty invested in Arrow and has others working on integration with the
> Rust query engines (such as Polars and DataFusion). Early next year I
> hope to extend those bindings to C++ and R, hopefully paving a path for
> solving issue (3) for the other table formats.
>
> Best,
>
> Will Jones
>
> [1] https://issues.apache.org/jira/browse/ARROW-15135
> [2] https://lists.apache.org/thread/lf8gw4yk9c6l580o6k7mobg2y91rpjvp
> [3] https://lists.apache.org/thread/mdr05pjzlq01dwwcwz21sz6ol3dkkylz
> [4] https://github.com/delta-io/delta-rs
>
> On Mon, Oct 3, 2022 at 5:25 AM Rusty Conover <ru...@conover.me.invalid>
> wrote:
>
> > Hi Arrow Team,
> >
> > Arrow is fantastic for manipulating the Parquet file format.
> >
> > There is an increasing desire to update, delete, and insert rows
> > stored in Parquet files without rewriting the Parquet files in their
> > entirety. It is not uncommon to have gigabytes or petabytes of data
> > stored in Parquet files, so having to rewrite all of it for an update
> > is non-trivial.
> >
> > The following projects advertise update/delete/insert support on top
> > of Parquet:
> >
> > * Apache Hudi - https://hudi.apache.org/
> > * Apache Iceberg - https://iceberg.apache.org/
> >
> > These projects combine a Parquet file with one or more "update" files
> > stored using ORC or Avro. Clients that want to read the rows combine
> > the data stored in the Parquet file with the "update" files to
> > determine which rows exist. Occasionally, the formats "compact" the
> > accumulated updates and rewrite a new "optimized" Parquet file.
> >
> > Both projects require Apache Spark to write data.
> >
> > I'd like to be able to use these formats in any language that Arrow
> > supports, and I'd like to avoid the complexity of operating a Spark
> > cluster.
> >
> > Since Arrow supports tabular datasets and Parquet, is there anything
> > on the roadmap for Arrow to support these formats?
> >
> > These formats will most likely become increasingly popular in various
> > industries.
> >
> > Rusty
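
P.S. For anyone who wants a feel for the "merge-on-read" pattern Rusty
describes, here is a rough PyArrow sketch. It's a deliberate
simplification rather than either format's actual spec: the file paths
and the delete file's "pos" column are hypothetical, and it only models
positional deletes, not the full update semantics these formats define.

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # Base data file plus a companion "delete" file; both paths are
    # hypothetical stand-ins for what a table format's metadata would
    # point to.
    base = pq.read_table("data/base.parquet")
    deletes = pq.read_table("data/deletes.parquet")

    # Assume the delete file records the row positions to drop in a
    # "pos" column. Number the base rows, then keep only the survivors.
    positions = pa.array(range(base.num_rows), type=pa.int64())
    deleted = pc.is_in(positions, value_set=deletes.column("pos"))
    live = base.filter(pc.invert(deleted))

A reader has to repeat this merge on every scan until a compaction
rewrites the base file, which is exactly why native, Arrow-friendly
implementations of these formats would be so useful.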
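
And for Delta Lake specifically, Will's delta-rs work is already usable
from Python today. Assuming the deltalake package from that repo (the
table path here is made up), reading a table into Arrow looks roughly
like:

    from deltalake import DeltaTable

    # Point at an existing Delta table; this path is hypothetical.
    dt = DeltaTable("path/to/delta_table")
    table = dt.to_pyarrow_table()  # current snapshot as a pyarrow.Table

That kind of one-call handoff into Arrow is what I'd hope an Iceberg or
Hudi integration could eventually offer as well.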