It's possible we could wrap Iceberg et al. in Flight SQL to provide this, exposing Iceberg metadata via the Flight SQL metadata endpoints and table reads via Substrait plans. (Clients could send Substrait plans through ADBC, and we could integrate ADBC as a type of dataset.) I'm not familiar enough with Iceberg to know whether the core libraries alone are enough, or whether we'd need an attached query engine (like Spark) to support all of the features (like row-level updates/deletes).
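
To make this concrete, the client side of such a setup might look roughly like the sketch below. Everything here is illustrative: the gateway URI, table name, and plan file are made up, and the ADBC Python details (adbc_driver_flightsql.dbapi.connect, adbc_get_objects, fetch_arrow_table, and passing a serialized Substrait plan as bytes to execute) reflect my understanding of the bindings, not something I've verified against an Iceberg-backed service.

# Hypothetical Flight SQL service fronting an Iceberg catalog; the URI and
# table names are placeholders. Treat this as a sketch, not a tested example.
import adbc_driver_flightsql.dbapi as flightsql

with flightsql.connect("grpc://iceberg-gateway:31337") as conn:
    with conn.cursor() as cur:
        # Iceberg metadata surfaced through the generic ADBC/Flight SQL
        # catalog calls, so clients never touch Iceberg-specific APIs.
        print(conn.adbc_get_objects(depth="tables").read_all())

        # Plain SQL pass-through, if the service has an engine attached.
        cur.execute("SELECT * FROM warehouse.events LIMIT 10")
        print(cur.fetch_arrow_table())

        # Or hand the service a serialized Substrait plan describing the
        # table scan; my understanding is that the ADBC DBAPI treats a
        # bytes "query" as a Substrait plan.
        with open("scan_events.substrait", "rb") as f:
            cur.execute(f.read())
        print(cur.fetch_arrow_table())

The nice property is that all of the Iceberg-specific logic stays behind the service boundary, so on the datasets side we would only need one generic "ADBC dataset" type rather than per-format integrations.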
On Mon, Oct 3, 2022, at 11:25, Antoine Pitrou wrote:
> Hi all,
>
> On 03/10/2022 at 17:03, Will Jones wrote:
>> Hi Rusty,
>>
>> Note we discussed Iceberg a while ago [1]. I don't think we've discussed
>> Hudi in any depth.
>>
>> As I see it, we are waiting on three things:
>>
>> 1. Someone willing to move the Iceberg / Hudi integration forward.
>> 2. The Iceberg and Hudi projects need native libraries that we can use.
>> The base implementations are all Java, which isn't practical to integrate
>> with our C++ implementation (and the Python/R/Ruby bindings). But I think
>> these formats are complex enough that it's best to develop the core
>> implementation within the respective community, rather than within the
>> Arrow repo. There was a discussion about starting a C++/Rust
>> implementation for Iceberg [2], but I haven't seen any progress so far.
>> I haven't been watching Hudi.
>> 3. We need a model for extending Arrow C++ datasets in separate packages,
>> or else we contribute to the package size problem you mentioned in your
>> other thread [3].
>
> There may be other potential ways forward, such as integrating
> Iceberg/Hudi using a Flight or ADBC endpoint.
>
> Regards
>
> Antoine.