Re: [DISCUSS] Splitting out the Arrow format directory

Neal Richardson Thu, 12 Aug 2021 06:15:52 -0700

> Maintain this "Arrow types and ComputeIR library" as an always
> zero-dependency library to facilitate vendoring


Would/should this hypothetical zero-dep, vendorable library also include
the IPC format? Or if you want to interact with IPC in that case, the C
data interface is the best/only option?

On Thu, Aug 12, 2021 at 9:06 AM Wes McKinney <[email protected]> wrote:

> It seems that one adjacent problem here is how to make it simpler for
> third parties (especially ones that act as front end interfaces) to
> build and serialize/deserialize the IR structures with some kind of
> ready-to-go middleware library, written in a language like C++.
>
> To do that, one would need the equivalent of arrow/type.h and related
> Flatbuffers schema serialization code that lives in arrow/ipc. If you
> want to be able to completely and accurately serialize Schemas, you
> need quite a bit of code now.
>
> One possible approach (without going crazy) would be to:
>
> * Move arrow/type.h and its dependencies into a standalone C++
> library that can be vendored into the main apache/arrow C++ library. I
> don't know how onerous arrow/type.h's transitive dependencies /
> interactions are at this point (there's a lot of stuff going on in
> type.cc [1] now)
> * Make the namespaces exported by this library configurable, so any
> library can vendor the Arrow types / IR builder APIs privately into
> their project
> * Maintain this "Arrow types and ComputeIR library" as an always
> zero-dependency library to facilitate vendoring
> * Lightweight bindings in languages we care about (like Python or R or
> GLib/Ruby) could be built to the IR builder middleware library
>
> This seems to be more at issue than whether projects are copying the
> Flatbuffers files into their project from apache/arrow or
> apache/arrow-format.
>
> [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
>
> On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb <[email protected]> wrote:
> >
> > I support the idea of an independent repo that has the arrow flatbuffers
> > format definition files.
> >
> > My rationale is that the Rust implementation has a copy of the `format`
> > directory [1] and potential drift worries me (a bit). Having a single
> > source of truth for the format that is not part of the large mono repo
> > would be a good thing.
> >
> > Andrew
> >
> > [1] https://github.com/apache/arrow-rs/tree/master/format
> >
> > On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I'd like to bring up an idea from a recent thread ([1]) about moving the
> > > `format/` directory out of the primary apache/arrow repository.
> > >
> > > I understand from that thread there are some concerns about using
> > > submodules, and I definitely sympathize with those concerns.
> > >
> > > In talking with David Li (disclaimer: we work together at Voltron Data),
> > > he has a great idea that I think makes everyone happy: an
> > > `apache/arrow-format` repository that is the official mirror for the
> > > flatbuffers IDL, that library authors should use as the source of truth.
> > >
> > > It doesn't require a submodule, yet it also allows external projects the
> > > ability to access the IDL without having to interact with the main arrow
> > > repository and is backwards compatible to boot.
> > >
> > > In this scenario, repositories that are currently copying in the
> > > flatbuffers IDL can migrate to this repository at their leisure.
> > >
> > > My motivation for this was around sharing data structures for the compute
> > > IR proposal ([2]).
> > >
> > > I can think of at least two ways for IR producers and consumers of all
> > > languages to share the flatbuffers IDL:
> > >
> > > 1. A set of bindings built in some language that other languages can
> > >    integrate with, likely C++, that allows library users to build IR
> > >    using an API.
> > >
> > > The primary downside to this is that we'd have to deal with
> > > building another library while working out any kinks in the IR design and
> > > I'd rather avoid that in the initial phases of this project.
> > >
> > > The benefit is that IR components don't interact much with `flatbuffers`
> > > or `flatc` directly.
> > >
> > > 2. A single location where the format lives, that doesn't require
> > >    depending on a large multi-language repository to access a handful of
> > >    files.
> > >
> > > I think the downside to this is that there's a bit of additional
> > > infrastructure to automate copying in `arrow-format`.
> > >
> > > The benefit there is that producers and consumers can immediately start
> > > getting value from compute IR without having to wait for development of a
> > > new API.
> > >
> > > One counter-proposal might be to just put the compute IR IDL in a separate
> > > repo, but that isn't tenable because the compute IR needs arrow's type
> > > information contained in `Schema.fbs`.
> > >
> > > I was hoping to avoid conflating the discussion about bindings vs direct
> > > flatbuffer usage (at least initially just supporting one, I predict we'll
> > > need both ultimately) with the decision about whether to split out the
> > > format directory, but it's a good example of a choice that would be
> > > well served by splitting out the format directory.
> > >
> > > I'll note that this doesn't block anything on the compute IR side, just
> > > wanted to surface this in a parallel thread and see what folks think.
> > >
> > > [1]: https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E
> > > [2]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l
> > >
>
