Re: [DISCUSS] C-level in-process array protocol

Wes McKinney Tue, 08 Oct 2019 13:35:13 -0700

hi Jacques,

On Tue, Oct 8, 2019 at 1:54 PM Jacques Nadeau <[email protected]> wrote:
>
> I am removing all my objections to this work.
>
> I wish there was more feedback from additional community members. I continue 
> to be concerned about fragmentation. I don't agree with the arguments here 
> that we need to add a new API to make it easy for people to *not* use the Arrow
> codebase. It seems like a punt on building useful libraries within the 
> project that will ultimately hurt the interoperability story.
>


I think we'll have to take a "wait and see" approach. I believe the
community needs to build accessible libraries that offer value to
third party users, and we will continue to do that. I think there are
use cases here that fall outside of which library to use, but time
will tell.

> As a side note, it seems like much of this is about people's distaste for 
> flatbuffers. I know I regret using it. If we had a chance to do it over 
> again, I would have chosen to use protobuf for everything except the data 
> header, where I would hand write the encoding (since it is so simple anyway). 
> If it is such a problem that people are contorting to work around it, maybe 
> we should address that? Just a thought.
>

I think that using a Protobuf-like format with an IDL and a compiler presents a problem.

Note that Flatbuffers is much better for C/C++ programmers and I still
think it was the right choice for the project. Unlike with Flatbuffers,
C/C++ applications using Protobuf must link either libprotobuf.so or
libprotobuf.a. Flatbuffers in C++ is a header-only dependency that is
trivial to bundle with a project [1]. The same is true for Thrift, and
this came up in the TF discussion [2].

[1]: https://github.com/apache/arrow/tree/master/cpp/thirdparty/flatbuffers/include/flatbuffers
[2]: https://github.com/tensorflow/community/pull/162#discussion_r332610486

> Thanks for the discourse and patience.
>
> On Wed, Oct 2, 2019 at 10:12 PM Micah Kornfield <[email protected]> wrote:
>>
>> Hi Wes,
>> I agree that for third parties, "A" (Field data structures) is the most useful.
>>
>> At least in my mind the discussion was for both first and third-parties.  I
>> was trying to point out that "A" is less necessary as a first step for
>> first-party integrations and could potentially require more effort if we
>> already have the code that does "B" (field reassembly).
>>
>> Thanks,
>> Micah
>>
>> On Wed, Oct 2, 2019 at 10:28 PM Wes McKinney <[email protected]> wrote:
>>
>> > On Wed, Oct 2, 2019 at 11:05 PM Micah Kornfield <[email protected]>
>> > wrote:
>> > >
>> > > I've tried to summarize my understanding of the debate so far and give
>> > > some initial thoughts. I think there are two potentially different sets
>> > > of users that we are targeting with a stable C API/ABI: ourselves and
>> > > external parties.
>> > >
>> > > 1.  Different language implementations within the Arrow project that want
>> > > to call into each other's code.  We still don't have a great story around
>> > > this in terms of reusable libraries, and questions like [1] are
>> > > motivating examples of making something better in this context.
>> > > 2.  Third parties wishing to support/integrate with Arrow.  Some
>> > > conjectures about these users:
>> > >   - Users in this group are NOT necessarily familiar with existing
>> > > technologies Arrow uses (i.e. flatbuffers)
>> > >   - The stability of the API is the primary concern (consumers don't want
>> > > to change when a new version of the library ships)
>> > >   - An important secondary concern is additional libraries that need to
>> > be
>> > > integrated in addition to the API
>> > >
>> > > The main debate points seem to be:
>> > >
>> > > 1.  Vector/Array oriented API vs existing Record Batch.  Will an
>> > additional
>> > > column oriented API become too much of a maintenance headache/cause
>> > > fragmentation?
>> > >
>> > >  - In my mind the question here is which set of users we are
>> > prioritizing.
>> > > IMO the combination of flatbuffers and translation to/from RecordBatch
>> > > format offers too much friction to make it easy for a third-party
>> > > implementer to use. If we are prioritizing for our own internal
>> > use-cases I
>> > > think we should try out a RecordBatch+Flatbuffers based C-API. We already
>> > > have all the necessary building blocks.
>> > >
>> >
>> > If a C function passes you a string containing a RecordBatch
>> > Flatbuffers message, what happens next? This message has to be
>> > reassembled into a recursive data structure before you can "do"
>> > anything with it. Are we expecting every third party project to
>> > implement:
>> >
>> > A. Data structures appropriate to represent a logical "field" in a
>> > record batch (which have to be recursive to account for nested types'
>> > children)
>> > B. The logic to convert from the flattened Flatbuffers representation
>> > to some implementation of A
>> >
>> > I'm arguing that we should provide both to third parties. To build B,
>> > you need A. Some consumers will only use A. This discussion is
>> > essentially about developing an ultraminimalist "drop-in" C
>> > implementation of A.
>> >
>> > > 2.  How onerous is the dependency on Flatbuffers, both from a learning
>> > > curve perspective and as a dependency for third-party integrators?
>> > > - Flatbuffers aren't entirely straight-forward and I think if we do move
>> > > forward with an API based on Column/Array we should consider alternatives
>> > > as long as the necessary parsing code can be done in a small amount of
>> > code
>> > > (I'm personally against JSON for this, but can see the arguments for it).
>> > >
>> > > 3.  Do all existing library implementations need to support the
>> > > Column/Array ABI?  How will compliance be checked for the new API/ABI?
>> > >
>> > > - I'm still thinking this through.
>> > >
>> > > [1]
>> > >
>> > https://lists.apache.org/thread.html/18244b294d0b9bd568b5cfd1b1ac2b6a25088383a08202cc7a8a3563@%3Cuser.arrow.apache.org%3E
>> > >
>> > > On Wed, Oct 2, 2019 at 6:46 PM Jacques Nadeau <[email protected]>
>> > wrote:
>> > >
>> > > > I'd like to hear more opinions from others on this topic. This
>> > conversation
>> > > > seems mostly dominated by comments from myself, Wes and Antoine.
>> > > >
>> > > > I think it is reasonable to argue that keeping any ABI (or
>> > header/struct
>> > > > pattern) as narrow as possible would allow us to minimize overlap with
>> > the
>> > > > existing in-memory specification. In Arrow's case, this could be as
>> > simple
>> > > > as a single memory pointer for schema (backed by flatbuffers) and a
>> > single
>> > > > memory location for data (that references the record batch header,
>> > which in
>> > > > turn provides pointers into the actual arrow data). Extensions would
>> > need
>> > > > to be added for reference management as done here but I continue to
>> > think
>> > > > we should defer discussion of that until the base data structures are
>> > > > resolved. I see the comments here as arguing for a much broader ABI, in
>> > > > part to support having people build "Arrow" components that
>> > interconnect
>> > > > using this new interface. I understand the desire to expand the ABI to
>> > be
>> > > > driven by needs to reduce dependencies and ease usability.
>> > > >
>> > > > The representation within the related patch is being presented as a
>> > way for
>> > > > applications to share Arrow data but is not easily accessible to all
>> > > > languages. I want to avoid a situation where someone says "I produced
>> > an
>> > > > Arrow API" when what they've really done is created a C interface which
>> > > > only a small subset of languages can actually leverage. For example,
>> > every
>> > > > language now knows how to parse the existing schema definition as
>> > rendered
>> > > > in flatbuf. In order to interact with something that implements this
>> > new
>> > > > pattern one would also be required to implement completely new schema
>> > > > consumption code. In the proposal itself it suggests this (for example
>> > > > enhancing the C++ library to consume structures produced this way).
>> > > >
>> > > > As I said, I really want to hear more opinions. Running this past
>> > various
>> > > > developers I know, many have echoed my concerns but that really doesn't
>> > > > matter (and who knows how much of that is colored by my presentation
>> > of the
>> > > > issue). What do people here think? If someone builds an "Arrow" library
>> > > > that implements this set of structures, how does one use it in Node? In
>> > > > Java? Does it drive creation of a secondary set of interfaces in each
>> > of
>> > > > those languages to work with this kind of pattern? (For example, in a
>> > JVM
>> > > > view of the world, working with a plain struct in java rather than a
>> > set of
>> > > > memory pointers against our existing IPC formats would be quite
>> > painful and
>> > > > we'd definitely need to create some glue code for users. I worry the
>> > same
>> > > > pattern would occur in many other languages.)
>> > > >
>> > > > To respond directly to some of Wes's most recent comments from the
>> > email
>> > > > below. I struggle to map your description of the situation to the rest
>> > of
>> > > > the thread and the proposed patch.  For example, you say that a
>> > non-goal is
>> > > > "creating a new canonical way to serialize metadata" but the patch
>> > > > proposes a concrete string based encoding system to describe data
>> > types.
>> > > > Aren't those things in conflict?
>> > > >
>> > > > I'll also think more on this and challenge my own perspective. This
>> > isn't
>> > > > where my focus is so my comments aren't as developed/thoughtful as I'd
>> > > > like.
>> > > >
>> > > >
>> > > > On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <[email protected]>
>> > wrote:
>> > > >
>> > > > > hi Jacques,
>> > > > >
>> > > > > I think we've veered off course a bit and maybe we could reframe the
>> > > > > discussion.
>> > > > >
>> > > > > Goals
>> > > > > * A "drop-in" header-only C file that projects can use as a
>> > > > > programming interface either internally only or to expose in-memory
>> > > > > data structures between C functions at call sites. Ideally little to
>> > > > > no disassembly/reassembly should be required on either "side" of the
>> > > > > call site.
>> > > > > * Simplifying adoption of Arrow for C programmers, or languages based
>> > > > > around C FFI
>> > > > >
>> > > > > Non-goals
>> > > > > * Expanding the columnar format or creating an alternative canonical
>> > > > > in-memory representation
>> > > > > * Creating a new canonical way to serialize metadata
>> > > > >
>> > > > > Note that this use case has been on my mind for more than 2 years:
>> > > > > https://issues.apache.org/jira/browse/ARROW-1058
>> > > > >
>> > > > > I think there are a couple of potentially misleading things at play
>> > here
>> > > > >
>> > > > > 1. The use of the word "protocol". In C, a struct has a well-defined
>> > > > > binary layout, so a C API is also an ABI. Using C structs to
>> > > > > communicate data can be considered to be a protocol, but it means
>> > > > > something different in the context of the "Arrow protocol". I think
>> > we
>> > > > > need to call this a "C API"
>> > > > >
>> > > > > 2. The documentation for this in Antoine's PR is in the format/
>> > > > > directory. It would probably be better to have a "C API" section in
>> > > > > the documentation.
>> > > > >
>> > > > > The header file under discussion and the documentation about it is
>> > > > > best considered as a "library".
>> > > > >
>> > > > > It might be useful at some point to create a C99 implementation of
>> > the
>> > > > > IPC protocol as well using FlatCC with the goal of having a complete
>> > > > > implementation of the columnar format in C with minimal binary
>> > > > > footprint. This is analogous to the NanoPB project which is an
>> > > > > implementation of Protocol Buffers with small code size
>> > > > >
>> > > > > https://github.com/nanopb/nanopb
>> > > > >
>> > > > > Let me know if this makes more sense.
>> > > > >
>> > > > > I think it's important to communicate clearly about this primarily
>> > for
>> > > > > the benefit of the outside world which can confuse easily as we have
>> > > > > observed over the last few years =)
>> > > > >
>> > > > > Wes
>> > > > >
>> > > > > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <[email protected]>
>> > > > wrote:
>> > > > > >
>> > > > > > I disagree with this statement:
>> > > > > >
>> > > > > > - the IPC format is meant for serialization while the C data
>> > protocol
>> > > > is
>> > > > > > meant for in-memory communication, so different concerns apply
>> > > > > >
>> > > > > > If that is how a particular implementation presents it, that is a
>> > > > > > weakness of the implementation, not the format. The primary use
>> > case
>> > > > I
>> > > > > > was focused on when working on the initial format was communication
>> > > > > within
>> > > > > > the same process. It seems like this is being used as a basis for
>> > the
>> > > > > > introduction of new things when the premise is inconsistent with
>> > the
>> > > > > > intention of the creation. The specific reason we used flatbuffers
>> > in
>> > > > the
>> > > > > > project was to collapse the separation of in-process and
>> > out-of-process
>> > > > > > communication. It means the same thing it does with the Arrow data
>> > > > > itself:
>> > > > > > that a consumer doesn't have to use a particular library to
>> > interact
>> > > > with
>> > > > > > and use the data.
>> > > > > >
>> > > > > > It seems like there are two ideas here:
>> > > > > >
>> > > > > > 1) How do we make it easier for people to use Arrow?
>> > > > > > 2) Should we implement a new in memory representation of Arrow
>> > that is
>> > > > > > language specific?
>> > > > > >
>> > > > > > I'm entirely in support of number one. If for a particular type of
>> > > > > domain,
>> > > > > > people want an easier way to interact with Arrow, let's make a new
>> > > > > library
>> > > > > > that helps with that. In each of our current libraries, we do many
>> > > > things
>> > > > > > to make it easier to work with Arrow. None of those require a
>> > change to
>> > > > > the
>> > > > > > core format or are formalized as a new in-memory standard. The
>> > > > in-memory
>> > > > > > representation of rust or javascript or java objects are
>> > implementation
>> > > > > > details.
>> > > > > >
>> > > > > > I'm against number two as it creates a fragmentation problem.
>> > Arrow is
>> > > > > > about having a single canonical format for memory for both
>> > metadata and
>> > > > > > data. Having multiple in-memory formats (especially when some are
>> > not
>> > > > > > language independent) is counter to the goals of the project.
>> > > > >
>> > > > > I don't think anyone is proposing anything that would cause
>> > > > fragmentation.
>> > > > >
>> > > > > A central question is whether it is useful to define a reusable C ABI
>> > > > > for the Arrow columnar format, and if there is sufficient interest, a
>> > > > > tiny C implementation of the IPC protocol (which uses the Flatbuffers
>> > > > > message) that assembles and disassembles the data structures defined
>> > > > > in the C ABI.
>> > > > >
>> > > > > We could separately create a tiny implementation of the Arrow IPC
>> > > > > protocol using FlatCC that could be dropped into applications
>> > > > > requiring only a C compiler and nothing else.
>> > > > >
>> > > > >
>> > > > > >
>> > > > > > Two other, separate comments:
>> > > > > > 1) I don't understand the idea that we need to change the way Arrow
>> > > > > > fundamentally works so that people can avoid using a dependency.
>> > If the
>> > > > > > dependency is small, open source and easy to build, people can
>> > fork it
>> > > > > and
>> > > > > > include directly if they want to. Let's not violate project
>> > principles
>> > > > > > because DuckDB has a religious perspective on dependencies. If the
>> > > > > problem
>> > > > > > is people have to swallow too large of a pill to do basic things
>> > with
>> > > > > Arrow
>> > > > > > in C, let's focus on fixing that (to our definition of ease, not
>> > > > someone
>> > > > > > else's). If FlatCC solves some of those things, great. If we need to
>> > > > build a
>> > > > > > baby integration library that is more C centric, great. Neither of
>> > > > those
>> > > > > > things require implementing something at the format level.
>> > > > > >
>> > > > > > 2) It seems like we should discuss the data structure problem
>> > > > separately
>> > > > > > from the reference management concern.
>> > > > > >
>> > > > > >
>> > > > > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <[email protected]>
>> > > > wrote:
>> > > > > >
>> > > > > > > hi Antoine,
>> > > > > > >
>> > > > > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <
>> > [email protected]>
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Le 01/10/2019 à 00:39, Wes McKinney a écrit :
>> > > > > > > > > A couple things:
>> > > > > > > > >
>> > > > > > > > > * I think a C protocol / FFI for Arrow array/vectors would be
>> > > > > better
>> > > > > > > > > to have the same "shape" as an assembled array. Note that
>> > the C
>> > > > > > > > > structs here have very nearly the same "shape" as the data
>> > > > > structure
>> > > > > > > > > representing a C++ Array object [1]. The disassembly and
>> > > > reassembly
>> > > > > > > > > here is substantially simpler than the IPC protocol. A
>> > recursive
>> > > > > > > > > structure in Flatbuffers would make RecordBatch messages much
>> > > > > larger,
>> > > > > > > > > so the flattened / disassembled representation we use for
>> > > > > serialized
>> > > > > > > > > record batches is the correct one
>> > > > > > > >
>> > > > > > > > I'm not sure I agree:
>> > > > > > > >
>> > > > > > > > - indeed, it's not a coincidence that the ArrowArray struct
>> > looks
>> > > > > quite
>> > > > > > > > closely like the C++ ArrayData object :-)  We have good
>> > experience
>> > > > > with
>> > > > > > > > that abstraction and it has proven to work quite well
>> > > > > > > >
>> > > > > > > > - the IPC format is meant for serialization while the C data
>> > > > > protocol is
>> > > > > > > > meant for in-memory communication, so different concerns apply
>> > > > > > > >
>> > > > > > > > - the fact that this makes the layout slightly larger doesn't
>> > seem
>> > > > > > > > important at all; we're not talking about transferring data
>> > over
>> > > > the
>> > > > > wire
>> > > > > > > >
>> > > > > > > > There's also another argument for having a recursive struct: it
>> > > > > > > > simplifies how the data type is represented, since we can
>> > encode
>> > > > each
>> > > > > > > > child type individually instead of encoding it in the parent's
>> > > > format
>> > > > > > > > string (same applies for metadata and individual flags).
>> > > > > > > >
>> > > > > > >
>> > > > > > > I was saying something different here. I was making an argument
>> > about
>> > > > > > > why we use the flattened array-of-structs in the IPC protocol.
>> > One
>> > > > > > > reason is that it's a more compact representation. That is not
>> > very
>> > > > > > > important here because this protocol is only for *in-process*
>> > (for
>> > > > > > > languages that have a C FFI facility) rather than *inter-process*
>> > > > > > > communication.
>> > > > > > >
>> > > > > > > I agree also that the type encoding is simple, here, too, since
>> > we
>> > > > > > > aren't having to split the schema and record batch between
>> > different
>> > > > > > > serialized messages. There is some potential waste with having to
>> > > > > > > populate the type fields multiple times when communicating a
>> > sequence
>> > > > > > > of "chunks" from the same logical dataset.
>> > > > > > >
>> > > > > > > > > * The "formal" C protocol having the "assembled" shape means
>> > that
>> > > > > many
>> > > > > > > > > minimal Arrow users won't have to implement any separate data
>> > > > > > > > > structures. They can just use the C struct directly or a
>> > slightly
>> > > > > > > > > wrapped version thereof with some convenience functions.
>> > > > > > > >
>> > > > > > > > Yes, but the same applies to the current proposal.
>> > > > > > > >
>> > > > > > > > > * I think that requiring building a Flatbuffer for minimal
>> > use
>> > > > > cases
>> > > > > > > > > (e.g. communicating simple record batches with primitive
>> > types)
>> > > > > passes
>> > > > > > > > > on implementation burden to minimal users.
>> > > > > > > >
>> > > > > > > > It certainly does.
>> > > > > > > >
>> > > > > > > > > I think the mantra of the C protocol should be the following:
>> > > > > > > > >
>> > > > > > > > > * Users of the protocol have to write little to no code to
>> > use
>> > > > it.
>> > > > > For
>> > > > > > > > > example, populating an INT32 array should require only a few
>> > > > lines
>> > > > > of
>> > > > > > > > > code
>> > > > > > > >
>> > > > > > > > Agreed.  As a sidenote, the spec should have an example of
>> > doing
>> > > > > this in
>> > > > > > > > raw C.
>> > > > > > > >
>> > > > > > > > Regards
>> > > > > > > >
>> > > > > > > > Antoine.
>> > > > > > >
>> > > > >
>> > > >
>> >
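
The quoted thread ends with a request for an example of populating an INT32
array in raw C, and it argues that a recursive struct lets each child carry
its own type. A minimal sketch of that idea follows. Everything in it is an
assumption made for illustration: the struct SketchArray, its field names,
the "i" format code, and the helpers sketch_make_int32 and sketch_release are
hypothetical and are not the layout or API from the PR under discussion.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical "assembled" array struct, for illustration only. */
struct SketchArray {
  const char *format;            /* type as a string; "i" = int32 is an assumed code */
  int64_t length;                /* number of logical values */
  int64_t null_count;            /* number of null values */
  int64_t n_buffers;             /* entries in `buffers` */
  int64_t n_children;            /* entries in `children` */
  const void **buffers;          /* [0] validity bitmap (NULL if no nulls), [1] values */
  struct SketchArray **children; /* recursive: nested types hold their children here */
  void (*release)(struct SketchArray *); /* producer-supplied cleanup callback */
};

/* Frees what sketch_make_int32 allocated and marks the struct as released. */
static void sketch_release(struct SketchArray *a) {
  if (a->release == NULL) return; /* already released */
  free((void *)a->buffers[1]);
  free(a->buffers);
  a->buffers = NULL;
  a->release = NULL;
}

/* Copies n int32 values into a newly allocated array with no nulls.
 * Returns 0 on success, -1 if allocation fails. */
static int sketch_make_int32(struct SketchArray *out, const int32_t *values,
                             int64_t n) {
  assert(out != NULL && n >= 0);
  int32_t *data = malloc((size_t)(n > 0 ? n : 1) * sizeof *data);
  const void **bufs = malloc(2 * sizeof *bufs);
  if (data == NULL || bufs == NULL) {
    free(data);
    free(bufs);
    return -1;
  }
  for (int64_t i = 0; i < n; i++) data[i] = values[i];
  bufs[0] = NULL; /* no validity bitmap because there are no nulls */
  bufs[1] = data;
  out->format = "i";
  out->length = n;
  out->null_count = 0;
  out->n_buffers = 2;
  out->n_children = 0; /* a primitive type has no children */
  out->buffers = bufs;
  out->children = NULL;
  out->release = sketch_release;
  return 0;
}
```

A consumer would read the values through buffers[1] and call release exactly
once when finished. A nested type would fill `children` with further
SketchArray pointers, each carrying its own format string.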
