Re: [DISCUSS] C Data Interface, take 2

Wes McKinney Tue, 21 Jan 2020 15:03:14 -0800

Thanks Jacques. I agree that none of the ways forward on this problem
are wholly satisfactory. We should encourage users of this C API to
prefer emitting byte-aligned / 0-offset data in line with the IPC spec
wherever possible. It will be interesting to see after a period of
time how downstream projects are able to leverage this interface as
part of their overall Arrow adoption.
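
For readers unfamiliar with the terminology, here is a minimal sketch of
what "0-offset" means for a producer. It is illustrative only: SketchArray
and the sketch_* helpers are hypothetical names, not part of the proposed
C interface, and the struct is a simplified stand-in rather than the
structure in the PR. It assumes the LSB-first validity bitmap layout of the
Arrow columnar format, and shows a producer rebasing a bitmap slice so that
the exported array can carry offset 0.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an array with a logical offset.
 * This is NOT the struct proposed in the PR. */
struct SketchArray {
  int64_t length;          /* number of logical elements */
  int64_t offset;          /* logical offset into the buffers, in elements */
  const uint8_t* validity; /* LSB-first validity bitmap, may be NULL */
};

/* Read bit i of an LSB-first bitmap. */
static int sketch_get_bit(const uint8_t* bits, int64_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

/* True when a consumer can ignore the offset field entirely. */
static int sketch_is_zero_offset(const struct SketchArray* a) {
  return a->offset == 0;
}

/* Copy `length` bits starting at bit `offset` of `src` into `dst`,
 * starting at bit 0. `dst` must hold at least (length + 7) / 8 bytes. */
static void sketch_rebase_bitmap(const uint8_t* src, int64_t offset,
                                 int64_t length, uint8_t* dst) {
  memset(dst, 0, (size_t)((length + 7) / 8));
  for (int64_t i = 0; i < length; ++i) {
    if (sketch_get_bit(src, offset + i)) {
      dst[i / 8] |= (uint8_t)(1u << (i % 8));
    }
  }
}
```

A producer holding a sliced array could call sketch_rebase_bitmap on the
validity buffer (and slice the value buffers by pointer arithmetic where
the element width is a whole number of bytes) before exporting, at the cost
of one copy of the bitmap.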


On Tue, Jan 21, 2020 at 4:05 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> Upon further reflection (and as I've noted on the PR), I think merging the
> ABI as a general feature of Arrow is preferable to making this be a
> subinterface of the C++ part of the project. While the offset field is
> awkward given its absence from the IPC spec, it's better to avoid
> fragmenting the community based on that field's absence or existence.
>
> Thanks for the lively discussion Antoine, Wes and others!
>
> J
>
> On Mon, Jan 20, 2020 at 11:09 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > Independent of the particulars of the discussion, the C++ project
> > needs to be free to create a C API for itself. If you want to try to
> > block the C++ contributors from doing this we may be barreling toward
> > a governance crisis in the project. I'm stepping back from this
> > discussion for a time now to allow others to catch up on the
> > discussion and to weigh in as needed.
> >
> > On Mon, Jan 20, 2020 at 1:00 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > >
> > > I don't see this as an endogenous concern of the C++ project. I
> > > appreciate your goal with saying so but I think this has broader
> > > ramifications around fragmentation of the project.
> > >
> > > The core challenge that we're dealing with is we introduced foundational
> > > concepts in some implementations that go beyond the spec and then
> > > provided useful features based on them (in this case, the offset
> > > concept). Ideally, those concepts are first introduced at the
> > > specification level so there aren't inconsistent viewpoints of what
> > > Arrow is (which I believe is what is happening here). Having a
> > > cross-language specification for in-memory processing is a new concept
> > > so it isn't surprising that we're going to learn these things along
> > > the way.
> > >
> > > Without this, we create a slippery slope of fragmentation between the
> > > specifications and the implementations. I understand that the
> > > toothpaste is out of the tube in this particular case. We can respond
> > > in two ways: stop the slip or continue to slide down the slope. I'm
> > > inclined to stop the slip.
> > >
> > > As I said on GitHub, I'm struggling with how much of this should be
> > > solved in the project. I'm going to pause a bit on responding to reflect
> > > further about this as well to reduce the likelihood that this devolves
> > > into a flame war (which is always a risk with complex issues such as
> > > these).
> > >
> > >
> > >
> > > On Mon, Jan 20, 2020 at 9:59 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > hi Jacques,
> > > >
> > > > Taking a step back from the discussion, the original problem statement
> > > > was to enable third party projects to produce the data structure used
> > > > by C++ Array classes in C without depending on the C++ code
> > > >
> > > > That's the ArrayData class here
> > > >
> > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L232
> > > >
> > > > It is important for us to simplify the programming interface with the C++
> > > > library, so I think that we should address this as an endogenous
> > > > concern of the C++ project, namely providing a "C API for the C++
> > > > project". The C API for the C++ library needs to mirror what's in the
> > > > C++ project (i.e. the ArrayData data structure). We should not
> > > > advertise this as being a part of the project specification.
> > > >
> > > > - Wes
> > > >
> > > > On Mon, Jan 20, 2020 at 11:51 AM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > >
> > > > > As I noted on the pull request, I think fundamentally this work is
> > > > > at odds with the Arrow specification and is being used to introduce
> > > > > a shadow specification.
> > > > >
> > > > > I don't think our intentions about how people should use something
> > > > > really influence how people will actually use or perceive it. They'll
> > > > > just find supported Arrow code and expose things based on it and call
> > > > > it "Arrow compatible". In other words, I don't think people in the
> > > > > outside world will be able to perceive the distinction between
> > > > > "Arrow C++ compatible" and "Arrow compatible".
> > > > >
> > > > > On Mon, Jan 20, 2020 at 9:28 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > > > hi folks,
> > > > > >
> > > > > > I just made a comment in https://github.com/apache/arrow/pull/6026
> > > > > > that I wanted to surface here on the mailing list.
> > > > > >
> > > > > > It seems that to reach consensus for a C interface that is
> > > > > > intended to be broadly used by multiple programming languages, we
> > > > > > may make some compromises that harm or outright undermine some of
> > > > > > the use cases that motivated the creation of the C interface in the
> > > > > > first place. That does not seem good. I wonder if it would be more
> > > > > > productive to reduce the scope of the project to merely providing a
> > > > > > C-header-based data interface to the C++ project only. That was the
> > > > > > original problem statement, and it seems that attempting to make it
> > > > > > useful beyond C++ has made it difficult to reach consensus.
> > > > > >
> > > > > > Thanks
> > > > > > Wes
> > > > > >
> > > > > > On Sat, Dec 21, 2019 at 4:38 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > > >
> > > > > > > Thanks for addressing my comments. I'm actively reviewing the
> > > > > > > proposal. It is taking me more time than I would like given the
> > > > > > > time of the year but I want to make sure that you know that I'm
> > > > > > > looking at it and hope to provide additional feedback beyond that
> > > > > > > which I've provided thus far on the PR. Will update soon.
> > > > > > >
> > > > > > > Thanks for your patience.
> > > > > > >
> > > > > > > On Tue, Dec 17, 2019 at 11:16 AM Antoine Pitrou <solip...@pitrou.net> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > Following Jacques's feedback, I drafted a new version of the C
> > > > > > > > data interface spec.
> > > > > > > >
> > > > > > > > The spec PR is here:
> > > > > > > > https://github.com/apache/arrow/pull/6040
> > > > > > > > Direct link to the RST file:
> > > > > > > > https://github.com/apache/arrow/blob/5d8669d371401f9db12326b079e13c0058ba972b/docs/source/format/CDataInterface.rst
> > > > > > > >
> > > > > > > > There is also a C++ implementation, together with a Python <-> R
> > > > > > > > bridge demonstrating the functionality:
> > > > > > > > https://github.com/apache/arrow/pull/6026
> > > > > > > >
> > > > > > > > The main change from the previous spec is that there are now two
> > > > > > > > C structures; one for the type or schema information, one for the
> > > > > > > > array or record batch data. This allows exchanging both kinds of
> > > > > > > > information independently (and so, potentially, to exchange schema
> > > > > > > > once and then multiple arrays or record batches).
> > > > > > > >
> > > > > > > > Comments and questions welcome.
> > > > > > > >
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Antoine.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> >
