Thanks Jacques. I agree that none of the ways forward on this problem are wholly satisfactory. We should encourage users of this C API to prefer emitting byte-aligned, zero-offset data, in line with the IPC spec, wherever possible. It will be interesting to see, after a period of time, how downstream projects are able to leverage this interface as part of their overall Arrow adoption.
On Tue, Jan 21, 2020 at 4:05 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> Upon further reflection (and as I've noted on the PR), I think merging the
> ABI as a general feature of Arrow is preferable to making this be a
> subinterface of the C++ part of the project. While the offset field is
> awkward given its absence from the IPC spec, it's better to avoid
> fragmenting the community based on that field's absence or existence.
>
> Thanks for the lively discussion Antoine, Wes and others!
>
> J
>
> On Mon, Jan 20, 2020 at 11:09 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > Independent of the particulars of the discussion, the C++ project
> > needs to be free to create a C API for itself. If you want to try to
> > block the C++ contributors from doing this we may be barreling toward
> > a governance crisis in the project. I'm stepping back from this
> > discussion for a time now to allow others to catch up on the
> > discussion and to weigh in as needed.
> >
> > On Mon, Jan 20, 2020 at 1:00 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > >
> > > I don't see this as an endogenous concern of the C++ project. I
> > > appreciate your goal with saying so but I think this has broader
> > > ramifications around fragmentation of the project.
> > >
> > > The core challenge that we're dealing with is we introduced foundational
> > > concepts in some implementations that go beyond the spec and then
> > > provided useful features based on them (in this case, the offset
> > > concept). Ideally, those concepts are first introduced at the
> > > specification level so there aren't inconsistent viewpoints of what
> > > Arrow is (which I believe is what is happening here). Having a
> > > cross-language specification for in-memory processing is a new concept
> > > so it isn't surprising that we're going to learn these things along
> > > the way.
> > >
> > > Without this, we create a slippery slope of fragmentation between the
> > > specifications and the implementations. I understand that the toothpaste
> > > is out of the tube in this particular case. We can respond in two ways:
> > > stop the slip or continue to slide down the slope. I'm inclined to stop
> > > the slip.
> > >
> > > As I said on GitHub, I'm struggling with how much of this should be
> > > solved in the project. I'm going to pause a bit on responding to reflect
> > > further about this as well to reduce the likelihood that this devolves
> > > into a flame war (which is always a risk with complex issues such as
> > > these).
> > >
> > > On Mon, Jan 20, 2020 at 9:59 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > hi Jacques,
> > > >
> > > > Taking a step back from the discussion, the original problem statement
> > > > was to enable third party projects to produce the data structure used
> > > > by C++ Array classes in C without depending on the C++ code.
> > > >
> > > > That's the ArrayData class here
> > > >
> > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L232
> > > >
> > > > It is important for us to simplify the programming interface with the
> > > > C++ library, so I think that we should address this as an endogenous
> > > > concern of the C++ project, namely providing a "C API for the C++
> > > > project". The C API for the C++ library needs to mirror what's in the
> > > > C++ project (i.e. the ArrayData data structure). We should not
> > > > advertise this as being a part of the project specification.
> > > >
> > > > - Wes
> > > >
> > > > On Mon, Jan 20, 2020 at 11:51 AM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > >
> > > > > As I noted on the pull request, I think fundamentally this work is
> > > > > at odds with the Arrow specification and being used to introduce a
> > > > > shadow specification.
> > > > >
> > > > > I don't think our intentions about how people should use something
> > > > > really influence how people will actually use or perceive it.
> > > > > They'll just find supported Arrow code and expose things based on it
> > > > > and call it "Arrow compatible". In other words, I don't think people
> > > > > in the outside world will be able to perceive the distinction
> > > > > between "Arrow C++ compatible" and "Arrow compatible".
> > > > >
> > > > > On Mon, Jan 20, 2020 at 9:28 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > > > hi folks,
> > > > > >
> > > > > > I just made a comment in https://github.com/apache/arrow/pull/6026
> > > > > > that I wanted to surface here on the mailing list.
> > > > > >
> > > > > > It seems that to reach consensus for a C interface that is
> > > > > > intended to be broadly used by multiple programming languages, we
> > > > > > may make some compromises that harm or outright undermine some of
> > > > > > the use cases that motivated the creation of the C interface in
> > > > > > the first place. That does not seem good. I wonder if it would be
> > > > > > more productive to reduce the scope of the project to merely
> > > > > > providing a C-header-based data interface to the C++ project only.
> > > > > > That was the original problem statement and it seems that
> > > > > > attempting to make it useful beyond C++ has made it difficult to
> > > > > > reach consensus.
> > > > > >
> > > > > > Thanks
> > > > > > Wes
> > > > > >
> > > > > > On Sat, Dec 21, 2019 at 4:38 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > > >
> > > > > > > Thanks for addressing my comments. I'm actively reviewing the
> > > > > > > proposal. It is taking me more time than I would like given the
> > > > > > > time of the year but I want to make sure that you know that I'm
> > > > > > > looking at it and hope to provide additional feedback beyond
> > > > > > > that which I've provided thus far on the PR. Will update soon.
> > > > > > >
> > > > > > > Thanks for your patience.
> > > > > > >
> > > > > > > On Tue, Dec 17, 2019 at 11:16 AM Antoine Pitrou <solip...@pitrou.net> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > Following Jacques's feedback, I drafted a new version of the C
> > > > > > > > data interface spec.
> > > > > > > >
> > > > > > > > The spec PR is here:
> > > > > > > > https://github.com/apache/arrow/pull/6040
> > > > > > > > Direct link to the RST file:
> > > > > > > > https://github.com/apache/arrow/blob/5d8669d371401f9db12326b079e13c0058ba972b/docs/source/format/CDataInterface.rst
> > > > > > > >
> > > > > > > > There is also a C++ implementation, together with a Python <-> R
> > > > > > > > bridge demonstrating the functionality:
> > > > > > > > https://github.com/apache/arrow/pull/6026
> > > > > > > >
> > > > > > > > The main change from the previous spec is that there are now
> > > > > > > > two C structures; one for the type or schema information, one
> > > > > > > > for the array or record batch data. This allows exchanging
> > > > > > > > both kinds of information independently (and so, potentially,
> > > > > > > > to exchange schema once and then multiple arrays or record
> > > > > > > > batches).
> > > > > > > >
> > > > > > > > Comments and questions welcome.
> > > > > > > >
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Antoine.