Re: [DISCUSS] Improving Arrow columnar implementation guidelines for third parties

Wes McKinney Tue, 17 Sep 2019 16:48:27 -0700

hi Micah,

On Mon, Sep 16, 2019 at 11:36 PM Micah Kornfield wrote:
>
> 1.  Are there particular issues that have cropped up that we should be
> aware of?  This might help inform how we go about this.


A simplified version of what is beginning to happen is the following.

A third-party project writes down the following structs:

// Implement data structures following the Arrow specification

struct Array {
  int64_t length;
  int64_t null_count;
  const uint8_t* valid_bits;
};

struct Int32Array : public Array {
  const uint8_t* data;
};

struct BinaryArray : public Array {
  const int32_t* offsets;
  const uint8_t* data;
};

etc.

Then there are some number of functions that create and act on these
data structures.

The question then is what can be asserted to users. If the statement
is "since this follows the Arrow spec you can benefit from other
projects that also use Arrow" I am concerned about what conclusions
users may draw from this. The details of how different projects that
"use Arrow" *actually* interoperate with each other may not be known
well except to insiders.
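
To make the concern concrete, here is a minimal sketch (not taken from
any real project) of accessors over the structs above. The helper names
IsValid and GetValue are hypothetical. The struct definitions alone do
not say that validity bits use least-significant-bit numbering, that a
missing bitmap means "no nulls", or that the offsets buffer holds
length + 1 entries. Two projects have to agree on all of these details
before they can exchange data.

```cpp
#include <cstdint>
#include <string>

// Same layout as the structs shown above.
struct Array {
  int64_t length;
  int64_t null_count;
  const uint8_t* valid_bits;
};

struct BinaryArray : public Array {
  const int32_t* offsets;
  const uint8_t* data;
};

// Hypothetical helper: slot i is valid when bit (i % 8) of byte (i / 8)
// is set (least-significant-bit numbering). A null bitmap pointer is
// treated as "every slot is valid".
inline bool IsValid(const Array& arr, int64_t i) {
  if (arr.valid_bits == nullptr) return true;
  return ((arr.valid_bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Hypothetical helper: slot i spans bytes [offsets[i], offsets[i + 1]),
// so the offsets buffer must hold length + 1 entries.
inline std::string GetValue(const BinaryArray& arr, int64_t i) {
  const int32_t start = arr.offsets[i];
  const int32_t end = arr.offsets[i + 1];
  return std::string(reinterpret_cast<const char*>(arr.data) + start,
                     static_cast<size_t>(end - start));
}
```

A project that reads the bitmap most-significant-bit first, or that
stores only `length` offsets, would still compile against these structs
and could still claim to "follow the spec", which is why integration
tests matter more than the struct definitions.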

> 2.  We should be publishing a matrix of current compliance with the
> standard for our existing implementations (this could be the basis of
> letting bespoke implementations clarify what they support).

Yes, I agree it would be helpful to have a table in the
documentation that implementations can keep up to date.

> 3.  I'm not sure I understand the exact conclusion one should draw by
> answering the three questions that are posed above.  People can be using
> one of the core Arrow implementations and still be using it incorrectly
> which would cause bugs.  Similarly, I'm not sure as an end-user what
> conclusion I should draw from "some level of" native arrow based processing?
>

Let's take an example:

* Dremio can execute SQL and uses Arrow as its native runtime format
* Apache Spark can execute SQL and offers UDF support with the Arrow
format, i.e. it uses Arrow for IO

Both of these projects can say that they "use Apache Arrow", but the
extent to which Arrow is a key ingredient may not be obvious to the
average onlooker. To have more "Arrow-native" systems seems like one
of the missions of the project.

> Thanks,
> Micah
>
>
> On Mon, Sep 16, 2019 at 7:22 AM Wes McKinney wrote:
>
> > hi folks,
> >
> > As Apache Arrow grows more popular, we may acquire some different
> > kinds of third party developers:
> >
> > A. Developers who use and, in many cases, contribute to one of the
> > project's reference implementations
> >
> > B. Developers who choose to implement the columnar format themselves,
> > without depending on any reference implementation
> >
> > There's nothing we can do to stop Category B developers, and in some
> > cases building a bespoke implementation may be the correct move.
> >
> > I'm concerned about the case of incomplete implementations that are
> > advertised as "using Arrow", "following the Arrow specification", or
> > "Arrow-compatible". An implementation is considered incomplete if it
> > does not pass the muster of our binary integration test suite (we will
> > eventually need to make this easier to run on third party libraries:
> > https://issues.apache.org/jira/browse/ARROW-6571).
> >
> > If an implementation does not have integration tests to prove
> > compliance, then advertisements regarding its level of compatibility
> > or trueness to the specification may mislead users. Problems that
> > arise from these situations may result in harm to the Arrow
> > community's reputation through no fault of our own.
> >
> > Since we can't force third parties to use any of the Arrow community's
> > code artifacts, one idea is to develop some form of "grading" system
> > to enable projects to self-report the nature of their use of the Arrow
> > columnar format to help answer such questions as:
> >
> > * Do you use a fully integration-tested implementation (e.g. I am only
> > aware of 4 such libraries at the moment -- our reference libraries in
> > C++, Java, JavaScript, and Go -- I understand that C# and Rust will
> > get there eventually)?
> > * If your project "supports Arrow" does that mean just "can serialize
> > data to/from Arrow" or something more?
> > * Does your project feature some level of "native" Arrow-based processing?
> >
> > A linear grading scale may not make sense, but having clear answers to
> > some of these questions in downstream projects' documentation would be
> > helpful.
> >
> > As Apache Arrow's brand grows in value, more and more projects will
> > use the brand in a "Powered By" way, and so I think it's important
> > that we help projects clearly communicate to their users to what
> > extent they employ the project.
> >
> > Thanks,
> > Wes
> >
