Re: [DISCUSS] C-level in-process array protocol

Jacques Nadeau Sun, 29 Sep 2019 10:59:56 -0700

On Sun, Sep 29, 2019 at 12:59 AM Antoine Pitrou <anto...@python.org> wrote:


>
> Le 29/09/2019 à 06:10, Jacques Nadeau a écrit :
> > * No dependency on Flatbuffers.
> > * No buffer reassembly (data is already exposed in logical Arrow format).
> > * Zero-copy by design.
> > * Easy to reimplement from scratch.
> >
> > I don't see how the flatbuffer pattern for data headers doesn't accomplish
> > all of these things. At its definition, it is a very simple representation of
> > data that could be worked with independently of the flatbuffers codebase.
> > It was designed so systems could map directly into that memory without
> > interacting with a flatbuffers library.
> >
> > Specifically the following three structures were designed to already allow
> > what I think this proposal is trying to recreate. All three are very simple
> > to construct in a direct, non-flatbuffer dependent read/write pattern.
>
> Are they?  Personally, I wouldn't know how to do that.  I don't know
> which encoding Flatbuffers use, whether it's C ABI-compatible (how could
> it be? if it's portable across different platforms, then it's probably
> not compatible with any particular platform's C ABI, or only as a
> coincidence), how I'm supposed to make use of the "offset" field, or
> what the lifetime / ownership of all this data is.
>
> I may be missing something, but if the answer is that it's easy to
> reimplement Flatbuffers' encoding without relying on the Flatbuffers
> project's source code, I'm a bit skeptical.
>
> Regards
>
> Antoine.
>


You're talking about three separate things:
1) How do I communicate schema
2) How do I communicate encoded batches with zero copy
3) How do I define rules around ownership semantics

Flatbuffers isn't trying to solve #3 but it is definitely trying to solve #1
and #2. Flatbuffers is a formal in-memory specification of how data is laid
out, just like Arrow. While some constructs are more complex to construct,
the record batch construct was designed specifically to avoid those
patterns. That's why we only use structs within that construct.

"...it's easy to reimplement Flatbuffers' encoding...I'm a bit skeptical"

Let's evaluate that path before dismissing it.
https://google.github.io/flatbuffers/md__internals.html

If I remember correctly, the structure of a flatbuffers record batch is:

[int8-]
[uint4-number of field nodes]
  [int8-length][int8-null count]
  [int8-length][int8-null count]
  ...
[uint4-number of buffers]
  [int8-memory address][int8-length of buffer]
  [int8-memory address][int8-length of buffer]
  ...
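
As a rough illustration, the layout recalled above can be packed and parsed
with nothing but Python's standard struct module. This is a sketch of the
layout as remembered in this message, not of the actual Flatbuffers wire
format, which has further details described on the internals page linked
above. Assumptions: little-endian, no padding, "int8" and "uint4" mean an
8-byte signed and a 4-byte unsigned integer, and the unlabeled leading field
is carried as an opaque value. The names pack_header and parse_header are
hypothetical.

```python
import struct

def pack_header(leading, nodes, buffers):
    """Pack the recalled layout: a leading int64, then two counted lists of int64 pairs."""
    out = bytearray(struct.pack("<q", leading))   # unlabeled [int8-] field, kept opaque
    out += struct.pack("<I", len(nodes))          # [uint4-number of field nodes]
    for length, null_count in nodes:
        out += struct.pack("<qq", length, null_count)
    out += struct.pack("<I", len(buffers))        # [uint4-number of buffers]
    for address, size in buffers:
        out += struct.pack("<qq", address, size)
    return bytes(out)

def parse_header(data):
    """Read the same layout back using fixed strides, with no Flatbuffers library."""
    pos = 0
    (leading,) = struct.unpack_from("<q", data, pos)
    pos += 8
    (n_nodes,) = struct.unpack_from("<I", data, pos)
    pos += 4
    nodes = [struct.unpack_from("<qq", data, pos + 16 * i) for i in range(n_nodes)]
    pos += 16 * n_nodes
    (n_buffers,) = struct.unpack_from("<I", data, pos)
    pos += 4
    buffers = [struct.unpack_from("<qq", data, pos + 16 * i) for i in range(n_buffers)]
    return leading, nodes, buffers

# One field node (100 values, 3 nulls) and two buffers given as (address, length).
header = pack_header(0, [(100, 3)], [(0, 13), (16, 400)])
leading, nodes, buffers = parse_header(header)
```

Because every entry is a fixed-size pair, each one sits at a position that
can be computed from the counts alone, which is the property the struct-only
design is after.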

The offset field was meant to be an offset within whatever memory space you
are in. In a single memory space, that is equivalent to a memory address.
Given a known schema, you have a known address for all other accesses across
all batches without any specialized reading.
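
To make the offset idea concrete, here is a small Python sketch that turns
(offset, length) pairs into zero-copy views over one memory region. It is
only an analogy: the region stands in for the shared memory space, so an
offset into it plays the role a raw address would play in-process. The name
resolve_buffers is hypothetical, and the bounds check is an addition of this
sketch rather than anything stated in this thread.

```python
def resolve_buffers(region, buffers):
    """Return zero-copy views for (offset, length) pairs inside one memory region."""
    view = memoryview(region)
    views = []
    for offset, length in buffers:
        # Reject any pair that would reach outside the region.
        if offset < 0 or length < 0 or offset + length > len(view):
            raise ValueError("buffer lies outside the region")
        views.append(view[offset:offset + length])  # slicing a memoryview does not copy
    return views

region = bytearray(b"abcdefgh")
(values,) = resolve_buffers(region, [(2, 3)])
region[2] = ord("X")  # a write to the region is visible through the view
```

After the write, bytes(values) reflects the change, which shows that the
view aliases the region instead of holding a copy.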

It seems like you're saying: "flatbuffers is too complex an encoding, let's
create a new encoding". I agree for #1 but don't agree for #2 and think we
should do our best to avoid an alternative for #2. We don't have something
that defines #3 so that is clearly a gap. The definition should be done
with all languages in mind. I also am generally against implementing a new
terse encoding method and specialized parsing.

To introduce this, I also think we need to have reference
tools/implementation/integration tests for both C++ and Java.
