On Fri, Aug 13, 2021 at 8:03 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> The requirements for the compute IR as I see it are:
> >
> > * Implementations in IR producer and consumer languages.
> > * Strongly typed or the ability to easily validate a payload
> >
>
> What about:
>
> 1. easy to read and write by a large number of programming languages
>

Personally, I do not care about the speed of IR processing right now.
Any non-trivial (and probably trivial too) computation done
by an IR consumer will dwarf the cost of IR processing. Of course,
we shouldn't prematurely pessimize either, but there's no reason
to spend time worrying about IR processing performance in my opinion (yet).


> 2. easy to read and write by humans
>

I think this is where I differ. Would you accept

"easy to transform into something that can be read and written by humans"

?

For example, you can turn a flatbuffer blob into its JSON equivalent using
a few command line flags passed to flatc.

That way, the IR can be flatbuffers, but if at any point someone wants to
look at something other than a meaningless blob of bytes, they can.
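
For concreteness, here is roughly what that workflow could look like driven
from Rust (a sketch only: the flatc flags are from memory and may differ by
version, and ComputeIR.fbs / plan.bin are hypothetical file names, not part
of any proposal):

    use std::process::Command;

    fn main() -> std::io::Result<()> {
        // Ask flatc to render the binary IR payload as JSON for inspection.
        // --raw-binary allows buffers that carry no file_identifier.
        let status = Command::new("flatc")
            .args(["--json", "--raw-binary", "ComputeIR.fbs", "--", "plan.bin"])
            .status()?;
        assert!(status.success(), "flatc failed");
        // On success flatc writes plan.json next to the input.
        Ok(())
    }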


> 3. fast to validate by a large number of programming languages
>

I guess it depends on what "fast" means here, as well as the programming
language and implementation of the validator. In my view, this falls under
"let's not worry about performance yet". To that point, I think a structured
format like protobuf or flatbuffers lets us punt on performance for now. A
counter-argument might be "if we're punting on performance, then why not
pick the one that's easiest to debug?" My only answer to that is reuse of
existing flatbuffers types, which requires some work (at some point) to
figure out how to distribute the generated code. With JSON/TOML/YAML we
would have to build that ourselves. Maybe it's not a lot of effort, but I
guess my inclination is to write more CI code rather than library code, if
that's an option :)


>
> I.e. make the ability to read and write by humans be more important than
> speed of validation.


I think I differ on whether the IR should be easy to read and write by
humans.
IR is going to be predominantly read and written by machines, though of
course
we will need a way to inspect it for debugging.


>
> In this order, JSON/toml/yaml are preferred because they are supported by
> more languages and more human readable than faster to validate.
>
> -----
>
> My understanding is that for an async experience, we need the ability to
> `.await` at any `read_X` call so that if the read_X requests more bytes
> than are buffered, the `read_X(...).await` triggers a new (async) request
> to fill the buffer (which puts the future on a Pending state). When a
> library does not offer the async version of `read_X`, any read_X can force
> a request to fill the buffer, which is now blocking the thread. One way
> around this is to wrap those blocking calls in async (e.g. via
> tokio::spawn_blocking). However, this forces users to use that runtime, or
> to create a new independent thread pool for their own async work. Neither
> are great for low-level libraries.
>
>
I think I'm still missing something here.

You can asynchronously read arbitrary byte sequences from a wide variety
of IO sources and then parse the bytes into the desired format.

I don't follow why that isn't sufficient to take advantage of async.
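
Concretely, I'm picturing something like the following (a minimal sketch;
`root_as_compute_ir` is a hypothetical generated flatbuffers accessor, not
an existing API):

    use tokio::io::AsyncReadExt;

    async fn read_ir_bytes(path: &str) -> std::io::Result<Vec<u8>> {
        // All of the IO (and therefore all of the awaiting) happens here.
        let mut file = tokio::fs::File::open(path).await?;
        let mut buf = Vec::new();
        file.read_to_end(&mut buf).await?;
        Ok(buf)
    }

    // Parsing the fully buffered bytes is then purely synchronous, e.g.
    // let plan = root_as_compute_ir(&buf)?;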

A library like tonic, for example, doesn't require that prost implement
async APIs (I still don't know what that would mean for an in-memory
format), yet tonic takes full advantage of async. In fact, I think it's
_only_ async.

I could understand the desire for a library to provide something like a
capital-S
Stream<Item = Message> where the bytes are consumed asynchronously. Is that
what you're after here?
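
To be explicit about the shape I have in mind, something like this (a sketch
only; the length-delimited framing is my own assumption here, not part of
any proposal):

    use futures::StreamExt;
    use tokio_util::codec::{FramedRead, LengthDelimitedCodec};

    async fn consume(reader: impl tokio::io::AsyncRead + Unpin) {
        // Frames arrive asynchronously; turning each frame's bytes into an
        // IR message is still an ordinary synchronous parse.
        let mut frames = FramedRead::new(reader, LengthDelimitedCodec::new());
        while let Some(frame) = frames.next().await {
            let _bytes = frame.expect("io error");
            // parse `_bytes` into a message here
        }
    }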


> E.g. thrift does not offer async -> parquet-format-rs does not offer async
> -> parquet does not offer async -> datafusion wraps all parquet "IO-bounded
> and CPU-bounded operations" in spawn_blocking or something equivalent.


> Best,
> Jorge
>
>
> On Thu, Aug 12, 2021 at 10:03 PM Phillip Cloud <cpcl...@gmail.com> wrote:
>
> > On Thu, Aug 12, 2021 at 1:03 PM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> > > I agree with Antoine that we should weigh the pros and cons of
> > flatbuffers
> > > (or protobuf or thrift for that matter) over a more human-friendly,
> > > simpler, format like json or MsgPack. I also struggle a bit to reason
> > with
> > > the complexity of using flatbuffers for this.
> > >
> >
> > Ultimately I think different representations of the format will emerge if
> > compute IR is successful,
> > and people will implement JSON/proto/thrift/etc versions of the IR.
> >
> > The requirements for the compute IR as I see it are:
> >
> > * Implementations in IR producer and consumer languages.
> > * Strongly typed or the ability to easily validate a payload
> >
> > It seems like Protobuf, Flatbuffers and JSON all meet the criteria here.
> > Beyond that,
> > there's precedence in the codebase for flatbuffers (which is just to say
> > that flatbuffers
> > is the devil we know).
> >
> > Can people list other concrete requirements for the format? A
> > non-requirement might
> > be that there be _idiomatic_ implementations for every language arrow
> > supports, for example.
> >
> > I think without agreement on requirements we won't ever arrive at
> > consensus.
> >
> > The compute IR spec itself doesn't really depend on the specific choice
> of
> > format, but we
> > need to get some consensus on the format.
> >
> >
> > > E.g. there is no async support for thrift, flatbuffers nor protobuf in
> > > Rust, which e.g. means that we can't read neither parquet nor arrow IPC
> > > async atm. These problems are usually easier to work around in simpler
> > > formats.
> > >
> >
> > Can you elaborate a bit on the lack of async support here and what it
> would
> > mean for
> > a particular in-memory representation to support async, and why that
> > prevents reading
> > a parquet file using async?
> >
> > Looking at JSON as an example, most libraries in the Rust ecosystem use
> > serde and serde_json
> > to serialize and deserialize JSON, and any async concerns occur at the
> > level of
> > a client/server library like warp (or some transitive dependency thereof
> > like Hyper).
> >
> > Are you referring to something like the functionality implemented in
> > tokio-serde-json? If so,
> > I think you could probably build something for these other formats
> assuming
> > they have serde
> > support (flatbuffers notably does _not_, partially because of its
> incessant
> > need to own everything),
> > since tokio_serde is doing most of the work in tokio-serde-json. In any
> > case, I don't think
> > it's a requirement for the compute IR that there be a streaming transport
> > implementation for the
> > format.
> >
> >
> > >
> > > Best,
> > > Jorge
> > >
> > >
> > >
> > > On Thu, Aug 12, 2021 at 2:43 PM Antoine Pitrou <anto...@python.org>
> > wrote:
> > >
> > > >
> > > > > On 12/08/2021 at 15:05, Wes McKinney wrote:
> > > > > It seems that one adjacent problem here is how to make it simpler
> for
> > > > > third parties (especially ones that act as front end interfaces) to
> > > > > build and serialize/deserialize the IR structures with some kind of
> > > > > ready-to-go middleware library, written in a language like C++.
> > > >
> > > > A C++ library sounds a bit complicated to deal with for Java, Rust,
> Go,
> > > > etc. developers.
> > > >
> > > > I'm not sure which design decision and set of compromises would make
> > the
> > > > most sense.  But this is why I'm asking the question "why not JSON?"
> (+
> > > > JSON-Schema if you want to ease validation by third parties).
> > > >
> > > > (note I have already mentioned MsgPack, but only in the case a binary
> > > > encoding is really required; it doesn't have any other advantage
> that I
> > > > know of over JSON, and it's less ubiquitous)
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > > To do that, one would need the equivalent of arrow/type.h and
> related
> > > > > Flatbuffers schema serialization code that lives in arrow/ipc. If
> you
> > > > > want to be able to completely and accurately serialize Schemas, you
> > > > > need quite a bit of code now.
> > > > >
> > > > > One possible approach (and not go crazy) would be to:
> > > > >
> > > > > * Move arrow/types.h and its dependencies into a standalone C++
> > > > > library that can be vendored into the main apache/arrow C++
> library.
> > I
> > > > > don't know how onerous arrow/types.h's transitive dependencies /
> > > > > interactions are at this point (there's a lot of stuff going on in
> > > > > type.cc [1] now)
> > > > > * Make the namespaces exported by this library configurable, so any
> > > > > library can vendor the Arrow types / IR builder APIs privately into
> > > > > their project
> > > > > * Maintain this "Arrow types and ComputeIR library" as an always
> > > > > zero-dependency library to facilitate vendoring
> > > > > * Lightweight bindings in languages we care about (like Python or R
> > or
> > > > > GLib/Ruby) could be built to the IR builder middleware library
> > > > >
> > > > > This seems like what is more at issue compared with rather projects
> > > > > are copying the Flatbuffers files out of their project from
> > > > > apache/arrow or apache/arrow-format.
> > > > >
> > > > > [1]:
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
> > > > >
> > > > > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb <al...@influxdata.com>
> > > > wrote:
> > > > >>
> > > > >> I support the idea of an independent repo that has the arrow
> > > flatbuffers
> > > > >> format definition files.
> > > > >>
> > > > >> My rationale is that the Rust implementation has a copy of the
> > > `format`
> > > > >> directory [1] and potential drift worries me (a bit). Having a
> > single
> > > > >> source of truth for the format that is not part of the large mono
> > repo
> > > > >> would be a good thing.
> > > > >>
> > > > >> Andrew
> > > > >>
> > > > >> [1] https://github.com/apache/arrow-rs/tree/master/format
> > > > >>
> > > > >> On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud <cpcl...@gmail.com>
> > > > wrote:
> > > > >>
> > > > >>> Hi all,
> > > > >>>
> > > > >>> I'd like to bring up an idea from a recent thread ([1]) about
> > moving
> > > > the
> > > > >>> `format/` directory out of the primary apache/arrow repository.
> > > > >>>
> > > > >>> I understand from that thread there are some concerns about using
> > > > >>> submodules,
> > > > >>> and I definitely sympathize with those concerns.
> > > > >>>
> > > > >>> In talking with David Li (disclaimer: we work together at Voltron
> > > > Data), he
> > > > >>> has
> > > > >>> a great idea that I think makes everyone happy: an
> > > > `apache/arrow-format`
> > > > >>> repository that is the official mirror for the flatbuffers IDL,
> > that
> > > > >>> library
> > > > >>> authors should use as the source of truth.
> > > > >>>
> > > > >>> It doesn't require a submodule, yet it also allows external
> > projects
> > > > the
> > > > >>> ability to access the IDL without having to interact with the
> main
> > > > arrow
> > > > >>> repository and is backwards compatible to boot.
> > > > >>>
> > > > >>> In this scenario, repositories that are currently copying in the
> > > > >>> flatbuffers
> > > > >>> IDL can migrate to this repository at their leisure.
> > > > >>>
> > > > >>> My motivation for this was around sharing data structures for the
> > > > compute
> > > > >>> IR
> > > > >>> proposal ([2]).
> > > > >>>
> > > > >>> I can think of at least two ways for IR producers and consumers
> of
> > > all
> > > > >>> languages to share the flatbuffers IDL:
> > > > >>>
> > > > >>> 1. A set of bindings built in some language that other languages
> > can
> > > > >>> integrate
> > > > >>>     with, likely C++, that allows library users to build IR using
> > an
> > > > API.
> > > > >>>
> > > > >>> The primary downside to this is that we'd have to deal with
> > > > >>> building another library while working out any kinks in the IR
> > design
> > > > and
> > > > >>> I'd
> > > > >>> rather avoid that in the initial phases of this project.
> > > > >>>
> > > > >>> The benefit is that IR components don't interact much with
> > > > `flatbuffers` or
> > > > >>> `flatc` directly.
> > > > >>>
> > > > >>> 2. A single location where the format lives, that doesn't require
> > > > depending
> > > > >>> on
> > > > >>>     a large multi-language repository to access a handful of
> files.
> > > > >>>
> > > > >>> I think the downside to this is that there's a bit of additional
> > > > >>> infrastructure
> > > > >>> to automate copying in `arrow-format`.
> > > > >>>
> > > > >>> The benefit there is that producers and consumers can immediately
> > > start
> > > > >>> getting
> > > > >>> value from compute IR without having to wait for development of a
> > new
> > > > API.
> > > > >>>
> > > > >>> One counter-proposal might be to just put the compute IR IDL in a
> > > > separate
> > > > >>> repo,
> > > > >>> but that isn't tenable because the compute IR needs arrow's type
> > > > >>> information
> > > > >>> contained in `Schema.fbs`.
> > > > >>>
> > > > >>> I was hoping to avoid conflating the discussion about bindings vs
> > > > direct
> > > > >>> flatbuffer usage (at least initially just supporting one, I
> predict
> > > > we'll
> > > > >>> need
> > > > >>> both ultimately) with the decision about whether to split out the
> > > > format
> > > > >>> directory, but it's a good example of a choice for which
> splitting
> > > out
> > > > the
> > > > >>> format directory would be well-served.
> > > > >>>
> > > > >>> I'll note that this doesn't block anything on the compute IR
> side,
> > > just
> > > > >>> wanted
> > > > >>> to surface this in a parallel thread and see what folks think.
> > > > >>>
> > > > >>> [1]:
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E
> > > > >>> [2]:
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l
> > > > >>>
> > > >
> > >
> >
>
