Re: [DISCUSS] Splitting out the Arrow format directory

Keith Kraus Fri, 13 Aug 2021 11:57:58 -0700

> Personally, I do not care about the speed of IR processing right now.
> Any non-trivial (and probably trivial too) computation done
> by an IR consumer will dwarf the cost of IR processing. Of course,
> we shouldn't prematurely pessimize either, but there's no reason
> to spend time worrying about IR processing performance in my opinion
> (yet).


In other processing engines I've fairly commonly seen situations where the
time to build the compute graph becomes non-negligible, sometimes even more
expensive than the computation itself. I've even seen attempts to build the
graph iteratively while executing, in order to overlap the cost of building
the graph with the compute execution.

There's been a huge amount of effort put into optimizing critical kernel
components like the hash table implementation in order to make Arrow the
most performant analytical library possible. Architecting and designing the
IR implementation without performance in mind from the beginning could
potentially put us into a difficult situation later that we'd have to
invest considerably more effort to work our way out of.

On Fri, Aug 13, 2021 at 2:30 PM Weston Pace <weston.p...@gmail.com> wrote:

> I believe you would need a JSON compatible version of the type system
> (including binary values) because you'd need to at least encode
> literals.  However, I don't think that creating a human readable
> encoding of the Arrow type system is a bad thing in and of itself.  We
> have tickets and get questions occasionally asking for a JSON format.
> This could at least be a step in that direction.  I don't think you'd
> need to add support for arrays/batches/tables.  Note, the C++
> implementation has a JSON format that is used for testing purposes
> (though I do not believe it is comprehensive).
>
> I think we could add two (potentially conflicting) requirements
>  * Low barrier to entry for consumers
>  * Low barrier to entry for producers
>
> JSON/YAML seem to lower the barrier to entry for producers.  Some
> producers may not even be working with Arrow data (e.g. could one go
> from SQL-literal -> JSON-literal skipping an intermediate
> Arrow-literal step?).  I think we've also dismissed Antoine's earlier
> point which I found the most compelling.  Handling flatbuffers adds
> one more step that people have to integrate into their build systems.
>
> Flatbuffers on the other hand lowers the barrier to entry for
> consumers.  A consumer is likely already going to have flatbuffers
> support built in so that they can read/write IPC files.  If we adopt
> JSON then the consumer will have to add support for a new file format
> (or at least part of one).
>
> On Fri, Aug 13, 2021 at 6:46 AM Jacob Quinn <quinn.jac...@gmail.com>
> wrote:
> >
> > >
> > > I just thought of one other requirement: the format needs to support
> > > arbitrary byte sequences.
> > >
> > Can you clarify why this is needed? Is it that custom_metadata maps should
> > allow byte sequences as values?
> >
> > On Fri, Aug 13, 2021 at 10:00 AM Phillip Cloud <cpcl...@gmail.com>
> > wrote:
> >
> > > On Fri, Aug 13, 2021 at 11:43 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > > >
> > > > Le 13/08/2021 à 17:35, Phillip Cloud a écrit :
> > > > >
> > > > > >> I.e. make the ability to read and write by humans be more
> > > > > >> important than speed of validation.
> > > > >
> > > > > > I think I differ on whether the IR should be easy to read and
> > > > > > write by humans. IR is going to be predominantly read and written
> > > > > > by machines, though of course we will need a way to inspect it
> > > > > > for debugging.
> > > >
> > > > > But the code executed by machines is written by humans.  I think
> > > > > that's mostly where the contention resides: is it easy to code, in
> > > > > any given language, the routines required to produce or consume the
> > > > > IR?
> > > >
> > >
> > > > Definitely not for flatbuffers, since flatbuffers is IMO annoying to
> > > > use in any language except C++, and it's borderline annoying there
> > > > too. Protobuf is similar (less annoying in Rust, but still annoying
> > > > in Python and C++ IMO), though I think any binary format is going to
> > > > be less human-friendly, by construction.
> > >
> > > > If we were to use something like JSON or msgpack, can someone sketch
> > > > out the interaction between the IR and the rest of arrow's type
> > > > system?
> > >
> > > > Would we need a JSON-encoded-arrow-type -> in-memory representation
> > > > for an Arrow type in a given language?
> > >
> > > > I just thought of one other requirement: the format needs to support
> > > > arbitrary byte sequences. JSON doesn't support untransformed byte
> > > > sequences, though it's not uncommon to base64-encode a byte sequence.
> > > > IMO that adds an unnecessary layer of complexity, which is another
> > > > tradeoff to consider.
> > >
>
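
To make the byte-sequence point quoted above concrete, here is a minimal
sketch of the base64 workaround, using only Python's standard library. The
JSON layout (the "type" and "value" keys) and the helper names are
hypothetical, chosen for illustration; nothing here is a proposed Arrow IR
encoding.

```python
import base64
import json


def encode_binary_literal(raw: bytes) -> str:
    """Serialize a binary literal as JSON by base64-encoding the bytes.

    The "type"/"value" keys are a hypothetical layout, not an Arrow format.
    """
    payload = base64.b64encode(raw).decode("ascii")
    return json.dumps({"type": "binary", "value": payload})


def decode_binary_literal(text: str) -> bytes:
    """Reverse encode_binary_literal: parse the JSON, then base64-decode."""
    doc = json.loads(text)
    return base64.b64decode(doc["value"])


# Four bytes that are not valid UTF-8, so they cannot be passed off as a string.
raw = bytes([0x00, 0xFF, 0x10, 0x80])

# The standard json module refuses raw bytes outright.
try:
    json.dumps({"value": raw})
    direct_ok = True
except TypeError:
    direct_ok = False

# With the base64 layer, the bytes survive a round trip through JSON text.
encoded = encode_binary_literal(raw)
decoded = decode_binary_literal(encoded)
```

Every producer and consumer has to agree on this extra transform, and
base64 grows the payload by roughly a third, which is the added complexity
the quoted message refers to.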
