Hi Paddy,

Thanks for raising this.

Ballista defines computations using protobuf [1] to describe logical and
physical query plans, which consist of operators and expressions. It is
actually based on the Gandiva protobuf [2] for describing expressions.

I see a lot of value in standardizing some of this across implementations.
Ballista is essentially becoming a distributed scheduler for Arrow and can
work with any implementation that supports this protobuf definition of
query plans. It would also make it easier to embed C++ in Rust, or Rust in
C++, with this common IR, so I would be all for having something like this
as an Arrow specification.

Thanks,

Andy

[1] https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto
[2] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto

On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com> wrote:

> Hi All,
>
> I do not have a computer science background, so I may not be asking this
> in the correct way or using the correct terminology, but I wonder if we
> can achieve some level of standardization when describing computation
> over Arrow data.
>
> At the moment, on the Rust side, DataFusion clearly has a way to describe
> computation; I believe that Ballista adds the ability to serialize this
> to allow distributed computation. On the C++ side, work is starting on a
> similar query engine, and we already have Gandiva. Is there an
> opportunity to define a kind of IR for computation over Arrow data that
> could be adopted across implementations?
>
> In this case, DataFusion could easily incorporate Gandiva to generate
> optimized compute kernels if they were using the same IR to describe
> computation. Applications built on Arrow could "describe" computation in
> any language and take advantage of innovations across the community;
> adding this to Arrow's zero-copy data sharing could be a game changer in
> my mind. I'm not someone who knows enough to drive this forward, but I
> obviously would like to get involved. For some time I was playing around
> with using TVM's Relay IR [1] and applying it to Arrow data.
>
> As the Arrow memory format has now matured, I feel like this could be
> the next step. Is there any plan for this kind of work, or are we going
> to allow sub-projects to "go their own way"?
>
> Thanks,
> Paddy
>
> [1] Introduction to Relay IR - tvm 0.8.dev0 documentation (apache.org)
> <https://tvm.apache.org/docs/dev/relay_intro.html>
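P.S. To make the idea of a shared plan IR a little more concrete, here is a
minimal Rust sketch of the kind of structure such an IR tends to have
(operators wrapping expressions wrapping child plans). The type and field
names below are purely illustrative assumptions; they are not the actual
definitions from ballista.proto or Gandiva's Types.proto.

// Illustrative sketch only: hypothetical types showing what a minimal,
// engine-agnostic plan IR might look like. NOT the actual ballista.proto
// or Gandiva Types.proto definitions.

/// A scalar expression over Arrow columns.
#[derive(Debug)]
enum Expr {
    /// Reference a column by name.
    Column(String),
    /// A literal 64-bit integer value.
    LiteralInt64(i64),
    /// A binary operation such as "gt" or "plus" over two sub-expressions.
    BinaryOp {
        op: String,
        left: Box<Expr>,
        right: Box<Expr>,
    },
}

/// A relational operator in a logical query plan.
#[derive(Debug)]
enum LogicalPlan {
    /// Scan a table, reading only the listed columns.
    Scan { table: String, projection: Vec<String> },
    /// Keep only rows for which the predicate evaluates to true.
    Filter { predicate: Expr, input: Box<LogicalPlan> },
    /// Compute a new set of output expressions from the input.
    Projection { exprs: Vec<Expr>, input: Box<LogicalPlan> },
}

fn main() {
    // Roughly: SELECT id FROM events WHERE amount > 100
    let plan = LogicalPlan::Projection {
        exprs: vec![Expr::Column("id".to_string())],
        input: Box::new(LogicalPlan::Filter {
            predicate: Expr::BinaryOp {
                op: "gt".to_string(),
                left: Box::new(Expr::Column("amount".to_string())),
                right: Box::new(Expr::LiteralInt64(100)),
            },
            input: Box::new(LogicalPlan::Scan {
                table: "events".to_string(),
                projection: vec!["id".to_string(), "amount".to_string()],
            }),
        }),
    };

    // In practice a tree like this would be encoded (e.g. via protobuf) and
    // handed to any engine -- DataFusion, Gandiva, a C++ query engine --
    // that understands the shared IR.
    println!("{:#?}", plan);
}

The point of the sketch is only that the IR is a plain tree of operators and
expressions, independent of any one engine, so serializing it (as Ballista
does with protobuf) is what lets different implementations exchange plans.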