Hi Paddy,

Thanks for raising this.

Ballista defines computations using protobuf [1] to describe logical and
physical query plans, which consist of operators and expressions. It is
actually based on the Gandiva protobuf [2] for describing expressions.

I see a lot of value in standardizing some of this across implementations.
Ballista is essentially becoming a distributed scheduler for Arrow and can
work with any implementation that supports this protobuf definition of
query plans. It would also make it easier to embed C++ in Rust, or Rust in
C++, with this common IR, so I would be all for having something like this
as an Arrow specification.

Thanks,

Andy

[1] https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto
[2] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto

On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com> wrote:

> Hi All,
>
> I do not have a computer science background, so I may not be asking this
> in the correct way or using the correct terminology, but I wonder if we
> can achieve some level of standardization when describing computation
> over Arrow data.
>
> At the moment, on the Rust side, DataFusion clearly has a way to describe
> computation; I believe that Ballista adds the ability to serialize this
> to allow distributed computation. On the C++ side, work is starting on a
> similar query engine, and we already have Gandiva. Is there an
> opportunity to define a kind of IR for computation over Arrow data that
> could be adopted across implementations?
>
> In this case, DataFusion could easily incorporate Gandiva to generate
> optimized compute kernels if they were using the same IR to describe
> computation. Applications built on Arrow could "describe" computation in
> any language and take advantage of innovations across the community;
> adding this to Arrow's zero-copy data sharing could be a game changer in
> my mind. I'm not someone who knows enough to drive this forward, but I
> obviously would like to get involved. For some time I was playing around
> with using TVM's Relay IR [1] and applying it to Arrow data.
>
> As the Arrow memory format has now matured, I feel like this could be
> the next step. Is there any plan for this kind of work, or are we going
> to allow sub-projects to "go their own way"?
>
> Thanks,
> Paddy
>
> [1] Introduction to Relay IR - tvm 0.8.dev0 documentation (apache.org)
> <https://tvm.apache.org/docs/dev/relay_intro.html>
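P.S. To make the idea of a shared plan IR a little more concrete, here is a
minimal Rust sketch of the kind of structure such an IR tends to have
(operators wrapping expressions wrapping child plans). The type and field
names below are purely illustrative assumptions; they are not the actual
definitions from ballista.proto or Gandiva's Types.proto.

// Illustrative sketch only: hypothetical types showing what a minimal,
// engine-agnostic plan IR might look like. NOT the actual ballista.proto
// or Gandiva Types.proto definitions.

/// A scalar expression over Arrow columns.
#[derive(Debug)]
enum Expr {
    /// Reference a column by name.
    Column(String),
    /// A literal 64-bit integer value.
    LiteralInt64(i64),
    /// A binary operation such as "gt" or "plus" over two sub-expressions.
    BinaryOp {
        op: String,
        left: Box<Expr>,
        right: Box<Expr>,
    },
}

/// A relational operator in a logical query plan.
#[derive(Debug)]
enum LogicalPlan {
    /// Scan a table, reading only the listed columns.
    Scan { table: String, projection: Vec<String> },
    /// Keep only rows for which the predicate evaluates to true.
    Filter { predicate: Expr, input: Box<LogicalPlan> },
    /// Compute a new set of output expressions from the input.
    Projection { exprs: Vec<Expr>, input: Box<LogicalPlan> },
}

fn main() {
    // Roughly: SELECT id FROM events WHERE amount > 100
    let plan = LogicalPlan::Projection {
        exprs: vec![Expr::Column("id".to_string())],
        input: Box::new(LogicalPlan::Filter {
            predicate: Expr::BinaryOp {
                op: "gt".to_string(),
                left: Box::new(Expr::Column("amount".to_string())),
                right: Box::new(Expr::LiteralInt64(100)),
            },
            input: Box::new(LogicalPlan::Scan {
                table: "events".to_string(),
                projection: vec!["id".to_string(), "amount".to_string()],
            }),
        }),
    };

    // In practice a tree like this would be encoded (e.g. via protobuf) and
    // handed to any engine -- DataFusion, Gandiva, a C++ query engine --
    // that understands the shared IR.
    println!("{:#?}", plan);
}

The point of the sketch is only that the IR is a plain tree of operators and
expressions, independent of any one engine, so serializing it (as Ballista
does with protobuf) is what lets different implementations exchange plans.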