On Mon, Jul 6, 2020 at 2:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
> I would also be interested in having a reusable serialized format for > filter- and projection-like expressions. I think trying to go so far > as full logical query plans suitable for building a SQL engine is > perhaps a bit too far but we could start small with the use case from > the JNI Datasets PR as a motivating example. We should also consider > replacing or deprecating Gandiva's serialized expressions in favor of > something more general. > Gandiva's representation was very much focused on being general. It attempts to be a generalized way to express what is known in Calcite as RowExpressions (or Rex for short). This is separate from relational expressions. It was built after having both defined a language independent graph representation for early versions of Drill [1] and then working extensively with Calcite and various types of expression transformations. I don't think a second representation should be considered until there is very strong proof that the Gandiva approach is somehow architecturally limited and/or cannot be modified/improved to solve new needs. > It may be a slight bikeshed issue, but I wouldn't be thrilled about > having this be based on Protocol Buffers, because of the runtime > requirement (on libprotobuf.so / libprotobuf.a) it introduces into C++ > applications. Flatbuffers might be less pleasant developer UX in Java > but at least in C++ the fact that Flatbuffers results in zero build- > or runtime dependencies is a significant advantage. > There is a canonical JSON representation of all protobuf. Using that representation requires zero build and runtime dependencies. One of the strengths of the Arrow community is the combination of SQL and data science expertise. The Gandiva work is used in production within a SQL engine on 1000s of nodes everyday (and within a Java context). Introducing a second row expression approach feels like NIH. Especially when the first use case is trying to pass row-wise expressions between Java and C++ (which is exactly what the Gandiva expression syntax was built to do). I'm against extending use of flatbuf within Arrow. The language support is too weak. Language support isn't just about having a binding for different languages, it is about having a high-quality binding.