Background
My team uses an expression computation library for our C++ feature
engineering pipeline. We currently use Exprtk
<https://github.com/ArashPartow/exprtk>. We recently tried out Gandiva and
wrote some benchmarks. We discovered that Gandiva is several times faster
than Exprtk in our use cases. Therefore we decided to switch to Gandiva for
computing expressions.
Objective
Currently, because Gandiva lacks a frontend, we need to manually construct
an AST to use it. This is inconvenient and adds a learning cost. We also
want to enable our ML engineers to dynamically create or update an
expression via runtime hot-loaded configs without restarting our server.
This is currently impossible with Gandiva because the expression tree is
statically created in C++ and must be compiled into the server binary.
Therefore, we would like to implement a parser frontend for Gandiva, so
that Gandiva becomes a complete standalone expression compiler and
evaluator, and a drop-in replacement for existing libraries such as Exprtk
<https://github.com/ArashPartow/exprtk> and TinyExpr
<https://github.com/codeplea/tinyexpr>. The goal is to enable the following
functionality:
// Create schema for gandiva
auto field_x = arrow::field("x", arrow::uint64());
auto field_y = arrow::field("y", arrow::float64());
auto field_z = arrow::field("z", arrow::boolean());
auto schema = arrow::schema({field_x, field_y, field_z});
auto result_field = arrow::field("result", arrow::float64());
/** BEGIN CHANGE **/
// Use the Parser to generate a NodePtr
std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
auto parser = gandiva::Parser(schema);
gandiva::NodePtr root_node;
auto parse_status = parser.Parse(expr_str, &root_node);
/** END CHANGE **/
// The rest is normal usage of the Gandiva projector
auto expr_ptr = std::make_shared<gandiva::Expression>(root_node, result_field);
std::shared_ptr<gandiva::Projector> projector;
auto make_status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);
// in_batch is built from the input arrays (input_arr) and the row count (size)
auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
auto* pool = arrow::default_memory_pool();
arrow::ArrayVector outputs;
auto eval_status = projector->Evaluate(*in_batch, pool, &outputs);
The code block enclosed by "BEGIN CHANGE" and "END CHANGE" is the proposed
usage of the parser. It offers two benefits:
1. Writing math expressions as strings is more intuitive than constructing
trees by hand, and thus easier to use (see the comparison sketch after this
list).
2. It allows dynamically adding new expressions and changing existing ones
with a runtime hot-loaded config file, without restarting our server.
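For comparison, here is a rough sketch of what building the same expression
by hand looks like today with TreeExprBuilder. The registry function names
used below ("castFLOAT8", "multiply") are my recollection of the Gandiva
function registry and may need checking:
// Manually constructing the AST for "if(z, castFloat8(x), y * 1000.0)"
auto node_x = gandiva::TreeExprBuilder::MakeField(field_x);
auto node_y = gandiva::TreeExprBuilder::MakeField(field_y);
auto node_z = gandiva::TreeExprBuilder::MakeField(field_z);
auto cast_x = gandiva::TreeExprBuilder::MakeFunction("castFLOAT8", {node_x}, arrow::float64());
auto lit_1000 = gandiva::TreeExprBuilder::MakeLiteral(1000.0);
auto y_times_1000 = gandiva::TreeExprBuilder::MakeFunction("multiply", {node_y, lit_1000}, arrow::float64());
auto root_node_manual = gandiva::TreeExprBuilder::MakeIf(node_z, cast_x, y_times_1000, arrow::float64());
With the parser, the same tree is expressed as a single string that can live
in a hot-loaded config file.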
Syntax
The goal is to design a succinct and intuitive grammar for both the schema
and Gandiva expressions. We will need a corresponding grammar for each Node
type in Gandiva; a few example expressions combining these forms are shown
after the list of node types below.
- Literals: We find Rust's literal representation
(<https://doc.rust-lang.org/rust-by-example/types/literals.html>) very
intuitive. We'll support suffixes such as i32, u64 and f32 to denote a
literal node's type. The types of unsuffixed literals are inferred from
their usage; otherwise, the default type is int32 for integers and float32
for floating points. String and binary literals are wrapped in single or
double quotes. Decimal128 literals will not be supported in the first
version.
- Fields: Just their names as defined in the schema. To avoid conflicts
with other node types, field names must start with an alphabetical letter.
- Functions: <function_name>(<param1>, <param2>, ...). For functions with
multiple overloads, the return type is inferred from the input types. For
commonly used functions, we would also like to support infix forms. They
include:
  - Comparisons: equal (==), not equal (!=), greater than (>), greater than
    or equal to (>=), less than (<), less than or equal to (<=)
  - Arithmetic: add (+), subtract (-), multiply (*), divide (/), modulo (%),
    power (^), bitwise and (&), bitwise or (|), bitwise xor (^), bitwise not (~)
  Function aliases whose names contain spaces, such as "is not false",
  won't be supported.
- Ifs: We would like to support two grammars for if expressions:
  - if(<cond>, <then>, <else>) for its simplicity and functional feel;
  - if(<cond>) { <then> } else { <else> } since it's the C++ if grammar
    and has better formatting for complex expressions.
- Booleans: We would like to support both the && / || operators and the
and / or keywords, the same as C++.
- InExpressions: <eval> in (<member1>, <member2>, ...). Its type is also
inferred.
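To make the proposed syntax concrete, here are a few expressions over the
schema from the Objective section that the parser should accept
(illustrative only; exact function names such as castFloat8 come from the
Gandiva function registry):
3u64 + x                                  // suffixed integer literal added to field x
if(z, castFloat8(x), y * 1000.0)          // functional if form with a cast and an infix multiply
if(x > 5 && y < 2.5) { x } else { 0u64 }  // C++-style if with infix comparisons and booleans
x in (1, 2, 3)                            // in-expression; member types are inferred from x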
The grammar can be roughly represented as:
exp := term
term := literal | field | function | if | boolean | inexpression
literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL |
  FLOAT_LITERAL_SUFFIX | "-" literal
field := IDENTIFIER
function := infix_function | named_function
infix_function := "-" term | term "+" term | term "-" term ...
named_function := IDENTIFIER "(" arguments ")"
arguments := term | arguments "," term
if := "if" "(" term "," term "," term ")" |
  "if" "(" term ")" "{" term "}" ELSE "{" term "}"
boolean := and_boolean | or_boolean
and_boolean := term AND term | term "&&" term
or_boolean := term OR term | term "||" term
inexpression := term IN "(" arguments ")"
Lowercase names are non-terminals and uppercase names are tokens.
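As a quick sanity check of the grammar, the example expression from the
Objective section, if(z, castFloat8(x), y * 1000.0), matches
if := "if" "(" term "," term "," term ")", where the first term derives a
field (z), the second term derives a named_function := IDENTIFIER "("
arguments ")" (castFloat8 applied to the field x), and the third term
derives an infix_function := term "*" term (the field y multiplied by the
FLOAT_LITERAL 1000.0).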
Implementation
Lexing and Parsing
We would like to use Flex and Bison for lexing and parsing. They have
several advantages compared to other options like ANTLR4 and Boost.Spirit:
1. They are the most classical and popular parsing tools in the C++
world.
2. They allow us to precompile the parser code and have no runtime
dependencies.
Flex and Bison take a string and output a gandiva::Node tree. The tree may
contain untyped nodes, e.g., unsuffixed literals and functions whose
return types are not yet known.
Type Inference
We'll have a TypeInferenceVisitor class that inherits from the node
visitor, implementing a 2-pass DFS algorithm to do type inference. In each
pass, the visitor tries to infer the current node's and its children's
types from the currently available info.
If it's a non-leaf node, such as a function, if or boolean node:
1. First visit each child: let them infer their own types as much as
they can.
2. Create a SignaturePattern based on the currently known types. The
pattern includes the current node's type and its children's types. Any of
these types can be nullptr, meaning that type is currently unknown. For
example, the SignaturePattern for func(x: u64, 5: untyped) is (u64,
nullptr) -> nullptr.
3. Get all available signatures of the current node. For functions, these
are the signatures registered in the function registry. For if, it's (bool,
<type>, <type>) -> <type> for any type, etc.
4. Try to match each signature against the current SignaturePattern:
   1. If none matches, it's a type error.
   2. If exactly one matches, that is the matched signature. We then update
   the current node's and its children's types based on this signature.
   3. If more than one matches, we extract a common pattern from the set
   of matched signatures. For example, (nullptr, nullptr) -> bool can be
   extracted from (double, double) -> bool and (int32, int32) -> bool. We
   then update types based on this common pattern.
If it's a leaf node, just update its value if its type has been set by its
parent. We need to do this because untyped literals' values are saved in a
generic container.
We run this procedure twice, with the second pass slightly different: in
the second pass, if a literal node is still untyped, we give it a default
type (int32 for ints and float32 for floats).
1. The first pass is a bottom-up propagation of types. A parent's type is
inferred from its children.
2. The second pass is both:
   1. a top-down propagation of the first pass' info: once a parent's type
   is known from the first pass, it may set its children's types;
   2. a bottom-up assignment of default literal types: a still-untyped
   literal is given its default type.
Since all leaf nodes' types are known in the second pass, the types of all
nodes are guaranteed to be inferred.
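To illustrate the matching in step 4, here is a minimal C++ sketch of how a
SignaturePattern could be matched against registered signatures and how a
common pattern could be extracted. The struct and helper functions below
are hypothetical and only mirror the idea; the prototype's actual classes
live in typeinference.h/cc and may differ.
#include <memory>
#include <vector>
#include <arrow/type.h>

// Hypothetical illustration; nullptr stands for "type not yet known".
using TypePtr = std::shared_ptr<arrow::DataType>;

struct SignaturePattern {
  std::vector<TypePtr> param_types;  // children's types
  TypePtr return_type;               // current node's type
};

// A pattern slot matches a concrete signature slot if it is unknown or equal.
bool Matches(const SignaturePattern& pattern, const SignaturePattern& signature) {
  if (pattern.param_types.size() != signature.param_types.size()) return false;
  for (size_t i = 0; i < pattern.param_types.size(); ++i) {
    if (pattern.param_types[i] &&
        !pattern.param_types[i]->Equals(*signature.param_types[i])) {
      return false;
    }
  }
  return !pattern.return_type || pattern.return_type->Equals(*signature.return_type);
}

// Keep the types all matched signatures agree on; disagreeing slots become nullptr.
// E.g. (double, double) -> bool and (int32, int32) -> bool yield (nullptr, nullptr) -> bool.
SignaturePattern CommonPattern(const std::vector<SignaturePattern>& matched) {
  SignaturePattern common = matched.front();
  for (const auto& sig : matched) {
    for (size_t i = 0; i < common.param_types.size(); ++i) {
      if (common.param_types[i] && !common.param_types[i]->Equals(*sig.param_types[i])) {
        common.param_types[i] = nullptr;
      }
    }
    if (common.return_type && !common.return_type->Equals(*sig.return_type)) {
      common.return_type = nullptr;
    }
  }
  return common;
}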
Proof of correctness
I have a proof of correctness of this inference procedure in mind, but it's
too long to be written down here. (Fermat did it too 🙂) The correctness
rests on these two facts:
1. There are no overloads that differ only in their return types. E.g.,
there are no func := (int32, int32) -> int32 and func := (int32,
int32) -> double in the Gandiva registry. Therefore we can always determine
a function's type once its children's types are known.
2. There are no overloads that accept only non-default numeric types.
For example, if there were a func := (int16, int16) -> int16 and a func :=
(int64, int64) -> int64, but no func := (int32, int32) -> int32, the
inference procedure would fail on the expression func(1, 2) because it
cannot tell the types of 1 and 2, and giving them the default types won't
work.
Prototype and Examples
I've already written a mostly working prototype of the parser and the type
inference visitor, along with unit tests for them. They are on the branch
<https://github.com/js8544/arrow/tree/jinshang/gandiva/type_inference>.
You can check out the diff here:
<https://github.com/apache/arrow/compare/master...js8544:arrow:jinshang/gandiva/type_inference>
The main files are:
1. cpp/src/gandiva/grammar.yy: grammar rules for Bison.
2. cpp/src/gandiva/lex.ll: lex rules for Flex.
3. cpp/src/gandiva/typeinference.h/cc: type inference procedure.
4. cpp/src/gandiva/parser.cc: the driver class that combines the three
components.
5. cpp/src/gandiva/parser_test.cc: unit tests containing examples of the
proposed syntax and the resulting expression trees the parser generates.
You can run the tests with cmake .. --preset=ninja-debug-gandiva followed
by ninja test-gandiva-tests.
Any suggestions or questions are appreciated!
Best regards,
Jin Shang