[
https://issues.apache.org/jira/browse/ARROW-17668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605508#comment-17605508
]
Kouhei Sutou commented on ARROW-17668:
--------------------------------------
Could you post this proposal to {{[email protected]}} to get more feedback?
> [C++][Gandiva] Add parser frontend for Gandiva
> ----------------------------------------------
>
> Key: ARROW-17668
> URL: https://issues.apache.org/jira/browse/ARROW-17668
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++ - Gandiva
> Reporter: Jin Shang
> Assignee: Jin Shang
> Priority: Minor
>
> This issue is a feature proposal to add a parser frontend for Gandiva.
> h2. Background
> My team uses an expression computation library for our C++ feature
> engineering pipeline. We currently use
> [Exprtk]([https://github.com/ArashPartow/exprtk]). We recently tried out
> Gandiva and wrote some benchmarks. We discovered that Gandiva is several
> times faster than Exprtk in our use cases. Therefore we decided to switch to
> Gandiva for computing expressions.
> h2. Objective
> As of current, due to its lack of a frontend, we need to manually construct
> an AST to use Gandiva. This is inconvenient and requires extra learning cost.
> We would like to implement a parser frontend for Gandiva, so that Gandiva
> becomes a standalone complete expression compiler and evaluator, and a
> drop-in replacement for the existing libraries like like
> [Exprtk]([https://github.com/ArashPartow/exprtk]) and
> [TinyExpr]([https://github.com/codeplea/tinyexpr]). The goal is to enable the
> following functionality:
>
> {code:cpp}
> // Create schema for gandiva
> auto field_x = arrow::field(x, arrow::int32());
> auto field_y = arrow::field(y, arrow::float64());
> auto field_z = arrow::field(z, arrow::boolean());
> auto schema = arrow::scehma({field_x, field_y, field_z});
> // Use the Parser to generate an ExpressionPtr
> std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
> auto parser = gandiva::Parser();
> std::shared_ptr<gandiva::ExpressionPtr> expr_ptr;
> auto status = parser.Parse(schema, expr_str, &expr_ptr);
> // The rest is normal usage of Gandiva projector
> std::shared_ptr<gandiva::Projector> projector;
> auto status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);
> auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
> auto* pool = arrow::default_memory_pool();
> arrow::ArrayVector outputs;
> auto status = projector->Evaluate(*in_batch, pool, &outputs);
> {code}
>
> h2. Syntax
> The goal is to design a succinct and intuitive grammar for both schema and
> Gandiva expressions. We will need a corresponding grammar for each Node type
> in Gandiva.
> - Literals: We find Rust’s literal
> representation([https://doc.rust-lang.org/rust-by-example/types/literals.html]([https://doc.rust-lang.org/rust-by-example/types/literals.html]))
> very intuitive. We’ll support suffix such as `i32`, `u64`, `f32` to denote a
> literal’s type. The types of unsuffixed literals are inferred by their usage.
> Otherwise, the default type for integers is `int32` and `float32` for
> floating points. String and binary literals are wrapped with single or double
> quotes. Decimal128 literals will not be supported in the first version.
> - Fields: Just their names as defined in the schema. To avoid conflicts with
> other node types, field names must start with alphabetical letters.
> - Functions: `<function_name>(<param1>, <param2>, ...)`. For functions with
> multiple overloads, their return type is inferred from input types. For
> commonly used functions, we would also like to support infix forms. They
> include:
> - Comparisons: equal(==), not equal(!=), greater than(>), greater than or
> equal to(>=), less than(<), less than or equal to(<=)
> - Arithmetics: add( +), substract( -), multiply( *), divide( /),
> modulo(%), power({^}), bitwise and(&), bitwise or(|), bitwise xor({^}),
> bitwise not(~)
>
> Function alias with spaces in their names won’t be supported such as “is
> not false” are not supported.
>
> - If: We could like to support two grammars for if expressions:
> - `if(<cond>, <then>, <else>)` for its simplicity and functional feel;
> - `if(<cond>) \{ <then> } else \{ <else> }` since it’s the C++ `if`
> grammar and has better formatting for complex expressions.
> - Boolean: We would like to support both `&& ||` and `and or` keywords same
> as C++.
> - InExpression: `<eval> in (<member1>, <member2>, ...)` . Its type is also
> inferred.
> The grammar can be roughly represented as:
> {code:cpp}
> exp := literal | field | function | if | boolean | inexpression
> literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL |
> FLOAT_LITERAL_SUFFIX | "-" literal
> field := IDENTIFIER
> function := IDENTIFIER(arguments) | infixfunc
> infixfunc := exp "+" exp | exp "-" exp | ...
> arguments := exp | argument "," exp
> if := IF "(" exp "," exp "," exp ")" | "if" "(" exp ")" "{" exp "}" ELSE "{"
> exp "}"
> boolean := and_boolean | or_boolean
> and_boolean := exp AND exp | exp "&&" exp
> or_boolean := exp OR exp | exp "||" exp
> inexpression := exp IN "(" arguments ")"{code}
> lower cases are non-terminals and upper cases are tokens.
> h2. Implementation
> h3. Lexing and Parsing
> We would like to use flex and bison for lexing and parsing. They have several
> advantages compared to other options like Antlr4 and Boost::Spirit.
> # They are the most classical and popular parsing library in the cpp world.
> # They allow us to precompile the parser codes and have no runtime
> dependencies.
> Flex&bison takes a string and outputs a `gandiva::node` tree. The tree may
> contain untyped nodes, e.g., unsuffixed literals and functions.
> h3. Type Inference
> We’ll have a TypeInference class that inherits node visitor, implementing a
> simple DFS algorithm to do type inference.
> For functions,
> # Get all available signatures of this function from the function registry.
> # Find all candidate signatures that match the current known argument and
> return type.
> ## If none of them matches, there is a type error.
> ## If only one matches, this is the signature. From the signature we also
> know the type of each argument. Visit each argument node carrying their
> return type.
> ## If more than one matches, visit each argument node first. Then we do
> matching again. It is guaranteed to have zero or one matches because no
> function has overloads that differ only on their return types, because at
> this point each argument’s type is known (because untyped literals are given
> default types upon visit, see below).
> ### EDIT: After this process, we need to revisit each arg node again.
> Consider the example expression "add(add(x, 1), add(2, 3))" where x is a u64
> field, after the first traversal we know that add(x,1) is of type u64 and
> thus the whole expression is of type u64. We need to carry this info to
> add(2,3), otherwise the node's type is still unknown.
> The process is similar for `if`, `boolean`and `in expression` , only their
> “signatures” are manually constructed. For example `if` has `(bool, <type>,
> <type>)–> <type>`and `boolean` has `(bool, bool)–>bool`.
> For untyped literals,
> # If a type is passed from its parent, the type is known.
> # Otherwise set it to the default type.
> Any feedback is appreciated.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)