[ 
https://issues.apache.org/jira/browse/ARROW-17668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605508#comment-17605508
 ] 

Kouhei Sutou commented on ARROW-17668:
--------------------------------------

Could you post this proposal to {{[email protected]}} to get more feedback?

> [C++][Gandiva] Add parser frontend for Gandiva
> ----------------------------------------------
>
>                 Key: ARROW-17668
>                 URL: https://issues.apache.org/jira/browse/ARROW-17668
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++ - Gandiva
>            Reporter: Jin Shang
>            Assignee: Jin Shang
>            Priority: Minor
>
> This issue is a feature proposal to add a parser frontend for Gandiva.
> h2. Background
> My team uses an expression computation library for our C++ feature 
> engineering pipeline. We currently use 
> [Exprtk]([https://github.com/ArashPartow/exprtk]). We recently tried out 
> Gandiva and wrote some benchmarks. We discovered that Gandiva is several 
> times faster than Exprtk in our use cases. Therefore we decided to switch to 
> Gandiva for computing expressions. 
> h2. Objective
> As of current, due to its lack of a frontend, we need to manually construct 
> an AST to use Gandiva. This is inconvenient and requires extra learning cost. 
> We would like to implement a parser frontend for Gandiva, so that Gandiva 
> becomes a standalone complete expression compiler and evaluator, and a 
> drop-in replacement for the existing libraries like like 
> [Exprtk]([https://github.com/ArashPartow/exprtk]) and 
> [TinyExpr]([https://github.com/codeplea/tinyexpr]). The goal is to enable the 
> following functionality:
>  
> {code:cpp}
> // Create schema for gandiva
> auto field_x = arrow::field(x, arrow::int32());
> auto field_y = arrow::field(y, arrow::float64());
> auto field_z = arrow::field(z, arrow::boolean());
> auto schema = arrow::scehma({field_x, field_y, field_z});
> // Use the Parser to generate an ExpressionPtr
> std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
> auto parser = gandiva::Parser();
> std::shared_ptr<gandiva::ExpressionPtr> expr_ptr;
> auto status = parser.Parse(schema, expr_str, &expr_ptr);
> // The rest is normal usage of Gandiva projector
> std::shared_ptr<gandiva::Projector> projector;
> auto status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);
> auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
> auto* pool = arrow::default_memory_pool();
> arrow::ArrayVector outputs;
> auto status = projector->Evaluate(*in_batch, pool, &outputs);
> {code}
>  
> h2. Syntax
> The goal is to design a succinct and intuitive grammar for both schema and 
> Gandiva expressions. We will need a corresponding grammar for each Node type 
> in Gandiva.
>  - Literals: We find Rust’s literal 
> representation([https://doc.rust-lang.org/rust-by-example/types/literals.html]([https://doc.rust-lang.org/rust-by-example/types/literals.html]))
>  very intuitive. We’ll support suffix such as `i32`, `u64`, `f32` to denote a 
> literal’s type. The types of unsuffixed literals are inferred by their usage. 
> Otherwise, the default type for integers is `int32` and `float32` for 
> floating points. String and binary literals are wrapped with single or double 
> quotes. Decimal128 literals will not be supported in the first version.
>  - Fields: Just their names as defined in the schema. To avoid conflicts with 
> other node types, field names must start with alphabetical letters.
>  - Functions: `<function_name>(<param1>, <param2>, ...)`. For functions with 
> multiple overloads, their return type is inferred from input types. For 
> commonly used functions, we would also like to support infix forms. They 
> include:
>     - Comparisons: equal(==), not equal(!=), greater than(>), greater than or 
> equal to(>=), less than(<), less than or equal to(<=)
>     - Arithmetics: add( +), substract( -), multiply( *), divide( /), 
> modulo(%), power({^}), bitwise and(&), bitwise or(|), bitwise xor({^}), 
> bitwise not(~)
>     
>     Function alias with spaces in their names won’t be supported such as “is 
> not false” are not supported.
>     
>  - If: We could like to support two grammars for if expressions:
>     - `if(<cond>, <then>, <else>)` for its simplicity and functional feel;
>     - `if(<cond>) \{ <then> } else \{ <else> }` since it’s the C++ `if` 
> grammar and has better formatting for complex expressions.
>  - Boolean: We would like to support both `&& ||` and `and or` keywords same 
> as C++.
>  - InExpression: `<eval> in (<member1>, <member2>, ...)` . Its type is also 
> inferred.
> The grammar can be roughly represented as:
> {code:cpp}
> exp := literal | field | function | if | boolean | inexpression
> literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL | 
> FLOAT_LITERAL_SUFFIX | "-" literal
> field := IDENTIFIER
> function := IDENTIFIER(arguments) | infixfunc
> infixfunc := exp "+" exp | exp "-" exp | ...
> arguments := exp | argument "," exp
> if := IF "(" exp "," exp "," exp ")" | "if" "(" exp ")" "{" exp "}" ELSE "{" 
> exp "}"
> boolean := and_boolean | or_boolean
> and_boolean := exp AND exp | exp "&&" exp
> or_boolean := exp OR exp | exp "||" exp
> inexpression := exp IN "(" arguments ")"{code}
> lower cases are non-terminals and upper cases are tokens.
> h2. Implementation
> h3. Lexing and Parsing
> We would like to use flex and bison for lexing and parsing. They have several 
> advantages compared to other options like Antlr4 and Boost::Spirit.
>  # They are the most classical and popular parsing library in the cpp world.
>  # They allow us to precompile the parser codes and have no runtime 
> dependencies.
> Flex&bison takes a string and outputs a `gandiva::node` tree. The tree may 
> contain untyped nodes, e.g., unsuffixed literals and functions. 
> h3. Type Inference
> We’ll have a TypeInference class that inherits node visitor, implementing a 
> simple DFS algorithm to do type inference. 
> For functions, 
>  # Get all available signatures of this function from the function registry.
>  # Find all candidate signatures that match the current known argument and 
> return type.
>  ## If none of them matches, there is a type error.
>  ## If only one matches, this is the signature. From the signature we also 
> know the type of each argument. Visit each argument node carrying their 
> return type.
>  ## If more than one matches, visit each argument node first. Then we do 
> matching again. It is guaranteed to have zero or one matches because no 
> function has overloads that differ only on their return types, because at 
> this point each argument’s type is known (because untyped literals are given 
> default types upon visit, see below). 
>  ### EDIT: After this process, we need to revisit each arg node again. 
> Consider the example expression "add(add(x, 1), add(2, 3))" where x is a u64 
> field, after the first traversal we know that add(x,1) is of type u64 and 
> thus the whole expression is of type u64. We need to carry this info to 
> add(2,3), otherwise the node's type is still unknown.
> The process is similar for `if`, `boolean`and `in expression` , only their 
> “signatures” are manually constructed. For example `if` has `(bool, <type>, 
> <type>)–> <type>`and `boolean` has `(bool, bool)–>bool`.
> For untyped literals,
>  # If a type is passed from its parent, the type is known.
>  # Otherwise set it to the default type.
> Any feedback is appreciated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to