ozancicek opened a new pull request #24939:
URL: https://github.com/apache/spark/pull/24939
## What changes were proposed in this pull request?
Added support for arithmetic expressions, Spark functions, and registered
UDFs inside an `RFormula` formula, so that expressions like these work as
intended:
- `log(y) ~ a + pow(b, 2)`: the Spark functions `log()` and `pow()` are
applied to `y` and `b`
- `I(y + b) ~ a + x`: the label term is the sum of `y` and `b`

UDFs can also be used once they are registered:
```scala
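import org.apache.spark.ml.feature.RFormula
import spark.implicits._ // for toDF; assumes a SparkSession named `spark`, as in spark-shell

// Register the UDF by name so the formula can refer to it.
val registeredUdf = spark.udf.register("plusTwo", (x: Int) => x + 2)
val formula = new RFormula()
  .setFormula("plusTwo(y) ~ a + plusTwo(b)")
val df = Seq((1, 4, 4), (2, 5, 6)).toDF("y", "a", "b")
val model = formula.fit(df)
model.transform(df).show()
// +---+---+---+---------+-----+
// |  y|  a|  b| features|label|
// +---+---+---+---------+-----+
// |  1|  4|  4|[4.0,6.0]|  3.0|
// |  2|  5|  6|[5.0,8.0]|  4.0|
// +---+---+---+---------+-----+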
```
Summary of changes:
- Added an `EvalExprParser` trait for parsing arithmetic expressions inside
a formula
- Added an `ExprSelector` transformer, which adds columns to a DataFrame
using the Spark `expr` function
- Added `ExprSelector` and `ColumnPruner` stages for parsed arithmetic
expressions
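
For illustration, the `expr`-based column generation described in the list above can be sketched roughly as follows. This is a minimal sketch, not the `ExprSelector` added in this PR: the object name `ExprColumnsSketch`, the helper `addExprColumns`, and the output column names are hypothetical. Only `org.apache.spark.sql.functions.expr` and `DataFrame.withColumn` are existing Spark APIs.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// Hypothetical helper, for illustration only.
object ExprColumnsSketch {
  // Each pair is (output column name, SQL expression string).
  // The expression string is handed to `expr` unchanged, so any
  // built-in function or registered UDF it mentions is resolved by Spark.
  def addExprColumns(df: DataFrame, exprs: Seq[(String, String)]): DataFrame =
    exprs.foldLeft(df) { case (acc, (name, sqlText)) =>
      acc.withColumn(name, expr(sqlText))
    }
}
```

Calling `ExprColumnsSketch.addExprColumns(df, Seq("label_expr" -> "y + b", "b_sq" -> "pow(b, 2)"))` would append the two derived columns after the existing ones.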
A term is parsed as an arithmetic expression if it is wrapped in
'I(' expression ')' or if it is a function call (matched as ASCII chars +
alphanumeric chars + '(' args ')'). Features are generated from the parsed
expressions with the Spark `expr` function, so anything that can be used
with `expr` should be valid. Because `expr` can be used in so many ways, the
parser does not check whether an expression has valid syntax or whether a
function is defined; the only parsing rule is balanced parentheses.
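
As a rough illustration of the rule above, the sketch below classifies a single, already isolated term. It is not the `EvalExprParser` from this PR. The names `TermClassifierSketch`, `Identity`, `FunctionCall`, `Plain`, `balanced`, and `classify` are hypothetical, and the function-name pattern (an ASCII letter followed by ASCII letters or digits) is an assumed reading of the criteria described here.

```scala
// Hypothetical sketch, for illustration only.
object TermClassifierSketch {
  sealed trait Term
  case class Identity(body: String) extends Term     // I( expression )
  case class FunctionCall(text: String) extends Term // name( args )
  case class Plain(name: String) extends Term        // ordinary column term

  // True when every ')' closes an earlier '(' and none is left open.
  def balanced(s: String): Boolean = {
    val finalDepth = s.foldLeft(0) { (depth, c) =>
      if (depth < 0) depth // a ')' came too early; stay negative
      else if (c == '(') depth + 1
      else if (c == ')') depth - 1
      else depth
    }
    finalDepth == 0
  }

  // Assumed patterns: a regex extractor only matches the whole string.
  private val IdentityPattern = """I\((.*)\)""".r
  private val FunctionPattern = """[a-zA-Z][a-zA-Z0-9]*\(.*\)""".r

  def classify(raw: String): Term = {
    val term = raw.trim
    term match {
      // Check I(...) first, since "I" also looks like a function name.
      case IdentityPattern(body) if balanced(body) => Identity(body.trim)
      case FunctionPattern() if balanced(term)     => FunctionCall(term)
      case _                                       => Plain(term)
    }
  }
}
```

In this sketch, the body of an `Identity` term or the text of a `FunctionCall` term is the string that would be passed on to `expr`. Nothing beyond parenthesis balance is validated, which mirrors the behavior described above.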
## How was this patch tested?
Added unit tests to `RFormulaParserSuite` and `RFormulaSuite`.
@srowen @felixcheung
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.