Jonathon Lee created SPARK-48719:
------------------------------------
Summary: Wrong Result in regr_slope®r_intercept Aggregate with
Tuples has NULL
Key: SPARK-48719
URL: https://issues.apache.org/jira/browse/SPARK-48719
Project: Spark
Issue Type: Bug
Components: Spark Core, SQL
Affects Versions: 3.4.0
Reporter: Jonathon Lee
When calculate slope and intercept using regr_slope & regr_intercept aggregate:
(using Java api)
{code:java}
spark.sql("drop table if exists tab");
spark.sql("CREATE TABLE tab(y int, x int) using parquet");
spark.sql("INSERT INTO tab VALUES (1, 1)");
spark.sql("INSERT INTO tab VALUES (2, 3)");
spark.sql("INSERT INTO tab VALUES (3, 5)");
spark.sql("INSERT INTO tab VALUES (NULL, 3)");
spark.sql("INSERT INTO tab VALUES (3, NULL)");
spark.sql("SELECT " +
"regr_slope(x, y), " +
"regr_intercept(x, y)" +
"FROM tab").show(); {code}
Spark result:
{code:java}
+------------------+--------------------+
| regr_slope(x, y)|regr_intercept(x, y)|
+------------------+--------------------+
|1.4545454545454546| 0.09090909090909083|
+------------------+--------------------+ {code}
The correct answer should be 2.0 and -1.0 obviously.
Reason:
In sql/catalyst/expressions/aggregate/linearRegression.scala,
{code:java}
case class RegrSlope(left: Expression, right: Expression) extends
DeclarativeAggregate
with ImplicitCastInputTypes with BinaryLike[Expression] {
private val covarPop = new CovPopulation(right, left)
private val varPop = new VariancePop(right)
...... {code}
CovPopulation will filter tuples which right *OR* left is NULL
But VariancePop will only filter null right expression.
This will cause wrong result when some of the tuples' left is null (and right
is not null).
Same reason with RegrIntercept.
A possible fix:
{code:java}
case class RegrSlope(left: Expression, right: Expression) extends
DeclarativeAggregate
with ImplicitCastInputTypes with BinaryLike[Expression] {
private val covarPop = new CovPopulation(right, left)
private val varPop = new VariancePop(If(And(IsNotNull(left),
IsNotNull(right)),
right, Literal.create(null, right.dataType)))
.....{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]