Stove-hust commented on a change in pull request #35363:
URL: https://github.com/apache/spark/pull/35363#discussion_r811746435
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
logDebug("[CBO] No statistics for " + attr)
return None
}
+
+ attr.dataType match {
+ case _: NumericType | DateType | TimestampType | BooleanType =>
+ if (!colStatsMap.hasMinMaxStats(attr)) {
Review comment:
min/max is indeed available, but in my case I don't use min/max at the
moment, so I don't store it.
More importantly, based on the considerations in SPARK-23445:
> `min`, `max` and `histogram` for columns were optional already. Having
them all optional is more consistent...
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
logDebug("[CBO] No statistics for " + attr)
return None
}
+
+ attr.dataType match {
+ case _: NumericType | DateType | TimestampType | BooleanType =>
+ if (!colStatsMap.hasMinMaxStats(attr)) {
Review comment:
> I'm not confident to merge this PR without the answer of this question.

Although the scenario I encountered is rather specific, I think this is
still a constructive improvement.
The evaluateEquality() method and the evaluateBinary() method should be
consistent.
For numeric types, evaluateBinary() performs three checks: 1.
!colStatsMap.contains(attr); 2. !colStatsMap.hasMinMaxStats(attr); 3.
!colStatsMap.hasDistinctCount(attr).
But evaluateEquality() performs only the first check.
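The asymmetry described above can be sketched with a minimal, self-contained
model. ColStats, statsMap, and the two evaluate*Sketch functions below are
illustrative stand-ins, not Spark's actual FilterEstimation code:

```scala
// Simplified stand-in for Spark's per-column statistics.
case class ColStats(distinctCount: Option[BigInt], min: Option[Int], max: Option[Int])

// cint5 has no min/max and no distinctCount; cint has full stats.
val statsMap = Map(
  "cint5" -> ColStats(None, None, None),
  "cint"  -> ColStats(Some(10), Some(1), Some(10)))

def hasMinMaxStats(attr: String): Boolean =
  statsMap.get(attr).exists(s => s.min.isDefined && s.max.isDefined)

def hasDistinctCount(attr: String): Boolean =
  statsMap.get(attr).exists(_.distinctCount.isDefined)

// evaluateBinary-style: three guards, each bailing out with None.
def evaluateBinarySketch(attr: String): Option[Double] =
  if (!statsMap.contains(attr)) None
  else if (!hasMinMaxStats(attr)) None
  else if (!hasDistinctCount(attr)) None
  else Some(1.0 / 3)

// evaluateEquality-style before the patch: only the containment guard.
def evaluateEqualitySketch(attr: String): Option[Double] =
  if (!statsMap.contains(attr)) None
  else Some(1.0 / 3)
```

With these sketches, a column lacking min/max falls through
evaluateEqualitySketch's single guard even though evaluateBinarySketch
would have rejected it, which is the inconsistency the comment points out.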
##########
File path:
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
##########
@@ -211,6 +215,13 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
expectedRowCount = 1)
}
+ test("cint5 = 2") {
+ validateEstimatedStats(
+ Filter(EqualTo(attrInt5, Literal(2)), childStatsTestPlan(Seq(attrInt5), 10L)),
+ Seq(attrInt5 -> ColumnStat(avgLen = Some(4))),
Review comment:
For the `cint5` column, if the evaluateEquality() method returns Some(0.0)
instead of None, the expected column stats here should be Nil and the
expectedRowCount on the next line should be 0.
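The effect on the expected row count comes down to simple arithmetic:
Filter's estimate scales the child's row count by the predicate's
selectivity, with an unknown selectivity (None) leaving the count
unchanged. expectedRows below is a hypothetical helper illustrating that
arithmetic, not Spark's actual code:

```scala
// None means "no estimate available": assume nothing is filtered out.
// Some(s) scales the child's row count by the selectivity s.
def expectedRows(childRows: Long, selectivity: Option[Double]): Long =
  math.ceil(childRows * selectivity.getOrElse(1.0)).toLong
```

Under this sketch, returning None for cint5 keeps expectedRowCount at the
child's 10 rows, while Some(0.0) would drive it to 0 — the difference the
comment is about.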
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
logDebug("[CBO] No statistics for " + attr)
return None
}
+
+ attr.dataType match {
+ case _: NumericType | DateType | TimestampType | BooleanType =>
+ if (!colStatsMap.hasMinMaxStats(attr)) {
+ logDebug(s"[CBO] No statistics for $attr")
+ return None
+ }
+ case _ =>
+ }
+
val colStat = colStatsMap(attr)
// decide if the value is in [min, max] of the column.
Review comment:
If a NumericType column does not store min/max, the statsInterval will be
NullValueInterval(), and the evaluateEquality() method will then return
Some(0.0), not None.
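Why missing min/max yields Some(0.0) rather than None can be sketched with
a simplified interval model. The types below are stand-ins for Spark's
ValueInterval hierarchy, not its actual implementation:

```scala
// An interval built from absent bounds contains no value.
sealed trait Interval { def contains(v: Double): Boolean }
case object NullInterval extends Interval {
  def contains(v: Double): Boolean = false
}
case class NumericInterval(min: Double, max: Double) extends Interval {
  def contains(v: Double): Boolean = min <= v && v <= max
}

def intervalFor(min: Option[Double], max: Option[Double]): Interval =
  (min, max) match {
    case (Some(lo), Some(hi)) => NumericInterval(lo, hi)
    case _                    => NullInterval
  }

// Equality selectivity: 1/ndv if the literal falls inside [min, max],
// otherwise zero — note it never answers "unknown" (None).
def equalitySelectivity(iv: Interval, literal: Double, ndv: Long): Option[Double] =
  if (iv.contains(literal)) Some(1.0 / ndv) else Some(0.0)
```

With no stored bounds, the containment test always fails, so the equality
branch reports zero selectivity instead of admitting it has no estimate.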
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
logDebug("[CBO] No statistics for " + attr)
return None
}
+
+ attr.dataType match {
+ case _: NumericType | DateType | TimestampType | BooleanType =>
+ if (!colStatsMap.hasMinMaxStats(attr)) {
Review comment:
> I'm not confident to merge this PR without the answer of this question.

Some background: I am trying to use CBO to solve the problem of large
tables being broadcast and causing Executor OOM.
Our Hive tables are stored in ORC format. I randomly sample several ORC
files to obtain the avgLen of each column, then predict the original data
size of the Hive table being read, thus avoiding large tables being
broadcast.
Since the ORC files are chosen by random sampling, there is no way to get
the exact min/max of each column; I only need an approximately accurate avgLen.
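The size prediction described here amounts to simple arithmetic over the
sampled avgLen values. A hedged sketch — the 8-byte per-row overhead and
the 10 MB threshold default are assumptions about Spark's accounting, not
taken from this PR:

```scala
// Estimated plan size: rows * (per-row overhead + sum of column avgLens).
// The 8-byte overhead is an assumption mirroring Spark-style row accounting.
def estimatedSizeInBytes(rowCount: Long, avgLens: Seq[Long]): BigInt =
  BigInt(rowCount) * (8L + avgLens.sum)

// A 1M-row table with columns of avgLen 4, 8, and 32 bytes.
val size = estimatedSizeInBytes(1000000L, Seq(4L, 8L, 32L))

// Default spark.sql.autoBroadcastJoinThreshold is 10 MB.
val autoBroadcastThreshold = 10L * 1024 * 1024
val safeToBroadcast = size <= autoBroadcastThreshold
```

Even a rough avgLen is enough for this decision: the estimate only has to
be accurate to within the broadcast threshold's order of magnitude, which
is why min/max can be skipped while avgLen cannot.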
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
logDebug("[CBO] No statistics for " + attr)
return None
}
+
+ attr.dataType match {
+ case _: NumericType | DateType | TimestampType | BooleanType =>
+ if (!colStatsMap.hasMinMaxStats(attr)) {
Review comment:
The pre-checks in the evaluateBinaryForNumeric() method explain this.
The call path is evaluateBinary() -> evaluateBinaryForNumeric().
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
logDebug("[CBO] No statistics for " + attr)
return None
}
+
+ attr.dataType match {
+ case _: NumericType | DateType | TimestampType | BooleanType =>
+ if (!colStatsMap.hasMinMaxStats(attr)) {
+ logDebug(s"[CBO] No statistics for $attr")
+ return None
+ }
+ case _ =>
+ }
+
val colStat = colStatsMap(attr)
// decide if the value is in [min, max] of the column.
Review comment:
Different return values affect the estimate() method of the Filter plan
(line 45 of FilterEstimation.scala).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]