Stove-hust commented on a change in pull request #35363:
URL: https://github.com/apache/spark/pull/35363#discussion_r811746435
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
logDebug("[CBO] No statistics for " + attr)
return None
}
+
+ attr.dataType match {
+ case _: NumericType | DateType | TimestampType | BooleanType =>
+ if (!colStatsMap.hasMinMaxStats(attr)) {
Review comment:
min/max is indeed available, but in my case I don't use min/max at the
moment, so I don't store it.
More importantly, based on the considerations in SPARK-23445:
> `min`, `max` and `histogram` for columns were optional already. Having
them all optional is more consistent...
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
logDebug("[CBO] No statistics for " + attr)
return None
}
+
+ attr.dataType match {
+ case _: NumericType | DateType | TimestampType | BooleanType =>
+ if (!colStatsMap.hasMinMaxStats(attr)) {
Review comment:
> I'm not confident to merge this PR without the answer of this question.

Although the scenario I encountered is rather specific, I think this is
still a constructive improvement.
The evaluateEquality() method and the evaluateBinary() method should be
consistent.
For numeric types, evaluateBinary() performs three checks: 1.
!colStatsMap.contains(attr); 2. !colStatsMap.hasMinMaxStats(attr); 3.
!colStatsMap.hasDistinctCount(attr).
But evaluateEquality() performs only the first check.
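The asymmetry described above can be sketched with a minimal, self-contained
model. ColStats, statsMap, and the two evaluate*Sketch functions below are
illustrative stand-ins, not Spark's actual FilterEstimation code:

```scala
// Simplified stand-in for Spark's per-column statistics.
case class ColStats(distinctCount: Option[BigInt], min: Option[Int], max: Option[Int])

// cint5 has no min/max and no distinctCount; cint has full stats.
val statsMap = Map(
  "cint5" -> ColStats(None, None, None),
  "cint"  -> ColStats(Some(10), Some(1), Some(10)))

def hasMinMaxStats(attr: String): Boolean =
  statsMap.get(attr).exists(s => s.min.isDefined && s.max.isDefined)

def hasDistinctCount(attr: String): Boolean =
  statsMap.get(attr).exists(_.distinctCount.isDefined)

// evaluateBinary-style: three guards, each bailing out with None.
def evaluateBinarySketch(attr: String): Option[Double] =
  if (!statsMap.contains(attr)) None
  else if (!hasMinMaxStats(attr)) None
  else if (!hasDistinctCount(attr)) None
  else Some(1.0 / 3)

// evaluateEquality-style before the patch: only the containment guard.
def evaluateEqualitySketch(attr: String): Option[Double] =
  if (!statsMap.contains(attr)) None
  else Some(1.0 / 3)
```

With these sketches, a column lacking min/max falls through
evaluateEqualitySketch's single guard even though evaluateBinarySketch
would have rejected it, which is the inconsistency the comment points out.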
##########
File path:
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
##########
@@ -211,6 +215,13 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
expectedRowCount = 1)
}
+ test("cint5 = 2") {
+ validateEstimatedStats(
+ Filter(EqualTo(attrInt5, Literal(2)), childStatsTestPlan(Seq(attrInt5), 10L)),
+ Seq(attrInt5 -> ColumnStat(avgLen = Some(4))),
Review comment:
For the `cint5` column, if the evaluateEquality() method returns Some(0.0)
instead of None, the expected column stats here should be Nil and the
expectedRowCount on the next line should be 0.
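The effect on the expected row count comes down to simple arithmetic:
Filter's estimate scales the child's row count by the predicate's
selectivity, with an unknown selectivity (None) leaving the count
unchanged. expectedRows below is a hypothetical helper illustrating that
arithmetic, not Spark's actual code:

```scala
// None means "no estimate available": assume nothing is filtered out.
// Some(s) scales the child's row count by the selectivity s.
def expectedRows(childRows: Long, selectivity: Option[Double]): Long =
  math.ceil(childRows * selectivity.getOrElse(1.0)).toLong
```

Under this sketch, returning None for cint5 keeps expectedRowCount at the
child's 10 rows, while Some(0.0) would drive it to 0 — the difference the
comment is about.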
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
logDebug("[CBO] No statistics for " + attr)
return None
}
+
+ attr.dataType match {
+ case _: NumericType | DateType | TimestampType | BooleanType =>
+ if (!colStatsMap.hasMinMaxStats(attr)) {
+ logDebug(s"[CBO] No statistics for $attr")
+ return None
+ }
+ case _ =>
+ }
+
val colStat = colStatsMap(attr)
// decide if the value is in [min, max] of the column.
Review comment:
If a NumericType column does not store min/max, the statsInterval will be
NullValueInterval(), and the evaluateEquality() method will then return
Some(0.0), not None.
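Why missing min/max yields Some(0.0) rather than None can be sketched with
a simplified interval model. The types below are stand-ins for Spark's
ValueInterval hierarchy, not its actual implementation:

```scala
// An interval built from absent bounds contains no value.
sealed trait Interval { def contains(v: Double): Boolean }
case object NullInterval extends Interval {
  def contains(v: Double): Boolean = false
}
case class NumericInterval(min: Double, max: Double) extends Interval {
  def contains(v: Double): Boolean = min <= v && v <= max
}

def intervalFor(min: Option[Double], max: Option[Double]): Interval =
  (min, max) match {
    case (Some(lo), Some(hi)) => NumericInterval(lo, hi)
    case _                    => NullInterval
  }

// Equality selectivity: 1/ndv if the literal falls inside [min, max],
// otherwise zero — note it never answers "unknown" (None).
def equalitySelectivity(iv: Interval, literal: Double, ndv: Long): Option[Double] =
  if (iv.contains(literal)) Some(1.0 / ndv) else Some(0.0)
```

With no stored bounds, the containment test always fails, so the equality
branch reports zero selectivity instead of admitting it has no estimate.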
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
logDebug("[CBO] No statistics for " + attr)
return None
}
+
+ attr.dataType match {
+ case _: NumericType | DateType | TimestampType | BooleanType =>
+ if (!colStatsMap.hasMinMaxStats(attr)) {
Review comment:
> I'm not confident to merge this PR without the answer of this question.

Some background: I am trying to use CBO to solve the problem of large
tables being broadcast and causing Executor OOM.
Our Hive tables are stored in ORC format. I randomly sample several ORC
files to obtain the avgLen of each column, then predict the original data
size of the Hive table being read, thus avoiding large tables being
broadcast.
Since the ORC files are chosen by random sampling, there is no way to get
the exact min/max of each column; I only need an approximately accurate avgLen.
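The size prediction described here amounts to simple arithmetic over the
sampled avgLen values. A hedged sketch — the 8-byte per-row overhead and
the 10 MB threshold default are assumptions about Spark's accounting, not
taken from this PR:

```scala
// Estimated plan size: rows * (per-row overhead + sum of column avgLens).
// The 8-byte overhead is an assumption mirroring Spark-style row accounting.
def estimatedSizeInBytes(rowCount: Long, avgLens: Seq[Long]): BigInt =
  BigInt(rowCount) * (8L + avgLens.sum)

// A 1M-row table with columns of avgLen 4, 8, and 32 bytes.
val size = estimatedSizeInBytes(1000000L, Seq(4L, 8L, 32L))

// Default spark.sql.autoBroadcastJoinThreshold is 10 MB.
val autoBroadcastThreshold = 10L * 1024 * 1024
val safeToBroadcast = size <= autoBroadcastThreshold
```

Even a rough avgLen is enough for this decision: the estimate only has to
be accurate to within the broadcast threshold's order of magnitude, which
is why min/max can be skipped while avgLen cannot.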
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
logDebug("[CBO] No statistics for " + attr)
return None
}
+
+ attr.dataType match {
+ case _: NumericType | DateType | TimestampType | BooleanType =>
+ if (!colStatsMap.hasMinMaxStats(attr)) {
Review comment:
The pre-checks in the evaluateBinaryForNumeric() method explain this.
The call path is evaluateBinary() -> evaluateBinaryForNumeric().
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
logDebug("[CBO] No statistics for " + attr)
return None
}
+
+ attr.dataType match {
+ case _: NumericType | DateType | TimestampType | BooleanType =>
+ if (!colStatsMap.hasMinMaxStats(attr)) {
+ logDebug(s"[CBO] No statistics for $attr")
+ return None
+ }
+ case _ =>
+ }
+
val colStat = colStatsMap(attr)
// decide if the value is in [min, max] of the column.
Review comment:
Different return values affect the estimate() method of the Filter plan
(line 45 of FilterEstimation.scala).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]