[GitHub] [spark] Stove-hust commented on a change in pull request #35363: [SPARK-38066][SQL] evaluateEquality should ignore attribute without min/max ColumnStat

GitBox Thu, 24 Feb 2022 00:30:40 -0800


Stove-hust commented on a change in pull request #35363:
URL: https://github.com/apache/spark/pull/35363#discussion_r813650008




##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
##########
@@ -311,6 +311,16 @@ case class FilterEstimation(plan: Filter) extends Logging {
       logDebug("[CBO] No statistics for " + attr)
       return None
     }
+
+    attr.dataType match {
+      case _: NumericType | DateType | TimestampType | BooleanType =>
+        if (!colStatsMap.hasMinMaxStats(attr)) {

Review comment:
       > I'm not confident to merge this PR without the answer of this question.
   
   The background is that I am working on an Executor OOM problem caused by 
CBO broadcasting large tables.
   Our Hive tables are all stored in ORC format, and our HMS does not store 
per-column statistics, so I tried to obtain the statistics of each column by 
reading the ORC footers. Reading the footers of all ORC files is too 
expensive, so I chose to sample some ORC files to get avgLen. In this case 
there is no way to get exact min/max values.
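
   A minimal sketch of the situation described above, not the actual patch. 
`SketchColumnStat`, `FileSample`, `fromSamples`, and `equalitySelectivity` 
are hypothetical names for illustration and are not Spark or ORC APIs; the 
per-file byte and value counts are an assumed shape for whatever the sampled 
footers provide. It shows why sampled stats carry avgLen but no min/max, and 
how an equality estimate can decline to answer in that case.

```scala
// Hypothetical, simplified stand-in for a column statistic; NOT Spark's ColumnStat.
final case class SketchColumnStat(
    distinctCount: Option[BigInt],
    min: Option[Double],
    max: Option[Double],
    avgLen: Option[Long])

object SampledStatsSketch {
  // Assumed per-file figures taken from a sampled ORC footer.
  final case class FileSample(totalBytes: Long, valueCount: Long)

  // avgLen can be estimated from a sample of files. min/max cannot be exact
  // unless every file is read, so they are left undefined here.
  def fromSamples(samples: Seq[FileSample]): SketchColumnStat = {
    val values = samples.map(_.valueCount).sum
    val bytes = samples.map(_.totalBytes).sum
    val avg = if (values > 0) Some(math.max(1L, bytes / values)) else None
    SketchColumnStat(distinctCount = None, min = None, max = None, avgLen = avg)
  }

  // Returns None (no estimate) when min, max, or the distinct count is
  // missing, instead of failing on an absent value.
  def equalitySelectivity(s: SketchColumnStat, literal: Double): Option[Double] =
    (s.min, s.max, s.distinctCount) match {
      case (Some(lo), Some(hi), Some(ndv)) if ndv > 0 =>
        // Inside the range: assume uniform distribution over distinct values.
        if (literal >= lo && literal <= hi) Some(1.0 / ndv.toDouble) else Some(0.0)
      case _ => None
    }
}
```

   With stats built by `fromSamples`, `equalitySelectivity` returns `None`, 
so a caller would fall back to its default behavior rather than rely on 
min/max values that were never collected.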
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


