Github user ron8hu commented on a diff in the pull request:
https://github.com/apache/spark/pull/19783#discussion_r155963930
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   test("cbool > false") {
     validateEstimatedStats(
       Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
-      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
+      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),
--- End diff ---
Agreed with wzhfy. The current logic is: for the two conditions (column > x)
and (column >= x), we set the min value to x; we do not distinguish between the
two cases. This is because we do not know the exact next value larger than x
when x has a continuous data type such as double. We could add special handling
for discrete data types such as boolean or integer, but, as wzhfy said, it does
not deserve the added complexity.
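
To make the rule concrete, here is a minimal, self-contained sketch of the bound
update described above. This is not Spark's actual FilterEstimation code;
`SimpleColumnStat` and `updateMinForGreaterThan` are hypothetical names used
only for illustration.

```scala
// Simplified stand-in for Spark's ColumnStat, using Double for the bounds.
case class SimpleColumnStat(
    distinctCount: Long,
    min: Option[Double],
    max: Option[Double])

object GreaterThanEstimation {
  // Both (column > x) and (column >= x) map to the same update: set min to x.
  // For a continuous type like Double, the exact successor of x is unknown,
  // so a strict ">" cannot tighten the bound beyond x itself.
  // (Real estimation would also adjust distinctCount and row count;
  // that is omitted here.)
  def updateMinForGreaterThan(stat: SimpleColumnStat, x: Double): SimpleColumnStat =
    stat.copy(min = Some(x))

  def main(args: Array[String]): Unit = {
    // Encode booleans as 0.0/1.0: for (cbool > false), x = 0.0,
    // so the new min stays at false, matching the test's expectation
    // of min = Some(false).
    val boolStat = SimpleColumnStat(distinctCount = 2, min = Some(0.0), max = Some(1.0))
    println(updateMinForGreaterThan(boolStat, 0.0))
  }
}
```

Distinguishing > from >= would require knowing the next value above x, which
exists for discrete types (true after false, x + 1 for integers) but not for
doubles; hence the single shared rule.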