Github user ron8hu commented on a diff in the pull request:
https://github.com/apache/spark/pull/19783#discussion_r154254145
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -784,11 +879,16 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   def outputColumnStats(rowsBeforeFilter: BigInt, rowsAfterFilter: BigInt)
     : AttributeMap[ColumnStat] = {
     val newColumnStats = originalMap.map { case (attr, oriColStat) =>
-      // Update ndv based on the overall filter selectivity: scale down ndv if the number of rows
-      // decreases; otherwise keep it unchanged.
-      val newNdv = EstimationUtils.updateNdv(oldNumRows = rowsBeforeFilter,
-        newNumRows = rowsAfterFilter, oldNdv = oriColStat.distinctCount)
       val colStat = updatedMap.get(attr.exprId).map(_._2).getOrElse(oriColStat)
+      val newNdv = if (colStat.distinctCount > 1) {
--- End diff ---
The old code does not work well for a couple of the new skewed-distribution
tests. For example, test("cintHgm < 3") would fail because the old code still
recomputes newNdv in the updateNdv() method, when in reality the histogram-based
estimation has already scaled it down to 1.
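
To make the failure mode concrete, here is a minimal, self-contained sketch of
the two behaviors. The names NdvUpdateSketch, updateNdvProportionally, and
updateNdvGuarded are hypothetical, not Spark's API, and the integer-ceiling
scaling is only an approximation of what EstimationUtils.updateNdv does:

    object NdvUpdateSketch {
      // Proportional scaling, as in the old code: shrink ndv with the row
      // count (roughly ceil(oldNdv * newNumRows / oldNumRows)), but never
      // below 1; keep ndv unchanged when the row count does not decrease.
      def updateNdvProportionally(oldNumRows: BigInt, newNumRows: BigInt,
          oldNdv: BigInt): BigInt = {
        if (newNumRows < oldNumRows) {
          val scaled = (oldNdv * newNumRows + oldNumRows - 1) / oldNumRows
          scaled.max(1)
        } else {
          oldNdv
        }
      }

      // Guarded update, mirroring the `if (colStat.distinctCount > 1)` check
      // in the diff above: if an earlier per-column estimation step (e.g. a
      // histogram-based range predicate) has already pinned the ndv to 1,
      // keep it at 1 instead of recomputing it from the overall row counts.
      def updateNdvGuarded(oldNumRows: BigInt, newNumRows: BigInt,
          currentNdv: BigInt): BigInt = {
        if (currentNdv > 1) {
          updateNdvProportionally(oldNumRows, newNumRows, currentNdv)
        } else {
          currentNdv
        }
      }

      def main(args: Array[String]): Unit = {
        // Skewed column: the filter keeps 300 of 1000 rows overall, but the
        // histogram says only one distinct value survives, so the per-column
        // ndv was already set to 1 before this step.
        println(updateNdvProportionally(1000, 300, 10)) // old behavior: 3
        println(updateNdvGuarded(1000, 300, 1))         // new behavior: 1
      }
    }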