cloud-fan commented on pull request #31653:
URL: https://github.com/apache/spark/pull/31653#issuecomment-802140005
It's pretty late in my timezone now and we can have a video call tomorrow if
necessary. Feel free to email me (you can see my email address in my GitHub
profile)
> queryStagePreparationRules are run in reOptimize and CostEvaluator runs
after reOptimize
Ah, I did miss something. You are right about it. What the `CostEvaluator`
evaluates is: current physical plan -> logical plan -> run logical optimizer
rules -> run planner -> run preparation rules. This means that the
`CostEvaluator` runs right after the `OptimizeSkewedJoin` rule, then it makes
more sense to let the `CostEvaluator` decide if extra shuffles are accepted or
not.
We can follow the idea from @maryannxue , and create a
`SkewJoinAwareCostEvaluator` if the conf is enabled. This cost evaluator not
only tracks the number of shuffles, but also the number of skewed joins (by
checking `SortMergeJoinExec.isSkewJoin`). e.g.
```
case class SkewJoinAwareCost(numShuffles: Int, numSkewJoins: Int) {
override def compare(that: Cost): Int = that match {
case other: SkewJoinAwareCost =>
if (numSkewJoins > other.numSkewJoins) {
// If more skew joins are optimized, it means the cost is lower
-1
} else if (numShuffles < other.numShuffles) -1 else if (numShuffles >
other.numShuffles) 1 else 0
case _ =>
throw new IllegalArgumentException(s"Could not compare cost with
$that")
}
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]