cloud-fan commented on pull request #31653:
URL: https://github.com/apache/spark/pull/31653#issuecomment-802140005


   It's pretty late in my timezone now and we can have a video call tomorrow if 
necessary. Feel free to email me (you can see my email address in my GitHub 
profile)
   
   > queryStagePreparationRules are run in reOptimize and CostEvaluator runs 
after reOptimize
   
   Ah, I did miss something. You are right about it. What the `CostEvaluator` 
evaluates is: current physical plan -> logical plan -> run logical optimizer 
rules -> run planner -> run preparation rules. This means that the 
`CostEvaluator` runs right after the `OptimizeSkewedJoin` rule, then it makes 
more sense to let the `CostEvaluator` decide if extra shuffles are accepted or 
not.
   
   We can follow the idea from @maryannxue , and create a 
`SkewJoinAwareCostEvaluator` if the conf is enabled. This cost evaluator not 
only tracks the number of shuffles, but also the number of skewed joins (by 
checking `SortMergeJoinExec.isSkewJoin`). e.g.
   ```
   case class SkewJoinAwareCost(numShuffles: Int, numSkewJoins: Int) {
     override def compare(that: Cost): Int = that match {
       case other: SkewJoinAwareCost =>
         if (numSkewJoins > other.numSkewJoins) {
           // If more skew joins are optimized, it means the cost is lower
           -1
         } else if (numShuffles < other.numShuffles) -1 else if (numShuffles > 
other.numShuffles) 1 else 0
       case _ =>
         throw new IllegalArgumentException(s"Could not compare cost with 
$that")
     }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to