Re: [PR] [SPARK-47094][SQL] SPJ : Dynamically rebalance number of buckets when they are not equal [spark]

via GitHub Tue, 12 Mar 2024 06:10:54 -0700


advancedxy commented on code in PR #45267:
URL: https://github.com/apache/spark/pull/45267#discussion_r1521439073



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala:
##########
@@ -635,6 +636,22 @@ trait ShuffleSpec {
    */
   def createPartitioning(clustering: Seq[Expression]): Partitioning =
     throw SparkUnsupportedOperationException()
+
+  /**
+   * Return a set of [[Reducer]] for the partition expressions of this shuffle spec,
+   * on the partition expressions of another shuffle spec.
+   * <p>
+   * A [[Reducer]] exists for a partition expression function of this shuffle spec if it is
+   * 'reducible' on the corresponding partition expression function of the other shuffle spec.
+   * <p>
+   * If a value is returned, there must be one Option[[Reducer]] per partition expression.
+   * A None value in the set indicates that the particular partition expression is not reducible
+   * on the corresponding expression on the other shuffle spec.
+   * <p>
+   * Returning none also indicates that none of the partition expressions can be reduced on the
+   * corresponding expression on the other shuffle spec.
+   */
+  def reducers(spec: ShuffleSpec): Option[Seq[Option[Reducer[_]]]] = None

Review Comment:
   > For instance, geo bucketing functions as well. 
   
   Yes, of course, geo bucketing makes a lot of sense.
   
   > I think could be `hours(col)` vs `days(col)`
   
   I'm not sure about this use case. Theoretically, we can "reduce" the former
   into the latter, but reducing hours into days seems impractical to me.
   Suppose table A uses the `hours(col)` partition transform, table B uses the
   `days(col)` partition transform, and we join on `A.col = B.col`. If A's
   hourly partitions are reduced to daily partitions, one task has to process
   roughly 24 times as much partition data. Wouldn't a single hourly partition
   already be big enough from table A's perspective?
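
To make the ~24x concern concrete, here is a minimal sketch of what reducing hourly partition values to daily ones would look like. It assumes partition values are hours since the epoch, and every name in it (`SimpleReducer`, `HoursToDays`, `ReduceSketch`, `groupByDay`) is hypothetical and only for illustration. It is not the `Reducer` interface from this PR.

```scala
// Hypothetical stand-in for the "reducer" idea; NOT Spark's actual interface.
trait SimpleReducer[I, O] {
  def reduce(value: I): O
}

// Assumption: an hours(col) partition value is hours since the epoch,
// and a days(col) partition value is days since the epoch.
object HoursToDays extends SimpleReducer[Long, Long] {
  // floorDiv keeps negative (pre-1970) hour values in the correct day.
  def reduce(value: Long): Long = Math.floorDiv(value, 24L)
}

object ReduceSketch {
  // Groups hourly partition values by the daily partition they reduce to.
  def groupByDay(hourPartitions: Seq[Long]): Map[Long, Seq[Long]] =
    hourPartitions.groupBy(HoursToDays.reduce)

  def main(args: Array[String]): Unit = {
    // Two full days of hourly partitions for table A.
    val hours: Seq[Long] = 0L until 48L
    groupByDay(hours).toSeq.sortBy(_._1).foreach { case (day, hs) =>
      println(s"day=$day hourlyPartitions=${hs.size}")
    }
  }
}
```

In this sketch each daily group collects 24 hourly partitions, so a task that handles one reduced (daily) partition of table A would read the data of 24 of A's original partitions.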



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


