yushesp opened a new pull request, #53375:
URL: https://github.com/apache/spark/pull/53375

   ### What changes were proposed in this pull request?
   This PR adds a new repartition overload to Dataset[T] that accepts a key 
extraction function and a custom Partitioner, similar to RDD's partitionBy:
   ```scala
   def repartition[K: Encoder](keyFunc: T => K, partitioner: Partitioner): Dataset[T]
   ```
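
   A minimal usage sketch of the proposed overload (the `Sale` case class and `RegionPartitioner` below are illustrative only and are not part of this PR):
   ```scala
   import org.apache.spark.Partitioner
   import org.apache.spark.sql.{Dataset, SparkSession}

   object RepartitionByKeyExample {
     // Illustrative data type and partitioner; not part of this PR.
     case class Sale(region: String, amount: Double)

     // Routes each known region to a fixed partition; unknown regions go to partition 0.
     class RegionPartitioner(regions: Seq[String]) extends Partitioner {
       private val index = regions.zipWithIndex.toMap
       override def numPartitions: Int = regions.size
       override def getPartition(key: Any): Int =
         index.getOrElse(key.asInstanceOf[String], 0)
     }

     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("repartition-by-key").getOrCreate()
       import spark.implicits._

       val sales: Dataset[Sale] =
         Seq(Sale("EU", 10.0), Sale("US", 25.0), Sale("APAC", 7.5)).toDS()

       // With the proposed overload, the key function and the custom partitioner
       // are passed directly, without dropping down to sales.rdd.
       val repartitioned: Dataset[Sale] =
         sales.repartition((s: Sale) => s.region, new RegionPartitioner(Seq("EU", "US", "APAC")))

       repartitioned.explain()
       spark.stop()
     }
   }
   ```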
   
   ### Why are the changes needed?
   Currently, Dataset users who want custom partitioning logic must drop down 
to the RDD API, losing the benefits of Catalyst optimization and the typed 
Dataset API.
   
   Custom partitioning is useful when:
   - You want to co-partition two datasets by the same key so that joins don't 
require a shuffle
   - You need custom bucketing logic beyond what HashPartitioner provides
   
   The RDD API has supported custom partitioners via partitionBy since Spark's 
early days. This PR brings the same capability to the Dataset API.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Adds a new public API method to Dataset:
   ```scala
   def repartition[K: Encoder](keyFunc: T => K, partitioner: Partitioner): Dataset[T]
   ```
   
   
   ### How was this patch tested?
   Added unit tests in PlannerSuite.scala covering the basic functionality of the new API.
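
   For illustration, a sketch of the kind of test that could cover this API; the suite name, data, and assertions below are hypothetical and not taken from this PR:
   ```scala
   import org.apache.spark.HashPartitioner
   import org.apache.spark.sql.QueryTest
   import org.apache.spark.sql.test.SharedSparkSession

   // Illustrative only: checks that the new overload honors the partitioner's
   // numPartitions and keeps all rows with the same key in a single partition.
   class RepartitionWithPartitionerSuite extends QueryTest with SharedSparkSession {
     import testImplicits._

     test("repartition with key function and custom partitioner") {
       val ds = Seq.tabulate(100)(_.toLong).toDS()
       val repartitioned = ds.repartition((i: Long) => i % 10, new HashPartitioner(5))

       assert(repartitioned.rdd.getNumPartitions === 5)

       // No key should be spread across more than one partition.
       val keysPerPartition = repartitioned.rdd
         .mapPartitions(iter => Iterator(iter.map(_ % 10).toSet))
         .collect()
       assert(keysPerPartition.map(_.size).sum === keysPerPartition.flatten.distinct.length)
     }
   }
   ```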
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this 
patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling 
Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   Co-Generated-by: Cursor 2.1.46

