hanover-fiste opened a new pull request #32068: URL: https://github.com/apache/spark/pull/32068
### What changes were proposed in this pull request? Currently, the JDBCRelation columnPartition function orders the strides in ascending order, starting from the lower bound and working its way towards the upper bound. I'm proposing leaving that as the default, but adding a new option in JDBCOptions called strideOrder. Since it will default to the current behavior, this will keep people's current code working as expected. However, people who may have data skew closer to the upper bound might appreciate being able to have the strides in descending order, thus filling up the first partition with the last stride and so forth. Also, people with nondeterministic data skew or sporadic data density might be able to benefit from a random ordering of the strides. ### Why are the changes needed? It provides options to the end-user to have more control over the JDBC partitioning. ### Does this PR introduce _any_ user-facing change? No. By default, the code will order strides/partitions in the same order it has before this PR (ascending). ### How was this patch tested? I've created two new unit tests. ### Example Say we have 11 partitions, 5 cores total, and the last partition has triple the amount of data as the first 10 due to data skew. We'll say partitions 1 through 10 take 10 minutes a core, so partition 11 would take 30 minutes on a core. Current stride order of **ascending**: _Group 1_ 5 cores x 10 minutes = 10 minutes (all running concurrently) total of 10 minutes _Group 2_ 5 x 10 = 10 (all running concurrently) total of 10 minutes _Group 3_ 1 x 30 = 30 4 cores sitting idle total of 30 minutes Group 1 (10) + Group 2 (10) + Group 3 (30) **Total of 50 minutes** Stride order of **descending**: _Group 1_ 1 core x 30 minutes = 30 minutes 4 x 10 = 10 (running concurrently with the first core) 4 x 10 = 10 (running concurrently with the first core) 2 x 10 = 10 (running concurrently with the first core) (running concurrently) total of 30 minutes **Total of 30 minutes** -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
