hanover-fiste opened a new pull request #32068:
URL: https://github.com/apache/spark/pull/32068


   ### What changes were proposed in this pull request?
   Currently, the JDBCRelation columnPartition function orders the strides in 
ascending order, starting from the lower bound and working its way towards the 
upper bound.
   
   I'm proposing leaving that as the default, but adding a new option in 
JDBCOptions called strideOrder. Since it will default to the current behavior, 
this will keep people's current code working as expected. However, people who 
may have data skew closer to the upper bound might appreciate being able to 
have the strides in descending order, thus filling up the first partition with 
the last stride and so forth. Also, people with nondeterministic data skew or 
sporadic data density might be able to benefit from a random ordering of the 
strides.
   
   
   ### Why are the changes needed?
   It provides options to the end-user to have more control over the JDBC 
partitioning.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No. By default, the code will order strides/partitions in the same order it 
has before this PR (ascending).
   
   
   ### How was this patch tested?
   I've created two new unit tests.
   
   ### Example
   Say we have 11 partitions, 5 cores total, and the last partition has triple 
the amount of data as the first 10 due to data skew. We'll say partitions 1 
through 10 take 10 minutes a core, so partition 11 would take 30 minutes on a 
core.
   
   Current stride order of **ascending**:
   
   _Group 1_
   5 cores x 10 minutes = 10 minutes 
   
   (all running concurrently) total of 10 minutes
   
   _Group 2_
   5 x 10 = 10 
   
   (all running concurrently) total of 10 minutes
   
   _Group 3_
   1 x 30 = 30
   
   4 cores sitting idle
   
   total of 30 minutes
   
   Group 1 (10) + Group 2 (10) + Group 3 (30)
   
   **Total of 50 minutes**
   
   
   Stride order of **descending**:
   
   _Group 1_
   1 core x 30 minutes = 30 minutes 
   4 x 10 = 10 (running concurrently with the first core)
   4 x 10 = 10 (running concurrently with the first core)
   2 x 10 = 10 (running concurrently with the first core)
   
   (running concurrently) total of 30 minutes
   
   **Total of 30 minutes**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to