Jason Yarbrough created SPARK-34910:
---------------------------------------

             Summary: Add an option for different stride orders
                 Key: SPARK-34910
                 URL: https://issues.apache.org/jira/browse/SPARK-34910
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Jason Yarbrough


Currently, the JDBCRelation columnPartition function orders the strides in 
ascending order, starting from the lower bound and working its way towards the 
upper bound.

I'm proposing leaving that as the default, but adding an option (such as 
strideOrder) in JDBCOptions. Since it will default to the current behavior, 
this will keep people's current code working as expected. However, people who 
may have data skew closer to the upper bound might appreciate being able to 
have the strides in descending order, thus filling up the first partition with 
the last stride and so forth. Also, people with nondeterministic data skew or 
sporadic data density might be able to benefit from a random ordering of the 
strides.
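
To make the three orderings concrete, here is a minimal Python sketch. It is
not Spark's actual columnPartition code: the helper names (make_strides,
order_strides), the stride_order values, and the simplified bounds arithmetic
are all assumptions for illustration, and it assumes upper - lower is at least
num_partitions.

```python
import random
from typing import List, Optional, Tuple

# (inclusive lower bound, exclusive upper bound); None means unbounded.
Stride = Tuple[Optional[int], Optional[int]]


def make_strides(lower: int, upper: int, num_partitions: int) -> List[Stride]:
    """Hypothetical helper: split the range into strides in ascending order.

    The first and last strides are left open-ended so that values outside
    [lower, upper) still land in some partition.
    """
    step = (upper - lower) // num_partitions
    bounds = [lower + i * step for i in range(1, num_partitions)]
    edges = [None] + bounds + [None]
    return list(zip(edges[:-1], edges[1:]))


def order_strides(strides: List[Stride],
                  stride_order: str = "ascending",
                  seed: Optional[int] = None) -> List[Stride]:
    """Hypothetical helper: return the strides in the requested order.

    Position i in the result is the stride assigned to partition i.
    """
    if stride_order == "ascending":
        return list(strides)  # current behavior, kept as the default
    if stride_order == "descending":
        return list(reversed(strides))  # last stride fills the first partition
    if stride_order == "random":
        shuffled = list(strides)
        random.Random(seed).shuffle(shuffled)  # seed only for reproducibility
        return shuffled
    raise ValueError(f"unknown stride order: {stride_order!r}")


strides = make_strides(0, 100, 4)
print(order_strides(strides, "descending"))
```

Only the assignment of strides to partitions changes; the set of strides, and
therefore the rows read, stays the same under every ordering.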

I have written code that implements this, and it establishes a pattern that can 
be used to add other algorithms later (such as counting the rows in each 
stride, ranking the strides, and then ordering them from most dense to least). 
The two options I have coded so far are 'descending' and 'random'.
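
One way such an extensible pattern could look is a registry that maps an option
value to an ordering function. The Python sketch below is only an illustration
under assumptions: STRIDE_ORDERERS, register_orderer, and by_density are
hypothetical names, and the row counts are made-up inputs that in practice
would have to come from a query against the database.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Stride = Tuple[int, int]  # (inclusive lower bound, exclusive upper bound)
Orderer = Callable[[Sequence[Stride]], List[Stride]]

# Hypothetical registry: option value -> function that reorders the strides.
STRIDE_ORDERERS: Dict[str, Orderer] = {
    "ascending": lambda strides: list(strides),
    "descending": lambda strides: list(reversed(strides)),
    "random": lambda strides: random.sample(list(strides), len(strides)),
}


def register_orderer(name: str, orderer: Orderer) -> None:
    """Add a new ordering algorithm without touching the existing ones."""
    STRIDE_ORDERERS[name] = orderer


def by_density(row_counts: Dict[Stride, int]) -> Orderer:
    """Build an orderer that ranks strides from most rows to fewest.

    row_counts is assumed to have been collected beforehand; strides that
    are missing from it are treated as empty.
    """
    def order(strides: Sequence[Stride]) -> List[Stride]:
        return sorted(strides, key=lambda s: row_counts.get(s, 0), reverse=True)
    return order


strides = [(0, 25), (25, 50), (50, 75), (75, 100)]
counts = {(0, 25): 5, (25, 50): 40, (50, 75): 1, (75, 100): 10}  # made-up data
register_orderer("density", by_density(counts))
print(STRIDE_ORDERERS["density"](strides))
```

With a registry like this, a new algorithm is a single registration call, and
the default 'ascending' entry preserves the current behavior.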

The original idea was to create something closer to Spark's hash partitioner, 
but for JDBC and pushed down to the database engine for efficiency. However, 
that would require adding hashing algorithms for each dialect, and the 
performance cost of those algorithms may outweigh the benefit. The method I'm 
proposing in this ticket avoids those complexities while still giving some of 
the benefit (in the case of random ordering).

I'll put a PR in if others feel this is a good idea.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
