Jason Yarbrough created SPARK-34910:
---------------------------------------
Summary: Add an option for different stride orders
Key: SPARK-34910
URL: https://issues.apache.org/jira/browse/SPARK-34910
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.2.0
Reporter: Jason Yarbrough
Currently, the JDBCRelation columnPartition function orders the strides in
ascending order, starting from the lower bound and working its way towards the
upper bound.
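
As a rough illustration of the ascending behavior described above, here is a simplified Python sketch. It is not Spark's actual Scala code: the function name, the integer stride arithmetic, and the exact predicate text are assumptions made for this example.

```python
from typing import List


def ascending_stride_predicates(column: str, lower: int, upper: int,
                                num_partitions: int) -> List[str]:
    """Build one WHERE predicate per partition, from lower towards upper.

    Simplified sketch: the first and last strides are left open-ended so
    that rows outside [lower, upper) are still read by some partition.
    """
    if num_partitions < 2:
        raise ValueError("this sketch expects at least two partitions")
    stride = (upper - lower) // num_partitions
    predicates = []
    bound = lower
    for i in range(num_partitions):
        lo = None if i == 0 else bound
        bound += stride
        hi = None if i == num_partitions - 1 else bound
        if lo is None:
            predicates.append(f"{column} < {hi} OR {column} IS NULL")
        elif hi is None:
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates


for p in ascending_stride_predicates("id", 0, 100, 4):
    print(p)
# id < 25 OR id IS NULL
# id >= 25 AND id < 50
# id >= 50 AND id < 75
# id >= 75
```

Partition 0 always receives the stride nearest the lower bound, which is the ordering this ticket proposes to make configurable.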
I'm proposing leaving that as the default, but adding an option (such as
strideOrder) in JDBCOptions. Since it will default to the current behavior,
this will keep people's current code working as expected. However, people who
may have data skew closer to the upper bound might appreciate being able to
have the strides in descending order, thus filling up the first partition with
the last stride and so forth. Also, people with nondeterministic data skew or
sporadic data density might be able to benefit from a random ordering of the
strides.
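
A minimal Python sketch of what such an option could do, assuming the strides have already been computed as (lower, upper) pairs. The function name, the 'ascending' default value, and the seed parameter are hypothetical and exist only for this illustration; they are not taken from the proposed code.

```python
import random
from typing import List, Optional, Sequence, Tuple

Stride = Tuple[int, int]  # (inclusive lower, exclusive upper), for illustration


def order_strides(strides: Sequence[Stride],
                  stride_order: str = "ascending",
                  seed: Optional[int] = None) -> List[Stride]:
    """Return the strides in the order they would be assigned to partitions."""
    ordered = list(strides)  # copy so the caller's sequence is not modified
    if stride_order == "ascending":
        return ordered  # default: same as the current behavior
    if stride_order == "descending":
        ordered.reverse()  # last stride goes to the first partition
        return ordered
    if stride_order == "random":
        random.Random(seed).shuffle(ordered)  # spreads skew unpredictably
        return ordered
    raise ValueError(f"unknown stride order: {stride_order!r}")


strides = [(0, 25), (25, 50), (50, 75), (75, 100)]
print(order_strides(strides, "descending"))
# [(75, 100), (50, 75), (25, 50), (0, 25)]
```

Only the assignment of strides to partitions changes; the set of rows read is the same under every ordering.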
I have the code created to implement this, and it creates a pattern that can be
used to add other algorithms that people may want to add (such as counting the
rows and ranking each stride, and then ordering from most dense to least). The
two options I have coded so far are 'descending' and 'random'.
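
One way such an extensible pattern might look, sketched in Python under stated assumptions: a lookup table maps each option value to an ordering function, so a new algorithm is just a new entry. All names here (STRIDE_ORDERERS, apply_stride_order, the 'density' value) are hypothetical, and the per-stride row counts are assumed to have been gathered beforehand, for example with one COUNT query per stride.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Stride = Tuple[int, int]
RowCounts = Dict[Stride, int]
Orderer = Callable[[Sequence[Stride], RowCounts], List[Stride]]


def descending(strides: Sequence[Stride], row_counts: RowCounts) -> List[Stride]:
    # Row counts are ignored; the strides are simply reversed.
    return list(reversed(strides))


def most_dense_first(strides: Sequence[Stride], row_counts: RowCounts) -> List[Stride]:
    # Rank strides by row count, densest first. sorted() is stable, so
    # strides with equal counts keep their original relative order.
    return sorted(strides, key=lambda s: row_counts.get(s, 0), reverse=True)


# Adding another algorithm means adding another entry to this table.
STRIDE_ORDERERS: Dict[str, Orderer] = {
    "descending": descending,
    "density": most_dense_first,
}


def apply_stride_order(strides: Sequence[Stride], stride_order: str,
                       row_counts: Optional[RowCounts] = None) -> List[Stride]:
    try:
        orderer = STRIDE_ORDERERS[stride_order]
    except KeyError:
        raise ValueError(f"unknown stride order: {stride_order!r}") from None
    return orderer(strides, row_counts or {})


strides = [(0, 25), (25, 50), (50, 75), (75, 100)]
counts = {(0, 25): 10, (25, 50): 500, (50, 75): 10, (75, 100): 90}
print(apply_stride_order(strides, "density", counts))
# [(25, 50), (75, 100), (0, 25), (50, 75)]
```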
The original idea was to create something closer to Spark's hash partitioner,
but for JDBC and pushed down to the database engine for efficiency. However,
that would require adding hashing algorithms for each dialect, and the
performance cost of those algorithms may outweigh the benefit. The method I'm
proposing in this ticket avoids those complexities while still giving some of
the benefit (in the case of random ordering).
I'll put a PR in if others feel this is a good idea.