[ 
https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Yarbrough updated SPARK-34910:
------------------------------------
    Description: 
Currently, the JDBCRelation columnPartition function orders the strides in 
ascending order, starting from the lower bound and working its way towards the 
upper bound.

I'm proposing leaving that as the default, but adding an option (such as 
strideOrder) in JDBCOptions. Since it would default to the current behavior, 
existing code would keep working as expected. However, users whose data skews 
toward the upper bound may appreciate being able to have the strides in 
descending order, which fills the first partition with the last stride, and so 
on. Likewise, users with nondeterministic data skew or sporadic data density 
might benefit from a random ordering of the strides.

I have already written the code to implement this, and it establishes a pattern 
that can be extended with other ordering algorithms (such as counting the rows 
in each stride, ranking the strides, and then ordering from most dense to 
least). The two options I have coded so far are 'descending' and 'random'.

The original idea was to create something closer to Spark's hash partitioner, 
but for JDBC, with the work pushed down to the database engine for efficiency. 
However, that would require adding hashing expressions for each dialect, and 
the performance cost of those expressions might outweigh the benefit. The 
approach I'm proposing in this ticket avoids those complexities while still 
providing some of the benefit (in the case of random ordering).
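The reordering step can be sketched in isolation. This is only an illustrative 
sketch: the strideOrder values and the reorderStrides helper are hypothetical 
and not part of Spark's actual JDBCOptions or JDBCRelation API; it just shows 
how the stride predicates computed by columnPartition could be reordered before 
partitions are built.

```scala
import scala.util.Random

object StrideOrderSketch {
  // Reorder the stride WHERE-clause predicates according to a proposed
  // "strideOrder" setting. "ascending" (or anything unrecognized here)
  // keeps Spark's current behavior: lower bound first.
  def reorderStrides(strides: Seq[String],
                     order: String,
                     seed: Long = 42L): Seq[String] =
    order match {
      case "descending" => strides.reverse                    // skew near the upper bound
      case "random"     => new Random(seed).shuffle(strides)  // sporadic data density
      case _            => strides                            // default: current ascending order
    }

  def main(args: Array[String]): Unit = {
    val strides = Seq("id < 100", "id >= 100 AND id < 200", "id >= 200")
    // With "descending", the last stride lands in the first partition.
    println(reorderStrides(strides, "descending"))
  }
}
```

Because each branch only permutes the already-computed predicates, every row is 
still covered by exactly one partition regardless of the ordering chosen.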



> JDBC - Add an option for different stride orders
> ------------------------------------------------
>
>                 Key: SPARK-34910
>                 URL: https://issues.apache.org/jira/browse/SPARK-34910
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Jason Yarbrough
>            Priority: Trivial



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
