[ 
https://issues.apache.org/jira/browse/SPARK-40485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40485:
------------------------------------

    Assignee: Apache Spark

> Extend the partitioning options of the JDBC data source
> -------------------------------------------------------
>
>                 Key: SPARK-40485
>                 URL: https://issues.apache.org/jira/browse/SPARK-40485
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Luca Canali
>            Assignee: Apache Spark
>            Priority: Minor
>
> This proposes to extend the available partitioning options for the JDBC data 
> source.
> Partitioning options allow data to be read by multiple workers connected to 
> the target RDBMS. This can improve the performance of data extraction under 
> the right circumstances.
> Currently, the only available partitioning and parallelization option for 
> reading from databases is to specify lowerBound and upperBound together with 
> numPartitions and partitionColumn. The Spark JDBC data source then uses 
> multiple partitions, and thus multiple workers, to read from the RDBMS.
> This proposes to add a similar but complementary mechanism for 
> partitioning, where a user-provided list of values is used to compute the 
> target partitions.
> This provides a way to split the data extraction work among workers so that 
> it can be aligned with the database's physical (partitioned and/or indexed) 
> structure, as in the following example:
> {code:java}
> option("partitionColumn", "region").
> option("numPartitions", 3).
> option("partitionColValues", "'eastern', 'central', 'western'").  {code}
> This feature is motivated by performance: it aims to scale and speed up data 
> extraction from:
>  - list-partitioned tables, available in Oracle and PostgreSQL
>  - tables stored in B*Tree indexes, such as Oracle's IOTs (Index Organized 
> Tables) and SQL Server's clustered indexes.
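
To make the proposed mechanism concrete, here is a minimal sketch of how a list of values might be turned into one SQL predicate per partition. It is not the implementation attached to this issue. The helper names (parse_values, list_partition_predicates) are hypothetical, and dealing the values round-robin across partitions when there are more values than partitions is an assumption, since the description does not say how that case is handled.

```python
from typing import List


def parse_values(raw: str) -> List[str]:
    """Split a comma-separated string of SQL literals into a list.

    Assumption: the literals contain no embedded commas, so a naive
    split is enough for this illustration.
    """
    return [v.strip() for v in raw.split(",") if v.strip()]


def list_partition_predicates(
    column: str, values: List[str], num_partitions: int
) -> List[str]:
    """Build one WHERE-clause predicate per partition from a list of literals.

    Assumption: values are dealt round-robin into at most num_partitions
    groups; the proposal does not specify this grouping rule.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Never create more groups than there are values.
    n = min(num_partitions, len(values))
    groups = [values[i::n] for i in range(n)]
    return [f"{column} IN ({', '.join(group)})" for group in groups]


if __name__ == "__main__":
    values = parse_values("'eastern', 'central', 'western'")
    for predicate in list_partition_predicates("region", values, 3):
        print(predicate)
```

With three values and numPartitions set to 3, each partition reads exactly one region, which matches the list-partitioned layout described above. For experimentation, a list of predicates like this can be passed to the existing predicates argument of PySpark's DataFrameReader.jdbc, which creates one partition per predicate.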

--
This message was sent by Atlassian Jira
(v8.20.10#820010)