[GitHub] [spark] LucaCanali opened a new pull request, #37928: [SPARK-40485][SQL] Extend the partitioning options of the JDBC data source

GitBox Mon, 19 Sep 2022 01:27:04 -0700


LucaCanali opened a new pull request, #37928:
URL: https://github.com/apache/spark/pull/37928


   ### What changes were proposed in this pull request?
   This proposes to extend the available partitioning options for the JDBC data 
source.
   
   ### Why are the changes needed?
   Partitioning options allow data to be read in parallel by multiple workers 
connected to the target RDBMS. This can improve the performance of data 
extraction, under the right circumstances.
   
   Currently the only available partitioning and parallelization option for 
reading from databases is to specify lowerBound, upperBound, together with 
numPartitions and partitionColumn. The Spark JDBC data source will then use 
multiple partitions, and thus workers, to read from the RDBMS.   
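
To illustrate the existing range-based mechanism, here is a minimal sketch of how the bounds and partition count can be turned into one WHERE-clause predicate per partition. It is a simplified approximation written for this description, not Spark's actual code: the function name `range_partition_predicates` is hypothetical, only integer columns are handled, and the stride arithmetic is deliberately simpler than what the data source does internally.

```python
def range_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one SQL predicate per partition for an integer partition column.

    Hypothetical helper: a simplified approximation of stride-based splitting,
    not the code used by the Spark JDBC data source.
    """
    if num_partitions < 2:
        raise ValueError("num_partitions must be at least 2 to split the read")
    if upper_bound - lower_bound < num_partitions:
        raise ValueError("the bound range must be at least num_partitions wide")

    # Width of each partition's value range (integer division, so the last
    # partition absorbs any remainder).
    stride = (upper_bound - lower_bound) // num_partitions

    predicates = []
    boundary = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {boundary}" if i > 0 else None
        boundary += stride
        upper = f"{column} < {boundary}" if i < num_partitions - 1 else None

        if lower is None:
            # First partition is open-ended below and also picks up NULLs.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            # Last partition is open-ended above.
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


if __name__ == "__main__":
    for predicate in range_partition_predicates("id", 0, 100, 4):
        print(predicate)
```

Note that in this scheme the bounds only determine the stride; they do not filter rows, so values below lowerBound or above upperBound still land in the first or last partition.
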
   This PR proposes to add a similar but complementary mechanism for 
partitioning, where a user-provided list of values is used to compute the 
target partitions.   
   This provides a way to split the data extraction work among workers so that 
it can be aligned with the database's physical (partitioned and/or indexed) 
structure, as in the following example:  
   ```
   option("partitionColumn", "region").
   option("numPartitions", 3).
   option("partitionColValues", "'eastern', 'central', 'western'").  
   ```
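
As a rough illustration of the idea, the sketch below shows one plausible way a value list like the one above could be mapped to per-partition predicates. This is an assumption made for illustration only: the helper names `parse_values` and `list_partition_predicates` are hypothetical, the round-robin grouping is a guess at reasonable behavior when there are more values than partitions, and none of it is taken from the PR's implementation.

```python
def parse_values(raw):
    """Split a comma-separated string of SQL literals into a list.

    Hypothetical helper. The naive split assumes no literal contains a comma.
    """
    return [value.strip() for value in raw.split(",") if value.strip()]


def list_partition_predicates(column, values, num_partitions):
    """Return one SQL predicate per partition from a list of SQL literals.

    Hypothetical helper: values are spread round-robin over the partitions,
    and each partition reads the rows matching its group of values.
    """
    if not values:
        raise ValueError("at least one partition value is required")
    if num_partitions < 1:
        raise ValueError("num_partitions must be positive")

    # Never create more partitions than there are values.
    n = min(num_partitions, len(values))
    groups = [values[i::n] for i in range(n)]
    return [f"{column} IN ({', '.join(group)})" for group in groups]


if __name__ == "__main__":
    values = parse_values("'eastern', 'central', 'western'")
    for predicate in list_partition_predicates("region", values, 3):
        print(predicate)
```

Under this sketch, rows whose column value is NULL or absent from the list would not be read by any partition; how the proposed option treats such rows is not stated here. Predicates of this shape are also similar to what the `predicates` argument of `DataFrameReader.jdbc` accepts.
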
   
   This feature is motivated by performance: it helps to scale and speed up 
data extraction from:
    - list-partitioned tables, available in Oracle and PostgreSQL
    - tables stored in B*Tree indexes, such as Oracle's IOTs (Index Organized 
Tables) and SQL Server's clustered indexes.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, this adds the option "partitionColValues" to the JDBC data source.
   
   ### How was this patch tested?
   Added tests to JDBCSuite and JDBCV2Suite.
   Also manually tested against list-partitioned tables in Oracle.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

