[
https://issues.apache.org/jira/browse/SPARK-22299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-22299.
----------------------------------
Resolution: Incomplete
> Use OFFSET and LIMIT for JDBC DataFrameReader striping
> ------------------------------------------------------
>
> Key: SPARK-22299
> URL: https://issues.apache.org/jira/browse/SPARK-22299
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0
> Reporter: Zack Behringer
> Priority: Minor
> Labels: bulk-closed
>
> Loading a large table (300M rows) over JDBC can be partitioned into tasks
> using the columnName, numPartitions, lowerBound and upperBound parameters of
> DataFrameReader.jdbc(), but that becomes troublesome if the column is
> skewed/fragmented (e.g. somebody used a global sequence for the partition
> column instead of a sequence specific to the table, or the table became
> fragmented by deletes, etc.).
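>
> For reference, the stock striping call looks roughly like this (a minimal
> sketch; the URL, table name, credentials and bounds are placeholders):
>
>   import java.util.Properties
>
>   val props = new Properties()
>   props.setProperty("user", "spark")       // placeholder credentials
>   props.setProperty("password", "secret")
>
>   // Spark splits the [lowerBound, upperBound] range into numPartitions
>   // stripes on the partition column; skew in that column translates
>   // directly into skewed tasks.
>   val df = spark.read.jdbc(
>     url = "jdbc:postgresql://host/db",   // placeholder URL
>     table = "big_table",                 // placeholder table
>     columnName = "id",                   // the (possibly skewed) column
>     lowerBound = 1L,
>     upperBound = 300000000L,
>     numPartitions = 16,
>     connectionProperties = props)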
> This can be worked around by using a modulus operation on the column, but
> that will be slow unless there is already an index on the modulus
> expression with the exact numPartitions value, so it doesn't scale well if
> you want to change the number of partitions. Another way would be to use an
> expression index on a hash of the partition column, but I'm not sure if JDBC
> striping is smart enough to create hash ranges for each stripe using hashes
> of the lower and upper bound parameters. If it is, that is great, but it
> still requires a very large index just for this use case.
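>
> For what it's worth, the modulus workaround can be spelled today with the
> predicates overload of DataFrameReader.jdbc() (a sketch reusing the
> placeholder names above, with PostgreSQL's mod() function; as noted, it
> only performs well if an expression index matches this exact n):
>
>   val n = 16
>   // Each predicate becomes the WHERE clause of one partition's query.
>   val predicates = (0 until n).map(i => s"mod(id, $n) = $i").toArray
>   val byMod = spark.read.jdbc(
>     "jdbc:postgresql://host/db", "big_table", predicates, props)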
> A less invasive approach would be to use the table's physical ordering along
> with OFFSET and LIMIT, so that only the total number of records to read would
> need to be known beforehand in order to distribute the rows evenly; no
> indexes would be needed. I realize that OFFSET and LIMIT are not standard SQL
> keywords (the SQL:2008 form is OFFSET ... FETCH FIRST ... ROWS ONLY, and
> support varies by database).
> I also see that a list of custom predicates can be passed instead. I haven't
> tried whether I can embed numPartitions-specific predicates, each with its
> own OFFSET and LIMIT range.
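>
> An untested sketch of that combination, relying on the fact that each
> predicate is appended to the generated query's WHERE clause (PostgreSQL
> accepts LIMIT/OFFSET after WHERE, but other dialects may reject the
> generated SQL, and without an ORDER BY the stripes are only consistent if
> the physical row order stays stable for the whole read):
>
>   val total  = 300000000L               // assumes the row count is known
>   val n      = 16
>   val stripe = (total + n - 1) / n      // ceiling division
>   // "1=1" keeps each predicate a valid WHERE clause before the LIMIT.
>   val ranges = (0 until n)
>     .map(i => s"1=1 LIMIT $stripe OFFSET ${i * stripe}")
>     .toArray
>   val byOffset = spark.read.jdbc(
>     "jdbc:postgresql://host/db", "big_table", ranges, props)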
> Some relational databases take quite a long time to count the number of
> records in order to determine the stripe size, though, so this can also be
> troublesome. Could a feature similar to "spark.sql.files.maxRecordsPerFile"
> be used in conjunction with the number of executors to read manageable
> batches (internally using OFFSET and LIMIT) until there are no more available
> results?
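>
> A rough driver-side sketch of that batching idea (everything here is an
> assumption: the batch size, the placeholder names, the subquery-as-table
> trick, and the sequential loop standing in for a real built-in feature):
>
>   import org.apache.spark.sql.DataFrame
>   import scala.collection.mutable.ArrayBuffer
>
>   val url       = "jdbc:postgresql://host/db"  // placeholder URL
>   val batchSize = 1000000L                     // assumed batch size
>   val batches   = ArrayBuffer.empty[DataFrame]
>   var offset    = 0L
>   var done      = false
>   while (!done) {
>     // Spark accepts a parenthesized subquery (with an alias) as the table.
>     val q = s"(SELECT * FROM big_table LIMIT $batchSize OFFSET $offset) AS b"
>     val batch = spark.read.jdbc(url, q, props).cache()
>     val n = batch.count()
>     if (n > 0) { batches += batch; offset += n }
>     done = n < batchSize   // a short (or empty) batch means we are finished
>   }
>   val all = batches.reduce(_ union _)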