[
https://issues.apache.org/jira/browse/SPARK-22299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-22299.
----------------------------------
Resolution: Incomplete
> Use OFFSET and LIMIT for JDBC DataFrameReader striping
> ------------------------------------------------------
>
> Key: SPARK-22299
> URL: https://issues.apache.org/jira/browse/SPARK-22299
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0
> Reporter: Zack Behringer
> Priority: Minor
> Labels: bulk-closed
>
> Loading a large table (300M rows) over JDBC can be partitioned into tasks
> using the columnName, numPartitions, lowerBound and upperBound parameters of
> DataFrameReader.jdbc(), but that becomes troublesome if the column is
> skewed/fragmented (e.g. somebody used a global sequence for the partition
> column instead of a sequence specific to the table, or the table became
> fragmented by deletes, etc.).
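>
> For reference, the stock striping call looks roughly like this (a minimal
> sketch; the URL, table name, credentials and bounds are placeholders):
>
>   import java.util.Properties
>
>   val props = new Properties()
>   props.setProperty("user", "spark")       // placeholder credentials
>   props.setProperty("password", "secret")
>
>   // Spark splits the [lowerBound, upperBound] range into numPartitions
>   // stripes on the partition column; skew in that column translates
>   // directly into skewed tasks.
>   val df = spark.read.jdbc(
>     url = "jdbc:postgresql://host/db",   // placeholder URL
>     table = "big_table",                 // placeholder table
>     columnName = "id",                   // the (possibly skewed) column
>     lowerBound = 1L,
>     upperBound = 300000000L,
>     numPartitions = 16,
>     connectionProperties = props)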
> This can be worked around by using a modulus operation on the column, but
> that will be slow unless there is already an index on the modulus
> expression with the exact numPartitions value, so it doesn't scale well if
> you want to change the number of partitions. Another way would be to use an
> expression index on a hash of the partition column, but I'm not sure if JDBC
> striping is smart enough to create hash ranges for each stripe using hashes
> of the lower and upper bound parameters. If it is, that is great, but it
> still requires a very large index just for this use case.
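>
> For what it's worth, the modulus workaround can be spelled today with the
> predicates overload of DataFrameReader.jdbc() (a sketch reusing the
> placeholder names above, with PostgreSQL's mod() function; as noted, it
> only performs well if an expression index matches this exact n):
>
>   val n = 16
>   // Each predicate becomes the WHERE clause of one partition's query.
>   val predicates = (0 until n).map(i => s"mod(id, $n) = $i").toArray
>   val byMod = spark.read.jdbc(
>     "jdbc:postgresql://host/db", "big_table", predicates, props)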
> A less invasive approach would be to use the table's physical ordering along
> with OFFSET and LIMIT, so that only the total number of records to read would
> need to be known beforehand in order to distribute the rows evenly; no
> indexes would be needed. I realize that OFFSET and LIMIT are not standard SQL
> keywords (the SQL:2008 form is OFFSET ... FETCH FIRST ... ROWS ONLY, and
> support varies by database).
> I also see that a list of custom predicates can be passed instead. I haven't
> tried whether I can embed numPartitions-specific predicates, each with its
> own OFFSET and LIMIT range.
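>
> An untested sketch of that combination, relying on the fact that each
> predicate is appended to the generated query's WHERE clause (PostgreSQL
> accepts LIMIT/OFFSET after WHERE, but other dialects may reject the
> generated SQL, and without an ORDER BY the stripes are only consistent if
> the physical row order stays stable for the whole read):
>
>   val total  = 300000000L               // assumes the row count is known
>   val n      = 16
>   val stripe = (total + n - 1) / n      // ceiling division
>   // "1=1" keeps each predicate a valid WHERE clause before the LIMIT.
>   val ranges = (0 until n)
>     .map(i => s"1=1 LIMIT $stripe OFFSET ${i * stripe}")
>     .toArray
>   val byOffset = spark.read.jdbc(
>     "jdbc:postgresql://host/db", "big_table", ranges, props)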
> Some relational databases take quite a long time to count the number of
> records in order to determine the stripe size, though, so this can also be
> troublesome. Could a feature similar to "spark.sql.files.maxRecordsPerFile"
> be used in conjunction with the number of executors to read manageable
> batches (internally using OFFSET and LIMIT) until there are no more available
> results?
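>
> A rough driver-side sketch of that batching idea (everything here is an
> assumption: the batch size, the placeholder names, the subquery-as-table
> trick, and the sequential loop standing in for a real built-in feature):
>
>   import org.apache.spark.sql.DataFrame
>   import scala.collection.mutable.ArrayBuffer
>
>   val url       = "jdbc:postgresql://host/db"  // placeholder URL
>   val batchSize = 1000000L                     // assumed batch size
>   val batches   = ArrayBuffer.empty[DataFrame]
>   var offset    = 0L
>   var done      = false
>   while (!done) {
>     // Spark accepts a parenthesized subquery (with an alias) as the table.
>     val q = s"(SELECT * FROM big_table LIMIT $batchSize OFFSET $offset) AS b"
>     val batch = spark.read.jdbc(url, q, props).cache()
>     val n = batch.count()
>     if (n > 0) { batches += batch; offset += n }
>     done = n < batchSize   // a short (or empty) batch means we are finished
>   }
>   val all = batches.reduce(_ union _)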