GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/14907
[SPARK-17351] Refactor JDBCRDD to expose ResultSet -> Seq[Row] utility methods
This patch refactors the internals of the JDBC data source in order to
allow some of its code to be re-used in an automated comparison testing
harness. Here are the key changes:
- Move the JDBC `ResultSetMetaData` to `StructType` conversion logic from
`JDBCRDD.resolveTable()` to the `JdbcUtils` object (as a new
`getSchema(ResultSet, JdbcDialect)` method), allowing it to be applied on
`ResultSet`s that are created elsewhere.
- Move the `ResultSet` to `InternalRow` conversion methods from `JDBCRDD`
to `JdbcUtils`:
  - It makes sense to move the `JDBCValueGetter` type and `makeGetter`
    functions here given that their write-path counterparts (`JDBCValueSetter`) are
    already in `JdbcUtils`.
  - Add an internal `resultSetToSparkInternalRows` method which takes a
    `ResultSet` and schema and returns an `Iterator[InternalRow]`. This effectively
    extracts the main loop of `JDBCRDD` into its own method.
- Add a public `resultSetToRows` method to `JdbcUtils`, which wraps the
minimal machinery around `resultSetToSparkInternalRows` in order to allow it to
be called from outside of a Spark job.
- Make `JdbcDialects.get` into a `DeveloperApi` (`JdbcDialect` itself is
already a `DeveloperApi`).
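
As a rough illustration of the getter-based loop described in the list above, here is a minimal sketch that uses only `java.sql` types. It is not the code in this patch: `ValueGetter`, the signature of `makeGetter`, and `resultSetToSeqs` are hypothetical names chosen for the example, only a few column types are handled, and rows are plain `Seq[Any]` values rather than `InternalRow`s.

```scala
import java.sql.{ResultSet, Types}

object ResultSetSketch {
  // Hypothetical getter type: reads the 1-based column `pos` of the current row.
  type ValueGetter = (ResultSet, Int) => Any

  // Choose a getter once per column, based on its java.sql.Types code,
  // so the per-row loop does no type dispatch.
  def makeGetter(sqlType: Int): ValueGetter = sqlType match {
    case Types.INTEGER =>
      (rs, pos) => { val v = rs.getInt(pos); if (rs.wasNull()) null else v }
    case Types.DOUBLE =>
      (rs, pos) => { val v = rs.getDouble(pos); if (rs.wasNull()) null else v }
    case Types.VARCHAR =>
      (rs, pos) => rs.getString(pos)
    case _ =>
      // Fallback for this sketch; a real mapping would be dialect-aware.
      (rs, pos) => rs.getObject(pos)
  }

  // Wrap a ResultSet in an Iterator that yields one Seq[Any] per row.
  def resultSetToSeqs(rs: ResultSet): Iterator[Seq[Any]] = {
    val md = rs.getMetaData
    val getters = (1 to md.getColumnCount).map(i => makeGetter(md.getColumnType(i)))
    new Iterator[Seq[Any]] {
      private var fetched = false // has rs.next() been called for the pending row?
      private var hasRow = false
      def hasNext: Boolean = {
        if (!fetched) { hasRow = rs.next(); fetched = true }
        hasRow
      }
      def next(): Seq[Any] = {
        if (!hasNext) throw new NoSuchElementException("end of ResultSet")
        fetched = false
        // Read every column of the current row eagerly, before the cursor moves.
        getters.zipWithIndex.map { case (get, i) => get(rs, i + 1) }
      }
    }
  }
}
```

The iterator does not close the `ResultSet`; in this sketch the caller is assumed to own the connection, statement, and result set.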
Put together, these changes enable the following testing pattern:
```scala
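// Run the same query directly over JDBC.
val jdbcResultSet: ResultSet = conn.prepareStatement(query).executeQuery()
// Derive the Spark schema from the ResultSet's metadata.
val resultSchema: StructType = JdbcUtils.getSchema(jdbcResultSet,
  JdbcDialects.get("jdbc:postgresql"))
// Convert the JDBC rows to public Rows using that schema.
val jdbcRows: Seq[Row] = JdbcUtils.resultSetToRows(jdbcResultSet,
  resultSchema).toSeq
checkAnswer(sparkResult, jdbcRows) // in a test case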
```
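
For readers without Spark's test utilities at hand, the comparison step in such a harness could look roughly like the sketch below. This is not Spark's `checkAnswer` implementation; `RowComparison` and `sameRows` are hypothetical names, rows are modelled as plain `Seq[Any]`, and the comparison assumes the query has no `ORDER BY`, so row order is ignored while duplicate rows still count.

```scala
object RowComparison {
  // Order-insensitive comparison: two results match when each distinct row
  // occurs the same number of times in both.
  def sameRows(expected: Seq[Seq[Any]], actual: Seq[Seq[Any]]): Boolean = {
    def counts(rows: Seq[Seq[Any]]): Map[Seq[Any], Int] =
      rows.groupBy(identity).map { case (row, dups) => row -> dups.size }
    counts(expected) == counts(actual)
  }
}
```

A harness would call `RowComparison.sameRows(sparkRows, jdbcRows)` after converting both result sets to the same row representation.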
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark modularize-jdbc-internals
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14907.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14907
----
commit 17d770a85e0921a5bfcbe00aead71cf169f76119
Author: Josh Rosen <[email protected]>
Date: 2016-08-23T19:02:55Z
Move ResultSet -> Seq[InternalRow] conversion into JdbcUtils
commit 682b5917341a530627a5d873196f6c4a3259a91b
Author: Josh Rosen <[email protected]>
Date: 2016-08-23T19:07:44Z
Make new method private[spark]
commit ec49accbf2c532766fb66c3b8910fd7c81563839
Author: Josh Rosen <[email protected]>
Date: 2016-08-23T19:18:03Z
Move getCatalystType to JdbcUtils and add new getSchema() method.
commit 025c9d08d485ebfab0dd23f1ed5065e537ad0437
Author: Josh Rosen <[email protected]>
Date: 2016-08-23T19:54:54Z
Add public resultSetToRows() method for converting to public rows.
commit 05dfe5276017862dadf8791672de818329d52723
Author: Josh Rosen <[email protected]>
Date: 2016-08-24T02:25:35Z
Remove InputMetrics from a public API.
commit fca548ae24bbc60e8c04ea4e6756dfb19942fb61
Author: Josh Rosen <[email protected]>
Date: 2016-08-25T00:49:28Z
Open up JdbcDialects.get as developerapi.
commit 43cbef6b4310dd9af08672bcaa01d8114b1fe5fc
Author: Josh Rosen <[email protected]>
Date: 2016-09-01T00:32:25Z
Merge remote-tracking branch 'origin/master' into modularize-jdbc-internals
----