[PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

via GitHub Wed, 06 Mar 2024 10:59:52 -0800


johnnywalker opened a new pull request, #45410:
URL: https://github.com/apache/spark/pull/45410

Registered dialects may differ between driver and executors. Bind dialect to
the RDD on creation to use the same dialect regardless of JVM state.


### What changes were proposed in this pull request?

Bind the resolved JDBC dialect to the `JDBCRDD` instance during construction.

### Why are the changes needed?

While working on a Spark application, I created a custom JDBC dialect and
registered it before creating a `SparkSession`. Spark did not use the dialect's
`JdbcSQLQueryBuilder`, however, so I investigated further. I discovered that
`JDBCRDD` [repeats the dialect
resolution](https://github.com/apache/spark/blame/fe2174d8caaf6f9474319a8d729a2f537038b4c1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L177),
and this runs on the executor.

From what I can understand, additional dialects registered by the driver
will not be available in the executor when running in cluster mode.
Consequently, I propose binding the resolved dialect to the `JDBCRDD` instance
to produce deterministic behavior.
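
To make the failure mode concrete, here is a minimal sketch in plain Scala with no Spark dependency. Every name in it (`Dialect`, `DialectRegistry`, `LateResolvingRdd`, `BoundDialectRdd`, `Demo`) is hypothetical and only models the behavior described above; it is not the code in `JDBCRDD` or in this patch. Two registry instances stand in for the separate driver and executor JVMs.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a JDBC dialect; not Spark's real class.
trait Dialect {
  def name: String
  def canHandle(url: String): Boolean
}

// Fallback used when no registered dialect matches the URL.
object DefaultDialect extends Dialect {
  val name = "default"
  def canHandle(url: String): Boolean = true
}

// A user-defined dialect that only the driver registers.
final case class CustomDialect(prefix: String) extends Dialect {
  val name = "custom"
  def canHandle(url: String): Boolean = url.startsWith(prefix)
}

// Models per-JVM registry state: each instance represents one JVM.
final class DialectRegistry {
  private val registered = mutable.ListBuffer.empty[Dialect]
  def register(d: Dialect): Unit = registered.prepend(d)
  def resolve(url: String): Dialect =
    registered.find(_.canHandle(url)).getOrElse(DefaultDialect)
}

// Models the reported behavior: the dialect is looked up again in
// whichever JVM computes the partition.
final class LateResolvingRdd(url: String) {
  def compute(localRegistry: DialectRegistry): String =
    localRegistry.resolve(url).name
}

// Models the proposed behavior: the dialect resolved at construction
// travels with the RDD, so the local registry is never consulted.
final class BoundDialectRdd(url: String, dialect: Dialect) {
  def compute(localRegistry: DialectRegistry): String = dialect.name
}

object Demo {
  // Returns the dialect names seen by the late-resolving and bound variants.
  def run(): (String, String) = {
    val driver = new DialectRegistry
    val executor = new DialectRegistry // the custom dialect is never registered here
    driver.register(CustomDialect("jdbc:custom:"))

    val url = "jdbc:custom://host/db"
    val late = new LateResolvingRdd(url).compute(executor)
    val bound = new BoundDialectRdd(url, driver.resolve(url)).compute(executor)
    (late, bound)
  }
}
```

In this model the late-resolving variant falls back to the default dialect on the "executor", while the bound variant keeps the dialect chosen on the "driver". The sketch skips serialization, which a bound dialect would also have to survive in order to reach real executors.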

### Does this PR introduce _any_ user-facing change?

No, aside from resolving a bug that some users may have encountered.

### How was this patch tested?

I have not written tests, but I have manually tested the fix on a local
3.5.1 cluster. Applying this change produced the desired behavior (i.e. the
custom dialect's `JdbcSQLQueryBuilder` was used to generate queries from
executors).

### Was this patch authored or co-authored using generative AI tooling?

