Andreas Franz created SPARK-55830:
-------------------------------------
Summary: JDBC predicate pushdown drops driver properties causing
connection failure
Key: SPARK-55830
URL: https://issues.apache.org/jira/browse/SPARK-55830
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.2
Reporter: Andreas Franz
When reading data via JDBC using *predicate pushdown (predicates)*, Spark
appears to lose custom *JDBC driver properties* during connection creation.
This becomes problematic when using a *custom SocketFactory*, such as the
*Google Cloud SQL JDBC SocketFactory*, which relies on additional driver
properties for establishing the connection.
As a result, Spark fails to connect to the database when predicates are used.
Without predicates, the connection works correctly.
*Examples to reproduce:*
{code:java}
def get_jdbc_parameters():
return {
"url": f"jdbc:postgresql:///<DB_NAME>",
"socketFactory": "com.google.cloud.sql.postgres.SocketFactory",
"user": "<SERVICE ACCOUNT AUTHORIZED TO ACCESS CLOUDSQL DATABASE>",
"cloudSqlInstance": "<INSTANCE_CONNECTION_NAME>",
"cloudSqlTargetPrincipal": "<SERVICE ACCOUNT AUTHORIZED TO ACCESS
CLOUDSQL DATABASE>",
"enableIamAuth": "true",
"password": "disabled",
"sslmode": "disable",
}
def get_table_name():
return "<TABLE>"
# CASE 1: No predicates -> returns up to 10 rows as expected
jdbc_parameters = get_jdbc_parameters()
spark.read.jdbc(
url=jdbc_parameters.pop("url"),
table=get_table_name(),
properties=jdbc_parameters,
).limit(10).display()
# CASE 2: Empty predicated -> fails with exception below
jdbc_parameters = get_jdbc_parameters()
predicates = []
spark.read.jdbc(
url=jdbc_parameters.pop("url"),
table=get_table_name(),
properties=jdbc_parameters,
predicates=predicates,
).limit(10).display()
'''
Py4JJavaError: An error occurred while calling o488.jdbc.
: org.postgresql.util.PSQLException: Connection to :5432 refused. Check that
the hostname and port are correct and that the postmaster is accepting TCP/IP
connections.
at
org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:342)
at
org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
at org.postgresql.Driver.makeConnection(Driver.java:443)
at org.postgresql.Driver.connect(Driver.java:297)
'''
# CASE 3: Non-empty predicates -> fails with exception below
jdbc_parameters = get_jdbc_parameters()
predicates = ["id BETWEEN 1000 AND 1010 "]
spark.read.jdbc(
url=jdbc_parameters.pop("url"),
table=get_table_name(),
properties=jdbc_parameters,
predicates=predicates,
).limit(10).display()
'''
Py4JJavaError: An error occurred while calling o488.jdbc.
: org.postgresql.util.PSQLException: Connection to :5432 refused. Check that
the hostname and port are correct and that the postmaster is accepting TCP/IP
connections.
at
org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:342)
at
org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
at org.postgresql.Driver.makeConnection(Driver.java:443)
at org.postgresql.Driver.connect(Driver.java:297)
'''
# CASE 4: Predicates with JDBC parameters encoded in URL as workaround -> only
returns entries of the given predicate range
jdbc_parameters = get_jdbc_parameters()
predicates = ["id BETWEEN 1000 AND 1010 "]
url = jdbc_parameters.pop("url")
params = "&".join(f"{k}={v}" for k, v in jdbc_parameters.items())
spark.read.jdbc(
url=f"{url}?{params}",
table=get_table_name(),
predicates=predicates,
).limit(10).display()
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]