Nathan Loyer created SPARK-25802:
------------------------------------
Summary: Use JDBC Oracle Binds from Spark SQL
Key: SPARK-25802
URL: https://issues.apache.org/jira/browse/SPARK-25802
Project: Spark
Issue Type: New Feature
Components: Input/Output
Affects Versions: 2.4.0
Reporter: Nathan Loyer
In case those reading aren't aware, any time a query is run against Oracle, the
database creates a plan and caches it. When a query is run, first it checks to
see if it can reuse a plan from the cache. When you use literals it has to
create a new plan even though there is one in the cache that matches everything
except for that literal value, which is the case for the spark generated
queries. Using binds/parameters instead allows the database to reuse the
previous plans and reduce the load on the database.
My team is using spark sql with JDBC to query large amounts of data from
production Oracle databases. The queries built with the JDBCRDD class today
results in our databases having to do more work than they really need to, which
results in more load on our databases, which affects our users. For this reason
we've been investigating if it is possible to use spark sql with query
binds/parameters. From what I can tell from reviewing documentation and diving
into the spark source code, this does not appear to be possible today.
Our spark usage looks like this:
{code:java}
spark.read()
.format("jdbc")
.option("url", connectionUrl)
.option("dbtable", "( select c1, c2, c3 from tableName where c4 >
TO_DATE('2018-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS') )")
.option("driver", "oracle.jdbc.OracleDriver")
.option("fetchSize", fetchSize)
.option("lowerBound", minId)
.option("upperBound", maxId)
.option("partitionColumn", "ID")
.option("numPartitions", numPartitions)
.load();
{code}
So one way to alter the call to get what I am looking for would be like this:
{code:java}
spark.read()
.format("jdbc")
.option("url", connectionUrl)
.option("dbtable", "( select c1, c2, c3 from tableName where c4 >
TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS') )")
.option("driver", "oracle.jdbc.OracleDriver")
.option("fetchSize", fetchSize)
.option("lowerBound", minId)
.option("upperBound", maxId)
.option("partitionColumn", "ID")
.option("numPartitions", numPartitions)
.option("binds", ImmutableMap.of("timestamp", "2018-01-01 00:00:00"))
.load();
{code}
The queries that spark generates from this should something look like:
{code:sql}
SELECT c1, c2, c3
FROM
(
SELECT c1, c2, c3
FROM tableName
WHERE column > TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS')
) AS __SPARK_GEN_JDBC_SUBQUERY_NAME_1
WHERE
ID >= :partitionLowerBound
AND ID < :partitionUpperBound
{code}
I am not certain if this parameterized query syntax is supported by all other
jdbc drivers or if it improves the performance on those databases or not.
I'm also not sure if I picked the correct component or versions. Feel free to
correct them.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]