Nathan Loyer created SPARK-25802:
------------------------------------

             Summary: Use JDBC Oracle Binds from Spark SQL
                 Key: SPARK-25802
                 URL: https://issues.apache.org/jira/browse/SPARK-25802
             Project: Spark
          Issue Type: New Feature
          Components: Input/Output
    Affects Versions: 2.4.0
            Reporter: Nathan Loyer


In case those reading aren't aware, any time a query is run against Oracle, the 
database creates a plan and caches it. When a query is run, first it checks to 
see if it can reuse a plan from the cache. When you use literals it has to 
create a new plan even though there is one in the cache that matches everything 
except for that literal value, which is the case for the spark generated 
queries. Using binds/parameters instead allows the database to reuse the 
previous plans and reduce the load on the database.

My team is using spark sql with JDBC to query large amounts of data from 
production Oracle databases. The queries built with the JDBCRDD class today 
results in our databases having to do more work than they really need to, which 
results in more load on our databases, which affects our users. For this reason 
we've been investigating if it is possible to use spark sql with query 
binds/parameters.  From what I can tell from reviewing documentation and diving 
into the spark source code, this does not appear to be possible today.

Our spark usage looks like this:

{code:java}
spark.read()
    .format("jdbc")
    .option("url", connectionUrl)
    .option("dbtable", "( select c1, c2, c3 from tableName where c4 > 
TO_DATE('2018-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS') )")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("fetchSize", fetchSize)
    .option("lowerBound", minId)
    .option("upperBound", maxId)
    .option("partitionColumn", "ID")
    .option("numPartitions", numPartitions)
    .load();
{code}

So one way to alter the call to get what I am looking for would be like this:

{code:java}
spark.read()
    .format("jdbc")
    .option("url", connectionUrl)
    .option("dbtable", "( select c1, c2, c3 from tableName where c4 > 
TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS') )")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("fetchSize", fetchSize)
    .option("lowerBound", minId)
    .option("upperBound", maxId)
    .option("partitionColumn", "ID")
    .option("numPartitions", numPartitions)
    .option("binds", ImmutableMap.of("timestamp", "2018-01-01 00:00:00"))
    .load();
{code}

The queries that spark generates from this should something look like:

{code:sql}
SELECT c1, c2, c3
FROM
  (
    SELECT c1, c2, c3
    FROM tableName
    WHERE column > TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS')
  ) AS __SPARK_GEN_JDBC_SUBQUERY_NAME_1
WHERE
  ID >= :partitionLowerBound
  AND ID < :partitionUpperBound
{code}

I am not certain if this parameterized query syntax is supported by all other 
jdbc drivers or if it improves the performance on those databases or not.

I'm also not sure if I picked the correct component or versions. Feel free to 
correct them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to