[ https://issues.apache.org/jira/browse/SPARK-25802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25802:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                       3.1.0

> Use JDBC Oracle Binds from Spark SQL
> ------------------------------------
>
>                 Key: SPARK-25802
>                 URL: https://issues.apache.org/jira/browse/SPARK-25802
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Nathan Loyer
>            Priority: Major
>
> In case readers aren't aware: any time a query is run against Oracle, the 
> database first checks whether it can reuse an execution plan from its cache; 
> if not, it creates a new plan and caches it. When a query uses literals, 
> Oracle has to create a new plan even if the cache holds one that matches 
> everything except that literal value, which is exactly the case for the 
> Spark-generated queries. Using binds/parameters instead allows the database 
> to reuse previous plans and reduces the load on the database.
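> To illustrate the difference, here is a minimal plain-JDBC sketch (not Spark; 
> {{conn}} is assumed to be an open {{java.sql.Connection}} to Oracle):
> {code:java}
> import java.sql.Connection;
> import java.sql.PreparedStatement;
> import java.sql.ResultSet;
> import java.sql.SQLException;
> import java.sql.Statement;
> 
> public class BindExample {
>     static void literalVsBind(Connection conn) throws SQLException {
>         // Literal: every distinct date value changes the SQL text, so Oracle
>         // must hard-parse and cache a brand-new plan for each one.
>         try (Statement stmt = conn.createStatement();
>              ResultSet rs = stmt.executeQuery(
>                  "select c1, c2, c3 from tableName where c4 > "
>                      + "TO_DATE('2018-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS')")) {
>             // consume rs ...
>         }
> 
>         // Bind: the SQL text is constant, so Oracle can reuse the cached plan
>         // regardless of which date value is bound.
>         try (PreparedStatement ps = conn.prepareStatement(
>                  "select c1, c2, c3 from tableName where c4 > "
>                      + "TO_DATE(?, 'YYYY-MM-DD HH24:MI:SS')")) {
>             ps.setString(1, "2018-01-01 00:00:00");
>             try (ResultSet rs = ps.executeQuery()) {
>                 // consume rs ...
>             }
>         }
>     }
> }
> {code}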
> My team uses Spark SQL with JDBC to query large amounts of data from 
> production Oracle databases. The queries that the JDBCRDD class builds today 
> make our databases do more work than they really need to, which increases 
> their load and affects our users. For this reason we've been investigating 
> whether it is possible to use Spark SQL with query binds/parameters. From 
> what I can tell from reviewing the documentation and diving into the Spark 
> source code, this does not appear to be possible today.
> Our Spark usage looks like this:
> {code:java}
> spark.read()
>     .format("jdbc")
>     .option("url", connectionUrl)
>     .option("dbtable", "( select c1, c2, c3 from tableName where c4 > 
> TO_DATE('2018-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS') )")
>     .option("driver", "oracle.jdbc.OracleDriver")
>     .option("fetchSize", fetchSize)
>     .option("lowerBound", minId)
>     .option("upperBound", maxId)
>     .option("partitionColumn", "ID")
>     .option("numPartitions", numPartitions)
>     .load();
> {code}
> So one way to alter the call to get what I am looking for would be this:
> {code:java}
> spark.read()
>     .format("jdbc")
>     .option("url", connectionUrl)
>     .option("dbtable", "( select c1, c2, c3 from tableName where c4 > 
> TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS') )")
>     .option("driver", "oracle.jdbc.OracleDriver")
>     .option("fetchSize", fetchSize)
>     .option("lowerBound", minId)
>     .option("upperBound", maxId)
>     .option("partitionColumn", "ID")
>     .option("numPartitions", numPartitions)
>     .option("binds", ImmutableMap.of("timestamp", "2018-01-01 00:00:00"))
>     .load();
> {code}
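> This "binds" option does not exist today; internally, the JDBC source could 
> translate the named binds into standard JDBC {{?}} placeholders before 
> executing each partition's query. A rough sketch of that translation (the 
> class and method names here are hypothetical, not existing Spark APIs):
> {code:java}
> import java.sql.Connection;
> import java.sql.PreparedStatement;
> import java.sql.SQLException;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Map;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> 
> public class BindRewriter {
>     // Matches either a single-quoted SQL literal or an Oracle-style :name bind.
>     private static final Pattern TOKEN =
>         Pattern.compile("'[^']*'|:([A-Za-z][A-Za-z0-9_]*)");
> 
>     static PreparedStatement prepareWithBinds(
>             Connection conn, String sql, Map<String, String> binds)
>             throws SQLException {
>         List<String> order = new ArrayList<>();
>         StringBuffer rewritten = new StringBuffer();
>         Matcher m = TOKEN.matcher(sql);
>         while (m.find()) {
>             if (m.group(1) == null) {
>                 continue; // quoted literal: copied through untouched
>             }
>             order.add(m.group(1));        // remember this bind's position
>             m.appendReplacement(rewritten, "?");
>         }
>         m.appendTail(rewritten);
> 
>         PreparedStatement ps = conn.prepareStatement(rewritten.toString());
>         for (int i = 0; i < order.size(); i++) {
>             ps.setString(i + 1, binds.get(order.get(i)));
>         }
>         return ps;
>     }
> }
> {code}
> The quoted-literal alternative in the pattern keeps the rewrite from touching 
> colons inside string literals such as the TO_DATE format mask above.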
> The queries that Spark generates from this should look something like:
> {code:sql}
> SELECT c1, c2, c3
> FROM
>   (
>     SELECT c1, c2, c3
>     FROM tableName
>     WHERE c4 > TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS')
>   ) __SPARK_GEN_JDBC_SUBQUERY_NAME_1
> WHERE
>   ID >= :partitionLowerBound
>   AND ID < :partitionUpperBound
> {code}
> I am not certain whether this parameterized query syntax is supported by 
> other JDBC drivers, or whether it improves performance on those databases.
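> As far as I can tell, standard JDBC only defines positional {{?}} 
> placeholders; the {{:name}} syntax above relies on an Oracle driver 
> extension ({{OraclePreparedStatement.setStringAtName}}, if I'm reading the 
> ojdbc javadoc correctly), so a portable implementation would presumably 
> rewrite the names to {{?}} as in the sketch above. For example:
> {code:java}
> import java.sql.Connection;
> import java.sql.PreparedStatement;
> import java.sql.ResultSet;
> import java.sql.SQLException;
> 
> import oracle.jdbc.OraclePreparedStatement;
> 
> public class NamedBindExample {
>     static void oracleNamedBind(Connection conn) throws SQLException {
>         // Oracle-only: named binds need the Oracle-specific statement interface.
>         try (PreparedStatement ps = conn.prepareStatement(
>                  "select c1 from tableName where c4 > "
>                      + "TO_DATE(:ts, 'YYYY-MM-DD HH24:MI:SS')")) {
>             OraclePreparedStatement ops = ps.unwrap(OraclePreparedStatement.class);
>             ops.setStringAtName("ts", "2018-01-01 00:00:00");
>             try (ResultSet rs = ops.executeQuery()) {
>                 // consume rs ...
>             }
>         }
>     }
> }
> {code}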
> I'm also not sure if I picked the correct component or versions. Feel free to 
> correct them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
