Look at the pushdown plans for all the TPCDS queries here: <https://github.com/oracle/spark-oracle/wiki/TPCDS-Queries>. We push Joins, Aggregates, Windowing, etc.; as I said, we can do complete pushdown of 95 of the 99 TPCDS queries. The generic JDBC datasource pushes only single-table scans, filters, and partial aggregates. In that case a lot of data is moved from the Oracle instance to Spark during query execution.
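To make the contrast concrete, here is a minimal sketch using the standard Spark JDBC datasource (connection URL, credentials, and the TPCDS join are illustrative; see the Quick Start Guide for the actual spark-oracle setup):

```scala
import org.apache.spark.sql.functions.sum

// Read two TPCDS tables through the generic JDBC datasource.
// Only the per-table scans and filters are pushed to Oracle here.
val oraUrl = "jdbc:oracle:thin:@//dbhost:1521/orclpdb" // illustrative URL
val sales = spark.read.format("jdbc")
  .option("url", oraUrl)
  .option("dbtable", "STORE_SALES")
  .option("user", user).option("password", pass)
  .load()
val items = spark.read.format("jdbc")
  .option("url", oraUrl)
  .option("dbtable", "ITEM")
  .option("user", user).option("password", pass)
  .load()

// With the generic datasource this join and final aggregate run in
// Spark, so rows from both tables move across the wire. With the
// spark-oracle extensions the same plan collapses to a single OraScan
// and the equivalent SQL executes inside Oracle.
val byCategory = sales
  .join(items, sales("ss_item_sk") === items("i_item_sk"))
  .groupBy(items("i_category"))
  .agg(sum(sales("ss_net_paid")))
```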
Beyond this, the SQL Macro <https://github.com/oracle/spark-oracle/wiki/Spark_SQL_macros> feature can translate certain kinds of UDFs to Oracle expressions, which again avoids a lot of data movement: instead of the UDF executing in Spark, an equivalent Oracle expression is evaluated in Oracle. This works on on-premise Oracle as well; it is currently tested on 19c.

regards,
Harish.

> On Jan 14, 2022, at 2:51 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hello,
>
> Thanks for this info.
>
> Have you tested this feature on Oracle on-premise, say 11c or 12c, besides ADW in the Cloud?
>
> I can see the transactional feature being useful in terms of commit/rollback to Oracle, but I cannot figure out the performance gains from your blog etc.
>
> My concern is that we currently connect to Oracle, as well as many other JDBC-compliant databases, through Spark generic JDBC connections <https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html> with the same look and feel. Unless there is an overriding reason, I don't see why there is a need to switch to this feature.
>
> Cheers
>
> On Fri, 14 Jan 2022 at 00:50, Harish Butani <rhbutani.sp...@gmail.com> wrote:
>
> Spark on Oracle is now available as an open-source, Apache-licensed GitHub repo <https://github.com/oracle/spark-oracle>. Build and deploy it as an extension jar in your Spark clusters.
>
> Use it to combine Apache Spark programs with data in your existing Oracle databases without expensive data copying or query-time data movement.
> The core capability is Optimizer extensions that collapse SQL operator sub-graphs to an OraScan that executes equivalent SQL in Oracle. Physical plan parallelism <https://github.com/oracle/spark-oracle/wiki/Query-Splitting> can be controlled to split Spark tasks to operate on Oracle data block ranges, on resultset pages, or on table partitions.
>
> We push down large parts of Spark SQL to Oracle; for example, 95 of the 99 TPCDS queries are completely pushed to Oracle. <https://github.com/oracle/spark-oracle/wiki/TPCDS-Queries>
>
> With Spark SQL macros <https://github.com/oracle/spark-oracle/wiki/Spark_SQL_macros> you can write custom Spark UDFs that get translated and pushed down as Oracle SQL expressions.
>
> With DML pushdown <https://github.com/oracle/spark-oracle/wiki/DML-Support>, inserts in Spark SQL get pushed down as transactionally consistent inserts/updates on Oracle tables.
>
> See the Quick Start Guide <https://github.com/oracle/spark-oracle/wiki/Quick-Start-Guide> on how to set up an Oracle free-tier ADW instance, load it with TPCDS data, and try out the Spark on Oracle Demo <https://github.com/oracle/spark-oracle/wiki/Demo> on your Spark cluster.
>
> More details can be found in our blog <https://hbutani.github.io/blogs/blog/Spark_on_Oracle_Blog.html> and the project wiki <https://github.com/oracle/spark-oracle/wiki>.
>
> regards,
> Harish Butani
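P.S. A sketch of the kind of UDF the SQL Macro feature can translate, using only the standard Spark UDF API (the column name, `salesDF`, and the 0.9/1.08 factors are illustrative, not from the project docs):

```scala
import org.apache.spark.sql.functions.{col, udf}

// A simple arithmetic UDF over a TPCDS sales amount.
val netPrice = udf((amount: Double) => amount * 0.9 * 1.08)

// With plain Spark + generic JDBC, every row is fetched from Oracle
// and this UDF body runs as a black box on Spark executors.
val withNet = salesDF.withColumn("net_price", netPrice(col("ss_net_paid")))

// With the SQL Macro feature, a body like this can instead be
// translated to an equivalent Oracle expression, e.g.
//   SS_NET_PAID * 0.9 * 1.08
// so evaluation happens inside Oracle and only results move to Spark.
```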