John Muller created SPARK-20259:
-----------------------------------
Summary: Support push down join optimizations in DataFrameReader
when loading from JDBC
Key: SPARK-20259
URL: https://issues.apache.org/jira/browse/SPARK-20259
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.1.0, 1.6.2
Reporter: John Muller
Priority: Minor
Given two dataframes loaded from the same JDBC connection:
{code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid}
val ordersDF = spark.read
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "northwind.orders")
.option("user", "username")
.option("password", "password")
.load().toDS
val productDF = spark.read
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "northwind.product")
.option("user", "username")
.option("password", "password")
.load().toDS
ordersDF.createOrReplaceTempView("orders")
productDF.createOrReplaceTempView("product")
// Followed by a join between them:
val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o
INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name")
{code}
Catalyst should optimize the query to be:
SELECT northwind.product.name, SUM(northwind.orders.qty)
FROM northwind.orders
INNER JOIN northwind.product ON
northwind.orders.product_id = northwind.product.product_id
GROUP BY p.name
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]