[jira] [Created] (SPARK-20259) Support push down join optimizations in DataFrameReader when loading from JDBC

John Muller (JIRA) Fri, 07 Apr 2017 12:11:54 -0700

John Muller created SPARK-20259:
-----------------------------------

             Summary: Support push down join optimizations in DataFrameReader 
when loading from JDBC
                 Key: SPARK-20259
                 URL: https://issues.apache.org/jira/browse/SPARK-20259
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.0, 1.6.2
            Reporter: John Muller
            Priority: Minor



Given two dataframes loaded from the same JDBC connection:

{code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid}
val ordersDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "northwind.orders")
  .option("user", "username")
  .option("password", "password")
  .load().toDS
  
val productDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "northwind.product")
  .option("user", "username")
  .option("password", "password")
  .load().toDS
  
ordersDF.createOrReplaceTempView("orders")
productDF.createOrReplaceTempView("product")

// Followed by a join between them:
val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o 
INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name")
{code}

Catalyst should optimize the query to be:
SELECT northwind.product.name, SUM(northwind.orders.qty)
FROM northwind.orders
INNER JOIN northwind.product ON
  northwind.orders.product_id = northwind.product.product_id
GROUP BY p.name



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (SPARK-20259) Support push down join optimizations in DataFrameReader when loading from JDBC

Reply via email to