Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21590#discussion_r198315709
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala ---
    @@ -109,6 +134,20 @@ class JDBCOptions(
         s"When reading JDBC data sources, users need to specify all or none 
for the following " +
           s"options: '$JDBC_PARTITION_COLUMN', '$JDBC_LOWER_BOUND', 
'$JDBC_UPPER_BOUND', " +
           s"and '$JDBC_NUM_PARTITIONS'")
    +
    +  require(!(query.isDefined && partitionColumn.isDefined),
    +    s"""
    +       |Options '$JDBC_QUERY_STRING' and '$JDBC_PARTITION_COLUMN' can not be specified together.
    +       |Please define the query using `$JDBC_TABLE_NAME` option instead and make sure to qualify
    +       |the partition columns using the supplied subquery alias to resolve any ambiguity.
    +       |Example :
    +       |spark.read.format("jdbc")
    +       |        .option("dbtable", "(select c1, c2 from t1) as subq")
    +       |        .option("partitionColumn", "subq.c1")
    +       |        .load()
    +     """.stripMargin
    +  )
    --- End diff --
    
    @maropu Currently we disallow it to be on the safe side. Let's take your example. When using the `query` option to pass in the query, we basically expect the user to supply
    ```SQL
    select c0 p0, c1 p1, c2 p2 from t where c0 > 1
    ```
    In Spark, we will parenthesize the query and add an alias to conform to the table subquery syntax. Given that the user input the above query, they could decide to qualify the partition column names with the table name. So they could do the following:
    
    ```scala
    val df = spark.read
      .format("jdbc")
      .option("driver", "org.postgresql.Driver")
      .option("url", "jdbc:postgresql://localhost:5432/postgres?user=maropu")
      .option("query", "select c0 p0, c1 p1, c2 p2 from t where c0 > 1")
      .option("partitionColumn", "t.p2")  // the user qualifies the column name
      .option("lowerBound", "1")
      .option("upperBound", "3")
      .option("numPartitions", "2")
      .load()
    ```
    
    In this case we will end up generating a query of the following form:
    ```sql
    select * from (select c0 p0, c1 p1, c2 p2 from t where c0 > 1) __SPARK_GEN_ALIAS
    where t.p2 >= 1 and t.p2 <= 3
    ```
    However, this would be an invalid query: the alias `t` is not visible outside the parenthesized subquery, so `t.p2` cannot be resolved in the outer predicate. In the `query` option, it is also possible to specify a complex query involving joins, which makes qualification even harder, as sketched below.
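
    For instance (a sketch with hypothetical tables `a` and `b`, not taken from the PR), there is no qualifier the user could choose that stays valid in the generated outer predicate:
    ```scala
    // Hypothetical join query via the 'query' option. If this were allowed,
    // qualifying the partition column as "a.id" (or "b.id") would leak an
    // inner-scope qualifier into the generated outer predicate:
    //   select * from (select a.id, b.val from a join b on a.id = b.id)
    //     __SPARK_GEN_ALIAS where a.id >= 1 and a.id <= 3
    // which is invalid; with this change the combination fails fast instead.
    val joined = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/postgres?user=maropu")
      .option("query", "select a.id, b.val from a join b on a.id = b.id")
      .option("partitionColumn", "a.id")
      .option("lowerBound", "1")
      .option("upperBound", "3")
      .option("numPartitions", "2")
      .load()
    ```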
    
    That's the reason we disallow it, to be on the safe side. With the `dbtable` option, users are responsible for explicitly specifying the alias and therefore know how to qualify the partition columns, as in the sketch below.
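
    A minimal sketch of that workaround, mirroring the example in the error message above (the alias `subq` and the connection settings are illustrative):
    ```scala
    // The user supplies the subquery alias explicitly via 'dbtable' and
    // qualifies the partition column with it, so the generated partition
    // predicates (e.g. subq.p2 >= 1) resolve without ambiguity.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/postgres?user=maropu")
      .option("dbtable", "(select c0 p0, c1 p1, c2 p2 from t where c0 > 1) as subq")
      .option("partitionColumn", "subq.p2")
      .option("lowerBound", "1")
      .option("upperBound", "3")
      .option("numPartitions", "2")
      .load()
    ```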
    
    Let's see if we can improve this in the future. If you have some ideas, please let us know.

