[ 
https://issues.apache.org/jira/browse/SPARK-31338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212373#comment-17212373
 ] 

Ankush Chatterjee commented on SPARK-31338:
-------------------------------------------

When a read like the following is used:
{code:java}
val jdbcRead = spark.read
  .option("fetchsize", fetchSize)
  .jdbc(
    url = connectionURL,
    table = query,
    columnName = partKey,
    lowerBound = lBound,
    upperBound = hBound,
    numPartitions = numParts,
    connectionProperties = connProps)
{code}
 

Spark generates one query per partition. For the first partition, Spark appends 
"or $column is null" to the WHERE clause; this makes some databases do a full 
table scan, with a heavy performance impact, even when the column is declared 
NOT NULL.

In JDBCRelation.scala:
 
{code:java}
while (i < numPartitions) {
  val lBoundValue = boundValueToString(currentValue)
  // Every partition except the first gets a lower bound.
  val lBound = if (i != 0) s"$column >= $lBoundValue" else null
  currentValue += stride
  val uBoundValue = boundValueToString(currentValue)
  // Every partition except the last gets an upper bound.
  val uBound = if (i != numPartitions - 1) s"$column < $uBoundValue" else null
  val whereClause =
    if (uBound == null) {
      lBound
    } else if (lBound == null) {
      // First partition: the NULL check is added unconditionally,
      // even if the column is declared NOT NULL in the database.
      s"$uBound or $column is null"
    } else {
      s"$lBound AND $uBound"
    }
  ans += JDBCPartition(whereClause, i)
  i = i + 1
}
{code}
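To make the effect concrete, here is a standalone sketch of the same loop with 
illustrative values (the column name, bounds, and partition count are made up 
for the demo, not taken from the issue), printing the WHERE clause each 
partition would get:
{code:java}
object PartitionClauseDemo extends App {
  // Illustrative values only.
  val column = "l_orderkey"
  val numPartitions = 4
  val stride = 100L
  var currentValue = 0L

  var i = 0
  while (i < numPartitions) {
    val lBound = if (i != 0) s"$column >= $currentValue" else null
    currentValue += stride
    val uBound = if (i != numPartitions - 1) s"$column < $currentValue" else null
    val whereClause =
      if (uBound == null) lBound
      else if (lBound == null) s"$uBound or $column is null"
      else s"$lBound AND $uBound"
    println(s"partition $i: WHERE $whereClause")
    i += 1
  }
}
// prints:
// partition 0: WHERE l_orderkey < 100 or l_orderkey is null
// partition 1: WHERE l_orderkey >= 100 AND l_orderkey < 200
// partition 2: WHERE l_orderkey >= 200 AND l_orderkey < 300
// partition 3: WHERE l_orderkey >= 300
{code}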
 

Is it feasible to add an option in JDBCOptions to enable/disable the addition 
of "or $column is null", since partitioning on a NOT NULL column is 
commonplace? A sketch of what that could look like follows.
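A minimal usage sketch, assuming the option were named "partitionColumnNotNull" 
(both the name and the option itself are hypothetical; nothing like it exists 
in Spark today, and the URL and credentials are placeholders):
{code:java}
import java.util.Properties
import org.apache.spark.sql.SparkSession

object NotNullPartitionedRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-read-demo").getOrCreate()

    val connProps = new Properties()
    connProps.setProperty("user", "demo_user")          // placeholder
    connProps.setProperty("password", "demo_password")  // placeholder

    val df = spark.read
      .option("fetchsize", "10000")
      // Hypothetical option proposed above; NOT part of Spark today.
      // When set, Spark would skip the "or $column is null" clause
      // for the first partition.
      .option("partitionColumnNotNull", "true")
      .jdbc(
        url = "jdbc:postgresql://dbhost:5432/tpch",     // placeholder URL
        table = "public.lineitem_sf1000",
        columnName = "l_orderkey",
        lowerBound = 1L,
        upperBound = 6000000001L,
        numPartitions = 16,
        connectionProperties = connProps)

    println(df.rdd.getNumPartitions)
  }
}
{code}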

 

[~olkuznsmith]

 

> Spark SQL JDBC Data Source partitioned read: Spark SQL does not honor the 
> NOT NULL table definition of the partition key.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31338
>                 URL: https://issues.apache.org/jira/browse/SPARK-31338
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.5
>            Reporter: Mohit Dave
>            Priority: Major
>
> h2. *Our Use-case Details:*
> While reading from a JDBC source with Spark SQL, we use the read method 
> below:
> jdbc(url: String, table: String, columnName: String, lowerBound: Long, 
> upperBound: Long, numPartitions: Int, connectionProperties: Properties).
> *Table definition:* 
>  postgres=> \d lineitem_sf1000
>            Table "public.lineitem_sf1000"
>       Column      |         Type          | Modifiers
>  -----------------+-----------------------+-----------
>  *l_orderkey*     | bigint                | not null
>  l_partkey        | bigint                | not null
>  l_suppkey        | bigint                | not null
>  l_linenumber     | bigint                | not null
>  l_quantity       | numeric(10,2)         | not null
>  l_extendedprice  | numeric(10,2)         | not null
>  l_discount       | numeric(10,2)         | not null
>  l_tax            | numeric(10,2)         | not null
>  l_returnflag     | character varying(1)  | not null
>  l_linestatus     | character varying(1)  | not null
>  l_shipdate       | character varying(29) | not null
>  l_commitdate     | character varying(29) | not null
>  l_receiptdate    | character varying(29) | not null
>  l_shipinstruct   | character varying(25) | not null
>  l_shipmode       | character varying(10) | not null
>  l_comment        | character varying(44) | not null
>  Indexes:
>      "l_order_sf1000_idx" btree (l_orderkey)
>  
> *Partition column*: l_orderkey 
> *numPartitions*: 16 
> h2. *Problem details:* 
>  
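> For context, the partition boundaries in the generated SQL below are 
> consistent with lowerBound = 1 and upperBound = 6000000001 (these exact 
> bounds are inferred from the queries, not stated in the report): the stride 
> is (6000000001 - 1) / 16 = 375000000, so the cut points fall at 375000001, 
> 750000001, 1125000001, and so on.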
> {code:java}
> SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount",
>        "l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus",
>        "l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber",
>        "l_returnflag"
> FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,
>              l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,
>              l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,
>              l_comment
>       FROM public.lineitem_sf1000) query_alias
> WHERE l_orderkey >= 1500000001 AND l_orderkey < 1875000001 {code}
> 15 of the 16 generated queries carry plain range predicates like the one 
> above. The remaining query, for the partition with no lower bound, looks 
> like this:
> {code:java}
> SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount",
>        "l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus",
>        "l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber",
>        "l_returnflag"
> FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,
>              l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,
>              l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,
>              l_comment
>       FROM public.lineitem_sf1000) query_alias
> WHERE l_orderkey < 375000001 or l_orderkey is null {code}
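> For comparison, the predicate the reporter expects for this partition, given 
> the NOT NULL column definition (this is the desired behavior, not what Spark 
> currently emits):
> {code:java}
> ... WHERE l_orderkey < 375000001
> {code}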
> *In this query, Spark fetches the remaining records along with any rows in 
> the table where the partition key is NULL.*
> This hurts performance badly. While the first 15 SQLs took approximately 10 
> minutes to execute, the SQL with the NULL check takes 45 minutes, because 
> the database has to evaluate a second scan of the table (for the OR clause) 
> to look for NULL values of the partition key.
> *Note that I have defined the partition key of the table to be NOT NULL at 
> the database. The SQL for that partition therefore does not need this NULL 
> check; Spark SQL should be able to avoid the condition, and this Jira is 
> intended to fix this behavior.*
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
