[
https://issues.apache.org/jira/browse/SPARK-31338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mohit Dave updated SPARK-31338:
-------------------------------
Description:
h2. *Our Use-case Details:*
While reading from a JDBC source with Spark SQL, we use the partitioned read API:
jdbc(url: String, table: String, columnName: String, lowerBound: Long,
upperBound: Long, numPartitions: Int, connectionProperties: Properties).
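For reference, this is roughly how we issue the read (a minimal sketch; the connection details are placeholders, and the bounds are illustrative values chosen to match the 375000000 stride visible in the generated queries below):
{code:scala}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("lineitem-jdbc-read")
  .master("local[*]")            // placeholder master for a self-contained example
  .getOrCreate()

val props = new Properties()
props.setProperty("user", "postgres")        // placeholder credentials
props.setProperty("password", "********")
props.setProperty("driver", "org.postgresql.Driver")

// Partitioned read: Spark splits the lowerBound..upperBound range on the
// partition column into numPartitions pieces and issues one SELECT per piece.
val lineitem = spark.read.jdbc(
  url = "jdbc:postgresql://dbhost:5432/postgres",  // placeholder URL
  table = "public.lineitem_sf1000",
  columnName = "l_orderkey",
  lowerBound = 1L,
  upperBound = 6000000001L,      // illustrative; yields a stride of 375000000
  numPartitions = 16,
  connectionProperties = props)
{code}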
*Table definition:*
postgres=> \d lineitem_sf1000
Table "public.lineitem_sf1000"
Column | Type | Modifiers
-----------------+-----------------------+-----------
*l_orderkey | bigint | not null*
l_partkey | bigint | not null
l_suppkey | bigint | not null
l_linenumber | bigint | not null
l_quantity | numeric(10,2) | not null
l_extendedprice | numeric(10,2) | not null
l_discount | numeric(10,2) | not null
l_tax | numeric(10,2) | not null
l_returnflag | character varying(1) | not null
l_linestatus | character varying(1) | not null
l_shipdate | character varying(29) | not null
l_commitdate | character varying(29) | not null
l_receiptdate | character varying(29) | not null
l_shipinstruct | character varying(25) | not null
l_shipmode | character varying(10) | not null
l_comment | character varying(44) | not null
Indexes:
"l_order_sf1000_idx" btree (l_orderkey)
*Partition column*: l_orderkey
*numPartitions*: 16
h2. *Problem details:*
A typical per-partition query generated by Spark looks like this:
{code:java}
SELECT
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
FROM (SELECT
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 1500000001 AND
l_orderkey < 1875000001 {code}
Fifteen of the sixteen queries are generated with range predicates like the one
above. The last query looks like this:
{code:java}
SELECT
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
FROM (SELECT
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 375000001 or
l_orderkey is null {code}
*In the last query, we are trying to get the remaining records, along with any
data in the table where the partition key is NULL.*
This hurts performance badly. While the first 15 SQLs took approximately 10
minutes to execute, the last SQL with the NULL check took 45 minutes, because
the OR clause forces a second scan of the table for NULL values of the
partition key.
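For context, the WHERE clauses above are consistent with a simple stride-based split of the lowerBound..upperBound range into numPartitions pieces, where exactly one (open-ended) piece also absorbs NULL keys. The snippet below is only a self-contained approximation that reproduces the clauses we observed; it is not Spark's actual source:
{code:scala}
// Approximate reconstruction of the per-partition WHERE clauses we observed.
// Illustration only; the real logic lives inside Spark's JDBC data source.
def partitionWhereClauses(
    column: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int): Seq[String] = {
  val stride = upperBound / numPartitions - lowerBound / numPartitions
  (0 until numPartitions).map { i =>
    val lower = lowerBound + i * stride
    val upper = lower + stride
    val lBound = if (i > 0) Some(s"$column >= $lower") else None
    val uBound = if (i < numPartitions - 1) Some(s"$column < $upper") else None
    (lBound, uBound) match {
      case (Some(l), Some(u)) => s"$l AND $u"
      // the open-ended range also picks up NULL keys, even when the column
      // is declared NOT NULL in the database -- this is the clause we hit
      case (None, Some(u))    => s"$u or $column is null"
      case (Some(l), None)    => l
      case (None, None)       => "1=1"
    }
  }
}

// partitionWhereClauses("l_orderkey", 1L, 6000000001L, 16) produces
// "l_orderkey < 375000001 or l_orderkey is null" for one partition and pure
// range predicates such as
// "l_orderkey >= 1500000001 AND l_orderkey < 1875000001" for the rest.
{code}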
*Note that I have defined the partition key column as NOT NULL in the database.
The SQL for the last partition therefore does not need the NULL check; Spark SQL
should be able to avoid that condition, and this Jira is intended to fix this
behavior.*
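One possible direction for the fix (a sketch of the idea only, not a proposed patch): check the partition column's nullability through standard JDBC metadata and drop the {{or ... is null}} clause when the driver reports the column as NOT NULL. The helper below is hypothetical:
{code:scala}
import java.sql.{Connection, ResultSetMetaData}

// Hypothetical helper: returns true only when the JDBC driver reports the
// partition column as NOT NULL. Table/column names are assumed to be already
// validated and quoted by the caller.
def isColumnNotNull(conn: Connection, table: String, column: String): Boolean = {
  val stmt = conn.prepareStatement(s"SELECT $column FROM $table WHERE 1=0")
  try {
    val meta = stmt.getMetaData   // some drivers return null before execution
    meta != null && meta.isNullable(1) == ResultSetMetaData.columnNoNulls
  } finally {
    stmt.close()
  }
}

// If this returns true for l_orderkey, the open-ended partition could safely
// be generated as "l_orderkey < 375000001" without the IS NULL branch.
{code}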
> Spark SQL JDBC Data Source partitioned read: Spark SQL does not honor the NOT
> NULL table definition of the partition key.
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-31338
> URL: https://issues.apache.org/jira/browse/SPARK-31338
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.5
> Reporter: Mohit Dave
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]