Mohit Dave created SPARK-31338:
----------------------------------
Summary: Spark SQL JDBC Data Source partitioned read : Spark SQL
does not honor for NOT NULL table definition of partition key.
Key: SPARK-31338
URL: https://issues.apache.org/jira/browse/SPARK-31338
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.5
Reporter: Mohit Dave
*Our Use-case Details:*
While reading from a jdbc source using spark sql, we are using below read
format :
jdbc(url: String, table: String, columnName: String, lowerBound: Long,
upperBound: Long, numPartitions: Int, connectionProperties: Properties).
Table defination :
postgres=> \d lineitem_sf1000
Table "public.lineitem_sf1000" Column | Type
| Modifiers
-----------------+-----------------------+-----------
l_orderkey | bigint | not *null* l_partkey | bigint
| not null l_suppkey | bigint | not null
l_linenumber | bigint | not null l_quantity |
numeric(10,2) | not null l_extendedprice | numeric(10,2) | not
null l_discount | numeric(10,2) | not null l_tax |
numeric(10,2) | not null l_returnflag | character varying(1) | not
null l_linestatus | character varying(1) | not null l_shipdate |
character varying(29) | not null l_commitdate | character varying(29) | not
null l_receiptdate | character varying(29) | not null l_shipinstruct |
character varying(25) | not null l_shipmode | character varying(10) | not
null l_comment | character varying(44) | not nullIndexes:
"l_order_sf1000_idx" btree (l_orderkey)
Partition column : l_orderkey
numpartion : 16
*Problem details :*
{code:java}
SELECT
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
FROM (SELECT
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 1500000001 AND
l_orderkey < 1875000001 {code}
15 queries are generated with the above BETWEEN clauses. The last query looks
like this below:
{code:java}
SELECT
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
FROM (SELECT
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 375000001 or
l_orderkey is null {code}
I*n the last query, we are trying to get the remaining records, along with any
data in the table for the partition key having NULL values.*
This hurts performance badly. While the first 15 SQLs took approximately 10
minutes to execute, the last SQL with the NULL check takes 45 minutes because
it has to evaluate a second scan(OR clause) of the table for NULL values for
the partition key.
*Note that I have defined the partition key of the table to be NOT NULL, at the
database. Therefore, the SQL for the last partition need not have this NULL
check, Spark SQl should be able to avoid such condition and this Jira is
intended to fix this behavior.*
{code}
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]