[ https://issues.apache.org/jira/browse/SPARK-31338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212373#comment-17212373 ]
Ankush Chatterjee commented on SPARK-31338:
-------------------------------------------

When a read like this is used:
{code:java}
jdbcRead = spark.read
  .option("fetchsize", fetchSize)
  .jdbc(
    url = s"${connectionURL}",
    table = s"${query}",
    columnName = s"${partKey}",
    lowerBound = lBound,
    upperBound = hBound,
    numPartitions = numParts,
    connectionProperties = connProps)
{code}
Spark generates one query per partition. For the first partition, Spark appends "or $column is null" to the WHERE clause. On some databases this forces a full table scan, with a heavy impact on performance, even when the column is declared NOT NULL.

In JDBCRelation.scala:
{code:java}
while (i < numPartitions) {
  val lBoundValue = boundValueToString(currentValue)
  val lBound = if (i != 0) s"$column >= $lBoundValue" else null

  currentValue += stride
  val uBoundValue = boundValueToString(currentValue)
  val uBound = if (i != numPartitions - 1) s"$column < $uBoundValue" else null

  val whereClause = if (uBound == null) {
    lBound
  } else if (lBound == null) {
    s"$uBound or $column is null"
  } else {
    s"$lBound AND $uBound"
  }
  ans += JDBCPartition(whereClause, i)
  i = i + 1
}
{code}
Would it be feasible to add an option to JDBCOptions to enable/disable adding "or $column is null"? Partitioning on a NOT NULL column is commonplace usage. [~olkuznsmith]

> Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor the
> NOT NULL table definition of the partition key.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31338
>                 URL: https://issues.apache.org/jira/browse/SPARK-31338
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.5
>            Reporter: Mohit Dave
>            Priority: Major
>
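To make the effect of the loop above concrete, here is a self-contained sketch of the same clause-building logic (the column name and bounds are hypothetical, and `boundValueToString` is replaced by plain `Long` formatting):

```scala
object PartitionClauseSketch {
  // Mirrors the WHERE-clause loop in JDBCRelation.scala, simplified:
  // bounds are plain Longs and the stride divides the range evenly.
  def clauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
    val stride = (upper - lower) / numPartitions
    var currentValue = lower
    (0 until numPartitions).map { i =>
      val lBound = if (i != 0) s"$column >= $currentValue" else null
      currentValue += stride
      val uBound = if (i != numPartitions - 1) s"$column < $currentValue" else null
      if (uBound == null) lBound
      else if (lBound == null) s"$uBound or $column is null" // first partition gets the NULL check
      else s"$lBound AND $uBound"
    }
  }

  def main(args: Array[String]): Unit =
    clauses("l_orderkey", 1L, 401L, 4).foreach(println)
}
```

Running this prints four clauses; only the first carries `or l_orderkey is null`, which is the predicate a database cannot satisfy with a pure index range scan.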
> h2. *Our Use-case Details:*
>
> While reading from a JDBC source using Spark SQL, we are using the below read format:
> jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties)
>
> *Table definition:*
> postgres=> \d lineitem_sf1000
> Table "public.lineitem_sf1000"
> Column | Type | Modifiers
> -----------------+-----------------------+-----------
> *l_orderkey | bigint | not null*
> l_partkey | bigint | not null
> l_suppkey | bigint | not null
> l_linenumber | bigint | not null
> l_quantity | numeric(10,2) | not null
> l_extendedprice | numeric(10,2) | not null
> l_discount | numeric(10,2) | not null
> l_tax | numeric(10,2) | not null
> l_returnflag | character varying(1) | not null
> l_linestatus | character varying(1) | not null
> l_shipdate | character varying(29) | not null
> l_commitdate | character varying(29) | not null
> l_receiptdate | character varying(29) | not null
> l_shipinstruct | character varying(25) | not null
> l_shipmode | character varying(10) | not null
> l_comment | character varying(44) | not null
> Indexes:
>     "l_order_sf1000_idx" btree (l_orderkey)
>
> *Partition column*: l_orderkey
> *numPartitions*: 16
>
> h2. *Problem details:*
>
> Each partition issues a query of this form:
> {code:java}
> SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
> FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
> FROM public.lineitem_sf1000) query_alias
> WHERE l_orderkey >= 1500000001 AND l_orderkey < 1875000001
> {code}
> 15 queries are generated with range clauses like the above.
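The boundaries quoted above follow from Spark's constant-stride arithmetic. A minimal sketch, assuming hypothetical read bounds lowerBound = 1 and upperBound = 6000000001 (values inferred here to reproduce the observed stride; they are not stated in the report):

```scala
object StrideSketch {
  val lower = 1L
  val upper = 6000000001L        // hypothetical, chosen to reproduce the observed stride
  val numPartitions = 16

  // Stride between consecutive partition boundaries.
  val stride: Long = (upper - lower) / numPartitions // 375000000

  // Lower boundary of partition i under a constant stride.
  def boundary(i: Int): Long = lower + i.toLong * stride

  def main(args: Array[String]): Unit = {
    println(stride)       // 375000000
    println(boundary(4))  // 1500000001, the lower bound seen in the query above
  }
}
```

Under these assumed bounds, partition 4 starts at 1500000001 and partition 1 at 375000001, matching the WHERE clauses quoted in this report.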
> The last query looks like this:
> {code:java}
> SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
> FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
> FROM public.lineitem_sf1000) query_alias
> WHERE l_orderkey < 375000001 or l_orderkey is null
> {code}
> *In this query we fetch the remaining records, along with any rows whose partition key is NULL.*
> This hurts performance badly. While the first 15 SQLs took approximately 10 minutes to execute, the last SQL with the NULL check takes 45 minutes, because it has to evaluate a second scan (the OR clause) of the table for NULL values of the partition key.
> *Note that the partition key of the table is defined NOT NULL in the database. Therefore, the SQL for the last partition need not include this NULL check; Spark SQL should be able to avoid this condition, and this Jira is intended to fix that behavior.*

--
This message was sent by Atlassian Jira (v8.3.4#803005)
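The requested fix could take the shape of an opt-out flag on the NULL-checked partition's clause. A minimal sketch, assuming a hypothetical JDBCOptions entry (the name `assumeNotNull` is invented for illustration and is not an actual Spark option):

```scala
object NullCheckOptionSketch {
  // Builds the clause for the partition that currently gets the NULL check.
  // When the user declares the partition column NOT NULL via the hypothetical
  // option, the extra "is null" predicate is skipped.
  def nullCheckedClause(column: String, uBound: String, assumeNotNull: Boolean): String =
    if (assumeNotNull) uBound
    else s"$uBound or $column is null" // current Spark behavior

  def main(args: Array[String]): Unit = {
    println(nullCheckedClause("l_orderkey", "l_orderkey < 375000001", assumeNotNull = false))
    println(nullCheckedClause("l_orderkey", "l_orderkey < 375000001", assumeNotNull = true))
  }
}
```

With the flag set, the clause reduces to the plain range predicate, which the index on l_orderkey can serve without a second scan.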