c21 commented on a change in pull request #34291:
URL: https://github.com/apache/spark/pull/34291#discussion_r734830379
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
##########
@@ -298,17 +299,22 @@ private[sql] case class JDBCRelation(
requiredColumns: Array[String],
finalSchema: StructType,
filters: Array[Filter],
- groupByColumns: Option[Array[String]]): RDD[Row] = {
+ groupByColumns: Option[Array[String]],
+ limit: Option[Limit]): RDD[Row] = {
+ // If limit is pushed down, only a limited number of rows will be returned. PartitionInfo will
+ // be ignored and the query will be done in one task.
Review comment:
ah I see, thanks for the explanation @huaxingao. In this case, maybe we can
push the original limit into each partitioned query?
```
SELECT * FROM h2.test.employee WHERE dept < 2 LIMIT 6
SELECT * FROM h2.test.employee WHERE dept >= 2 AND dept < 4 LIMIT 6
SELECT * FROM h2.test.employee WHERE dept >= 4 LIMIT 6
```
Spark will apply the LIMIT again anyway after reading from the JDBC data source,
so there is no correctness problem, and performance will still be better than not
pushing down the limit at all. This is not urgent for this PR anyway.
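A rough sketch of the idea (the names here are hypothetical, not the actual
JDBCRelation API): keep the existing per-partition WHERE clauses and append the
original limit to each partitioned query, then rely on Spark's global LIMIT
after the scan to keep the result correct.
```
// Sketch only: build one JDBC query per partition, each carrying the
// pushed-down limit. Spark still applies the global LIMIT after the scan.
object LimitPushdownSketch {
  def partitionQueries(
      table: String,
      partitionWhereClauses: Seq[String],
      limit: Option[Int]): Seq[String] = {
    partitionWhereClauses.map { where =>
      // Append the original limit to every partition's query, if present.
      val limitClause = limit.map(n => s" LIMIT $n").getOrElse("")
      s"SELECT * FROM $table WHERE $where$limitClause"
    }
  }

  def main(args: Array[String]): Unit = {
    val queries = partitionQueries(
      "h2.test.employee",
      Seq("dept < 2", "dept >= 2 AND dept < 4", "dept >= 4"),
      Some(6))
    // Each partition returns at most 6 rows; the global LIMIT 6 on the
    // Spark side then trims the combined result to the correct size.
    queries.foreach(println)
  }
}
```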