wangyum commented on a change in pull request #31691:
URL: https://github.com/apache/spark/pull/31691#discussion_r587249157



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -613,6 +613,16 @@ object LimitPushDown extends Rule[LogicalPlan] {
         case _ => join
       }
       LocalLimit(exp, newJoin)
+
+    // Adding an extra Limit below WINDOW when there is only one RankLike/RowNumber
+    // window function and partitionSpec is empty.
+    case LocalLimit(limitExpr @ IntegerLiteral(limit),
+        window @ Window(Seq(Alias(WindowExpression(_: RankLike | _: RowNumber,

Review comment:
   OK. We can add support for multiple window functions if the partitionSpec of all window functions is empty and they all use the same ordering. For example (a sketch of the eligibility check follows after the plans below):
   ```scala
   val numRows = 10
   spark.range(numRows).selectExpr("IF (id % 2 = 0, null, id) AS a", s"${numRows} - id AS b", "id AS c").write.saveAsTable("t1")
   spark.sql("SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn, DENSE_RANK() 
OVER(ORDER BY a) AS rk FROM t1 LIMIT 5").explain("cost")
   ```
   Before:
   ```
   GlobalLimit 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
   +- LocalLimit 5, Statistics(sizeInBytes=2.4 KiB)
      +- Window [row_number() windowspecdefinition(a#16L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#11, dense_rank(a#16L) windowspecdefinition(a#16L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#12], [a#16L ASC NULLS FIRST], Statistics(sizeInBytes=2.4 KiB)
         +- Relation default.t1[a#16L,b#17L,c#18L] parquet, Statistics(sizeInBytes=1994.0 B)
   ```
   
   After:
   ```
   Window [row_number() windowspecdefinition(a#16L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#11, dense_rank(a#16L) windowspecdefinition(a#16L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#12], [a#16L ASC NULLS FIRST], Statistics(sizeInBytes=200.0 B)
   +- GlobalLimit 5, Statistics(sizeInBytes=160.0 B, rowCount=5)
      +- LocalLimit 5, Statistics(sizeInBytes=1994.0 B)
         +- Sort [a#16L ASC NULLS FIRST], true, Statistics(sizeInBytes=1994.0 B)
            +- Relation default.t1[a#16L,b#17L,c#18L] parquet, Statistics(sizeInBytes=1994.0 B)
   ```
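   
   A minimal sketch of the eligibility check such a rule could use, assuming the catalyst `Window`/`WindowExpression` APIs referenced in the diff above; `supportsPushDown` is an illustrative helper name, not part of this patch:
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{Alias, RankLike, RowNumber, WindowExpression}
   import org.apache.spark.sql.catalyst.plans.logical.Window
   
   // Illustrative sketch (not the actual patch): extend the single-function case
   // from the diff above to multiple window expressions. Every expression must be
   // a RankLike or RowNumber, every partitionSpec must be empty, and every
   // orderSpec must agree with the Window's own ordering before a limit may be
   // pushed below it.
   def supportsPushDown(window: Window): Boolean = {
     window.partitionSpec.isEmpty && window.windowExpressions.forall {
       case Alias(WindowExpression(_: RankLike | _: RowNumber, spec), _) =>
         spec.partitionSpec.isEmpty && spec.orderSpec == window.orderSpec
       case _ => false
     }
   }
   ```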
   ---
   
   But we do not support pushdown when the partitionSpec of all window functions is empty but the orderings differ:
   ```scala
   spark.sql("SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn, DENSE_RANK() 
OVER(ORDER BY b) AS dr, RANK() OVER(ORDER BY c) AS rk FROM t1 LIMIT 
5").explain("cost")
   ```
   
   ```
   GlobalLimit 5, Statistics(sizeInBytes=220.0 B, rowCount=5)
   +- LocalLimit 5, Statistics(sizeInBytes=2.7 KiB)
      +- Window [rank(c#21L) windowspecdefinition(c#21L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#13], [c#21L ASC NULLS FIRST], Statistics(sizeInBytes=2.7 KiB)
         +- Window [dense_rank(b#20L) windowspecdefinition(b#20L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS dr#12], [b#20L ASC NULLS FIRST], Statistics(sizeInBytes=2.4 KiB)
            +- Window [row_number() windowspecdefinition(a#19L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#11], [a#19L ASC NULLS FIRST], Statistics(sizeInBytes=2.2 KiB)
               +- Relation default.t1[a#19L,b#20L,c#21L] parquet, Statistics(sizeInBytes=1994.0 B)
   ```
   
   We would have to push the limit down to the last Window if we want to support this case, because pushing the limit down to the first Window cannot benefit the query.



