wangyum opened a new pull request, #37140:
URL: https://github.com/apache/spark/pull/37140

   ### What changes were proposed in this pull request?
   
   This PR changes the default value of 
`spark.sql.execution.topKSortFallbackThreshold` to 800000, based on the benchmark below.
   
   Benchmark code:
   ```sql
   create table benchmark_limit using parquet as
   select id as id, id as a, id as b, id as c, id as d from range(21474836320L);

   select * from benchmark_limit order by id limit limit_value;
   ```
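
   For reference, the threshold can also be overridden per session via a `SET` statement, so users who hit a regression can tune it without a restart (the value shown here is just the proposed default, repeated for illustration):
   ```sql
   SET spark.sql.execution.topKSortFallbackThreshold=800000;
   ```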
   Benchmark config:
   ```
   spark.driver.memory       60g
   spark.executor.memory     45g
   spark.executor.instances  100
   spark.executor.cores      18
   spark.default.parallelism 300
   ```
   
   Benchmark result:
   
   limit_value | shuffle+sort (seconds) | top-k (seconds)
   -- | -- | --
   100000 | 63.782  |  47.244
   500000 | 152.511 | 118.333
   800000 | 159.539 | 154.795
   900000 | 160.903 | 187.001
   1000000 | 162.798 | 338.632
   5000000 | 256.813 |  660+
   
   
   ### Why are the changes needed?
   
   `TakeOrderedAndProject` does not always have a benefit, especially when the limit 
is a large number, as the benchmark above shows.
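
   The intuition can be sketched with a minimal Python analogy (this is not Spark's actual implementation; `top_k_heap` and `sort_then_limit` are hypothetical names standing in for the two physical plans):

   ```python
   import heapq
   import random

   def top_k_heap(rows, k):
       # Bounded-heap top-k, analogous to TakeOrderedAndProject: each
       # partition keeps only the k smallest rows, roughly O(n log k),
       # but the k retained rows per partition must still be merged.
       return heapq.nsmallest(k, rows)

   def sort_then_limit(rows, k):
       # Full sort followed by a limit, analogous to the shuffle+sort
       # plan: O(n log n), but its cost grows slowly with k.
       return sorted(rows)[:k]

   # Both strategies produce the same answer; which is faster depends
   # on how large k is relative to the input, which is what the
   # fallback threshold decides.
   rows = [random.randrange(10**9) for _ in range(100_000)]
   assert top_k_heap(rows, 100) == sort_then_limit(rows, 100)
   ```

   As `k` approaches the input size, the per-partition heaps retain almost everything, so the top-k plan loses its advantage, which matches the benchmark's crossover around 800000.
   
   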
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   N/A.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

