wangyum opened a new pull request, #37140: URL: https://github.com/apache/spark/pull/37140
### What changes were proposed in this pull request?

This PR changes the default value of `spark.sql.execution.topKSortFallbackThreshold` to 800000, based on the benchmark below.

Benchmark code:
```sql
create table benchmark_limit using parquet as
select id as id, id as a, id as b, id as c, id as d
from range(21474836320L);

select * from benchmark_limit order by id limit limit_value;
```

Benchmark config:
```
spark.driver.memory 60g
spark.executor.memory 45g
spark.executor.instances 100
spark.executor.cores 18
spark.default.parallelism 300
```

Benchmark result:

| limit_value | shuffle+sort (seconds) | top-k (seconds) |
| -- | -- | -- |
| 100000 | 63.782 | 47.244 |
| 500000 | 152.511 | 118.333 |
| 800000 | 159.539 | 154.795 |
| 900000 | 160.903 | 187.001 |
| 1000000 | 162.798 | 338.632 |
| 5000000 | 256.813 | 660+ |

### Why are the changes needed?

`TakeOrderedAndProject` does not always have a benefit, especially when the limit is a large number.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
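The trade-off behind the threshold can be illustrated outside Spark: the top-k path keeps only `limit` rows in a bounded heap (roughly O(n log k) work plus the memory for k rows), so its cost grows with the limit, while a full sort followed by a limit has a cost that is largely independent of the limit. A minimal Python sketch of the two strategies (illustrative only, not Spark code; the function names are made up for this example):

```python
import heapq
import random

def top_k(rows, k):
    # Bounded-heap strategy, analogous in spirit to Spark's
    # TakeOrderedAndProject: only k rows are retained at a time.
    return heapq.nsmallest(k, rows)

def full_sort_limit(rows, k):
    # Full-sort strategy, analogous to the shuffle+sort path:
    # order the entire input, then take the first k rows.
    return sorted(rows)[:k]

# Both strategies return the same answer; they differ only in cost.
random.seed(0)
data = [random.randrange(10**6) for _ in range(10**5)]
assert top_k(data, 1000) == full_sort_limit(data, 1000)
```

As k approaches n, the bounded heap loses its advantage, which matches the benchmark above where top-k is faster at small limits but slower past roughly 800000 rows.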
