Rachelint commented on PR #15591: URL: https://github.com/apache/datafusion/pull/15591#issuecomment-2853367261
@jayzhan211 @alamb I guess I nearly understand why It is related to `target_partitions`, we maintain the `intermediate results` respectively in each `partition`. So when `target_partitions` is larger, amount of `intermediate results` in one `partition` is smaller. And when `intermediate results` assigned to `partition` is small enough, the blocked approach will almost make no difference (assume the extreme case, the `Vec` used to store results never resize). Here is my benchmark result in a production machine (x86_64 + 16core + 3699.178CPU MHz + 64G RAM) with different `target_partitions` - target_partitions = 4 (amazing improvement!) ``` // blocked approach Q7: SELECT "WatchID", MIN("ResolutionWidth") as wmin, MAX("ResolutionWidth") as wmax, SUM("IsRefresh") as srefresh FROM hits GROUP BY "WatchID" ORDER BY "WatchID" DESC LIMIT 10; Query 7 iteration 0 took 4996.7 ms and returned 10 rows Query 7 iteration 1 took 5177.6 ms and returned 10 rows Query 7 iteration 2 took 5335.4 ms and returned 10 rows Query 7 iteration 3 took 5232.7 ms and returned 10 rows Query 7 iteration 4 took 5350.5 ms and returned 10 rows Query 7 avg time: 5218.59 ms // main Q7: SELECT "WatchID", MIN("ResolutionWidth") as wmin, MAX("ResolutionWidth") as wmax, SUM("IsRefresh") as srefresh FROM hits GROUP BY "WatchID" ORDER BY "WatchID" DESC LIMIT 10; Query 7 iteration 0 took 8484.9 ms and returned 10 rows Query 7 iteration 1 took 8194.4 ms and returned 10 rows Query 7 iteration 2 took 8317.9 ms and returned 10 rows Query 7 iteration 3 took 8183.2 ms and returned 10 rows Query 7 iteration 4 took 8225.7 ms and returned 10 rows Query 7 avg time: 8281.19 ms ``` - target_partitions = 8 (emmm... some improvement) ``` // blocked approach Q7: SELECT "WatchID", MIN("ResolutionWidth") as wmin, MAX("ResolutionWidth") as wmax, SUM("IsRefresh") as srefresh FROM hits GROUP BY "WatchID" ORDER BY "WatchID" DESC LIMIT 10; Query 7 iteration 0 took 2599.2 ms and returned 10 rows Query 7 iteration 1 took 2857.9 ms and returned 10 rows Query 7 iteration 2 took 3046.3 ms and returned 10 rows Query 7 iteration 3 took 2766.0 ms and returned 10 rows Query 7 iteration 4 took 2830.8 ms and returned 10 rows Query 7 avg time: 2820.04 ms // main Q7: SELECT "WatchID", MIN("ResolutionWidth") as wmin, MAX("ResolutionWidth") as wmax, SUM("IsRefresh") as srefresh FROM hits GROUP BY "WatchID" ORDER BY "WatchID" DESC LIMIT 10; Query 7 iteration 0 took 3820.0 ms and returned 10 rows Query 7 iteration 1 took 3716.6 ms and returned 10 rows Query 7 iteration 2 took 3728.3 ms and returned 10 rows Query 7 iteration 3 took 3628.2 ms and returned 10 rows Query 7 iteration 4 took 3912.6 ms and returned 10 rows Query 7 avg time: 3761.15 ms ``` - target_partitions = 32 (sadly, almost no improvement...) ``` // blocked approach Q7: SELECT "WatchID", MIN("ResolutionWidth") as wmin, MAX("ResolutionWidth") as wmax, SUM("IsRefresh") as srefresh FROM hits GROUP BY "WatchID" ORDER BY "WatchID" DESC LIMIT 10; Query 7 iteration 0 took 1383.7 ms and returned 10 rows Query 7 iteration 1 took 1274.3 ms and returned 10 rows Query 7 iteration 2 took 1321.7 ms and returned 10 rows Query 7 iteration 3 took 1308.1 ms and returned 10 rows Query 7 iteration 4 took 1310.6 ms and returned 10 rows Query 7 avg time: 1319.68 ms // main Q7: SELECT "WatchID", MIN("ResolutionWidth") as wmin, MAX("ResolutionWidth") as wmax, SUM("IsRefresh") as srefresh FROM hits GROUP BY "WatchID" ORDER BY "WatchID" DESC LIMIT 10; Query 7 iteration 0 took 1440.7 ms and returned 10 rows Query 7 iteration 1 took 1430.3 ms and returned 10 rows Query 7 iteration 2 took 1357.5 ms and returned 10 rows Query 7 iteration 3 took 1352.6 ms and returned 10 rows Query 7 iteration 4 took 1342.2 ms and returned 10 rows Query 7 avg time: 1384.64 ms ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org