cetra3 opened a new pull request, #20381: URL: https://github.com/apache/datafusion/pull/20381
## Which issue does this PR close? Sort of addresses https://github.com/apache/datafusion/issues/19386 ## Rationale for this change Compaction in TopK values currently uses some hard set heuristics to decide when to compact. Instead we can use the memory size of the batches as a bound. ## What changes are included in this PR? Adjusts TopK compaction to compact more aggressively, based upon memory size. ## Are these changes tested? Yes and a test has been added. ## Are there any user-facing changes? No ## Benchmarks I'm struggling (in this PR and other PRs...) to get some good reliable benchmarks through. However with this one it seems like it's *mostly* the same speed as `main` or a little faster in some cases: ``` + critcmp main topk_memory_batch group main topk_memory_batch ----- ---- ----------------- aggregate 10000000 time-series rows 1.05 45.5±2.13ms ? ?/sec 1.00 43.5±2.09ms ? ?/sec aggregate 10000000 worst-case rows 1.06 43.9±1.12ms ? ?/sec 1.00 41.5±1.03ms ? ?/sec distinct 10000000 rows asc [TopK] 1.00 4.3±0.25ms ? ?/sec 1.00 4.3±0.07ms ? ?/sec distinct 10000000 rows asc [no TopK] 1.00 42.9±1.83ms ? ?/sec 1.06 45.5±1.65ms ? ?/sec distinct 10000000 rows desc [TopK] 1.00 4.3±0.07ms ? ?/sec 1.01 4.3±0.06ms ? ?/sec distinct 10000000 rows desc [no TopK] 1.00 42.4±1.48ms ? ?/sec 1.05 44.7±1.88ms ? ?/sec top k=10 aggregate 10000000 time-series rows 1.13 10.6±0.54ms ? ?/sec 1.00 9.4±0.64ms ? ?/sec top k=10 aggregate 10000000 time-series rows [Utf8View] 1.08 11.0±0.65ms ? ?/sec 1.00 10.2±0.48ms ? ?/sec top k=10 aggregate 10000000 worst-case rows 1.04 16.8±1.47ms ? ?/sec 1.00 16.1±1.44ms ? ?/sec top k=10 aggregate 10000000 worst-case rows [Utf8View] 1.10 17.9±1.51ms ? ?/sec 1.00 16.2±1.16ms ? ?/sec top k=10 string aggregate 10000000 time-series rows [Utf8View] 1.11 9.1±0.36ms ? ?/sec 1.00 8.2±0.35ms ? ?/sec top k=10 string aggregate 10000000 time-series rows [Utf8] 1.12 7.8±0.48ms ? ?/sec 1.00 6.9±0.26ms ? ?/sec top k=10 string aggregate 10000000 worst-case rows [Utf8View] 1.06 8.6±0.21ms ? ?/sec 1.00 8.1±0.20ms ? ?/sec top k=10 string aggregate 10000000 worst-case rows [Utf8] 1.07 7.4±0.17ms ? ?/sec 1.00 6.9±0.13ms ? ?/sec ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
