Dandandan commented on PR #21654: URL: https://github.com/apache/datafusion/pull/21654#issuecomment-4255029687
> Thanks for tackling this @Dandandan! It was on my to-do list. We have a workload in Comet that spends the vast majority of its time in the hash agg just resizing and rehashing (almost 1 billion unique values), so I wanted to take a look at using Statistics to preallocate. The description seems perfect to me!
>
> [Screenshot 2026-04-15 at 3 48 02 PM]

Yeah, hash table resizing is a pretty costly one (probably in terms of cache misses / branch mispredicts), especially as the table grows larger.

> 1 billion unique values

As there is also a risk of estimating the NDV too high, I added a cap of 128K rows (I think it should be configurable). We should also remove the upper limit if the NDV stats are exact.
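A minimal sketch of the capping rule described in the comment, assuming the NDV statistic arrives as exact, estimated, or unknown. The `Ndv` enum, `initial_capacity`, `preallocated_table`, and `DEFAULT_MAX_PREALLOC` are hypothetical names for illustration and are not taken from the PR; 128K is read here as 128 * 1024, which is also an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an NDV (number of distinct values) statistic.
/// This is not DataFusion's actual statistics type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Ndv {
    /// The distinct count is known exactly.
    Exact(usize),
    /// The distinct count is an estimate and may be far too high.
    Estimated(usize),
    /// No statistic is available.
    Unknown,
}

/// Assumed default cap for inexact estimates (128K entries).
pub const DEFAULT_MAX_PREALLOC: usize = 128 * 1024;

/// Pick an initial hash table capacity from the NDV statistic.
/// Estimates are clamped to `max_prealloc`; exact values are trusted as-is.
pub fn initial_capacity(ndv: Ndv, max_prealloc: usize) -> usize {
    match ndv {
        Ndv::Exact(n) => n,
        Ndv::Estimated(n) => n.min(max_prealloc),
        Ndv::Unknown => 0,
    }
}

/// Build a group table that is preallocated according to the rule above.
pub fn preallocated_table(ndv: Ndv, max_prealloc: usize) -> HashMap<u64, u64> {
    HashMap::with_capacity(initial_capacity(ndv, max_prealloc))
}
```

Passing `max_prealloc` as a parameter mirrors the suggestion that the cap should be configurable. An overestimated NDV then costs at most one capped allocation, while an exact NDV lets the table skip resizing entirely.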
