sunchao commented on issue #4973: URL: https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1460948372
I'm actually working on a POC to improve hash aggregation performance, following a very similar approach. The only difference is that I'm not using `Rows` in the `update_batch` API, but rather the row format defined in DF. The `Rows` format in `arrow-rs` seems to incur extra cost because it is designed for sorting and must preserve ordering, and that cost is especially high for dictionary-encoded arrays. The approach requires quite a few API changes. I saw a big improvement for simple cases at least, though I haven't run comprehensive benchmarks yet.
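To illustrate the point about order preservation, here is a minimal std-only sketch. It is not the actual `arrow-rs` or DF row format. The names `encode_ordered`, `encode_plain`, and `sum_by_key` are hypothetical and exist only for this example. It shows that a sort-friendly key encoding needs an extra per-value transform, while hash grouping only needs equal keys to map to equal bytes.

```rust
use std::collections::HashMap;

/// Hypothetical order-preserving encoding of an i32: flip the sign bit and
/// write big-endian, so byte-wise comparison matches numeric comparison.
fn encode_ordered(v: i32) -> [u8; 4] {
    ((v as u32) ^ 0x8000_0000).to_be_bytes()
}

/// Hypothetical plain encoding: native-endian bytes with no transform.
/// Equal values give equal bytes, but byte order is not numeric order.
fn encode_plain(v: i32) -> [u8; 4] {
    v.to_ne_bytes()
}

/// Sum `values` per group, using the encoded key bytes as the hash map key.
/// Grouping only relies on key equality, so either encoding works here.
fn sum_by_key(
    keys: &[i32],
    values: &[i64],
    encode: fn(i32) -> [u8; 4],
) -> HashMap<[u8; 4], i64> {
    let mut groups = HashMap::new();
    for (k, v) in keys.iter().zip(values) {
        *groups.entry(encode(*k)).or_insert(0) += *v;
    }
    groups
}
```

Both encodings produce the same groups and sums. The ordered one just does work that hashing never uses, which is presumably where the extra cost would come from in a real implementation.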
