james727 opened a new pull request #1511: URL: https://github.com/apache/arrow-datafusion/pull/1511
👋 hi all - first PR here.

# Which issue does this PR close?

This closes https://github.com/apache/arrow-datafusion/issues/1323.

# Rationale for this change

This provides an efficient way to aggregate unique values into an array. This is beneficial for aggregating low-cardinality fields, where `array_agg` may require significantly more memory than `set_agg`.

I mainly implemented this as a way to get familiar with the codebase. That said, I'm not 100% sure merging this makes sense if the goal of the project is to be as Postgres-like as possible. `set_agg` is supported by Presto (as linked in the issue above) and other DBMSs, but Postgres supports neither `set_agg` nor `array_distinct`. The recommended approach to something like `set_agg` in Postgres is to use `array_agg`, unnest the values, select distinct, then use `array_agg` again on the output.

Even so, this does seem generally useful, and the Postgres approach is less efficient (and likely unworkable for certain datasets in a distributed environment). I'm interested to hear feedback from the maintainers on the above.

# What changes are included in this PR?

This includes the implementation of `set_agg` and a couple of tests. It borrows heavily from the patch that implemented `array_agg`: https://github.com/apache/arrow-datafusion/pull/1300

# Open questions

There are a couple of specific points I could use feedback on:

- Using `hashbrown::HashSet` for the accumulator instead of `std::collections::HashSet` - it seems this is preferred in the codebase.
- Tests - this was actually the most difficult part of writing this, as the output ordering of `set_agg` is nondeterministic. I managed to hack it together, but I'm sure there's an easier way (for both integration and unit tests).
- Documentation - in general, what docs need to be updated? And given this is a divergence from Postgres, is there anywhere specific this should be called out?

-- This is an automated message from the Apache Git Service.
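For readers following the accumulator and test-ordering questions in the PR description, here is a minimal sketch of the idea. It is not the code in this PR. `SetAggSketch`, its methods, and the `sorted` helper are hypothetical names, the value type is fixed to `i64` for brevity, and `std::collections::HashSet` is used only so the snippet compiles without external crates (the PR itself mentions `hashbrown::HashSet`).

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a distinct-value accumulator.
/// This is NOT DataFusion's actual accumulator interface.
#[derive(Debug, Default)]
struct SetAggSketch {
    seen: HashSet<i64>,
}

impl SetAggSketch {
    /// Fold one batch of input values into the set; duplicates are dropped,
    /// so memory grows with the number of distinct values, not rows.
    fn update_batch(&mut self, values: &[i64]) {
        self.seen.extend(values.iter().copied());
    }

    /// Combine the partial state produced by another partition.
    fn merge(&mut self, other: &SetAggSketch) {
        self.seen.extend(other.seen.iter().copied());
    }

    /// Produce the final array. Hash-set iteration order is unspecified,
    /// which is why the output ordering is nondeterministic.
    fn evaluate(&self) -> Vec<i64> {
        self.seen.iter().copied().collect()
    }
}

/// Test helper: sort the result so assertions do not depend on set order.
fn sorted(mut values: Vec<i64>) -> Vec<i64> {
    values.sort_unstable();
    values
}

/// Example: two partitions aggregated separately, then merged.
fn example() -> Vec<i64> {
    let mut left = SetAggSketch::default();
    left.update_batch(&[1, 2, 2, 3]);

    let mut right = SetAggSketch::default();
    right.update_batch(&[3, 4, 4]);

    left.merge(&right);
    sorted(left.evaluate())
}
```

Sorting before comparing is one simple way to make such tests deterministic. Whether it fits the project's existing test helpers is a question for the maintainers.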
