Github user hvanhovell commented on the pull request:
https://github.com/apache/spark/pull/11688#issuecomment-196382153
@mccheah I am fully on board with adding a better
`collect_list`/`collect_set` implementation. I recently submitted a similar PR
(using mutable `ArrayData` classes):
https://github.com/apache/spark/pull/11004.
I have one question: what is the exact use case for this? Collecting
elements into a dimension of some sort, or turning a flat DataFrame into a
hierarchical one? For the latter it might be better to create a custom
physical operator, much like the current
`org.apache.spark.sql.execution.MapGroups` class.
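
For the hierarchical case, here is a minimal sketch of what I mean (assuming a modern `SparkSession` bootstrap and a made-up orders dataset; this is not code from either PR):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

object FlatToNested {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flat-to-nested")
      .getOrCreate()
    import spark.implicits._

    // Flat representation: one row per order line.
    val flat = Seq(
      (1, "apple", 2),
      (1, "pear", 1),
      (2, "apple", 5)
    ).toDF("order_id", "item", "qty")

    // Hierarchical representation: one row per order,
    // with the order lines nested in an array of structs.
    val nested = flat
      .groupBy($"order_id")
      .agg(collect_list(struct($"item", $"qty")).as("lines"))

    nested.show(truncate = false)
    spark.stop()
  }
}
```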
As for your PR: the current GC churn will be quite high, because you
create a new `Array` and `ArrayData` instance on each update (that is why I
created custom `ArrayData` classes). A way to counter this would be to take
the current `HiveUDAFFunction` approach: disable `supportsPartial` (which you
already do) and not use the buffer at all (this banks on the aggregation
operator being sort-based).
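
To make the churn concrete, here is a hypothetical contrast between the two update strategies (illustrative only, not your PR's actual code):

```scala
import scala.collection.mutable.ArrayBuffer

object UpdateStrategies {
  // Copy-on-update: each call allocates a fresh array (plus, in the PR,
  // a new ArrayData wrapper), so n updates allocate O(n) short-lived
  // objects that the GC has to clean up.
  def updateByCopy(buffer: Array[Any], value: Any): Array[Any] = {
    val grown = new Array[Any](buffer.length + 1)
    System.arraycopy(buffer, 0, grown, 0, buffer.length)
    grown(buffer.length) = value
    grown
  }

  // Mutable buffer: appends are amortized O(1) against a single growing
  // backing array, which is the idea behind a mutable ArrayData class.
  def updateInPlace(buffer: ArrayBuffer[Any], value: Any): ArrayBuffer[Any] = {
    buffer += value
    buffer
  }
}
```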
As for performance: I am curious to see the differences between the
declarative, imperative, and Hive approaches. I cannot imagine the imperative
approach being that much slower, because we don't do type conversions for
imperative operators (only if you use an actual UDAF).
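
A rough way to eyeball the difference would be a local timing harness like the sketch below (the dataset and sizes are made up, and this is not a rigorous benchmark):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

object CollectListTiming {
  // Wall-clock timing of a by-name block; prints elapsed milliseconds.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect_list-timing")
      .getOrCreate()

    // 1M rows spread over 1K groups; count() forces plan evaluation.
    val df = spark.range(1000000L).selectExpr("id % 1000 AS key", "id AS value")
    time("collect_list") {
      df.groupBy("key").agg(collect_list("value")).count()
    }
    spark.stop()
  }
}
```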