Github user hvanhovell commented on the pull request:
https://github.com/apache/spark/pull/11688#issuecomment-196382153
@mccheah I am fully on board with adding a better
`collect_list`/`collect_set` implementation. I recently submitted a similar PR
(using mutable `ArrayData` classes):
https://github.com/apache/spark/pull/11004.
I have one question: what is the exact use case for this? Collecting
elements into a dimension of some sort, or turning a flat DataFrame into a
hierarchical one? For the latter it might be better to create a custom
physical operator, much like the current
`org.apache.spark.sql.execution.MapGroups` class.
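
For the hierarchical case, here is a minimal sketch of what I mean (assuming a modern `SparkSession` bootstrap and a made-up orders dataset; this is not code from either PR):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

object FlatToNested {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flat-to-nested")
      .getOrCreate()
    import spark.implicits._

    // Flat representation: one row per order line.
    val flat = Seq(
      (1, "apple", 2),
      (1, "pear", 1),
      (2, "apple", 5)
    ).toDF("order_id", "item", "qty")

    // Hierarchical representation: one row per order,
    // with the order lines nested in an array of structs.
    val nested = flat
      .groupBy($"order_id")
      .agg(collect_list(struct($"item", $"qty")).as("lines"))

    nested.show(truncate = false)
    spark.stop()
  }
}
```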
As for your PR: the current GC churn will be quite high, because you
create a new `Array` and `ArrayData` instance on each update (that is why I
created custom `ArrayData` classes). A way to counter this would be to take
the current `HiveUDAFFunction` approach: disable `supportsPartial` (which you
already do) and not use the buffer at all (this banks on the aggregation
operator being sort-based).
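
To make the churn concrete, here is a hypothetical contrast between the two update strategies (illustrative only, not your PR's actual code):

```scala
import scala.collection.mutable.ArrayBuffer

object UpdateStrategies {
  // Copy-on-update: each call allocates a fresh array (plus, in the PR,
  // a new ArrayData wrapper), so n updates allocate O(n) short-lived
  // objects that the GC has to clean up.
  def updateByCopy(buffer: Array[Any], value: Any): Array[Any] = {
    val grown = new Array[Any](buffer.length + 1)
    System.arraycopy(buffer, 0, grown, 0, buffer.length)
    grown(buffer.length) = value
    grown
  }

  // Mutable buffer: appends are amortized O(1) against a single growing
  // backing array, which is the idea behind a mutable ArrayData class.
  def updateInPlace(buffer: ArrayBuffer[Any], value: Any): ArrayBuffer[Any] = {
    buffer += value
    buffer
  }
}
```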
As for performance: I am curious to see the differences between the
declarative, imperative, and Hive approaches. I cannot imagine the imperative
approach being that much slower, because we don't do type conversions for
imperative operators (only if you use an actual UDAF).
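
A rough way to eyeball the difference would be a local timing harness like the sketch below (the dataset and sizes are made up, and this is not a rigorous benchmark):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

object CollectListTiming {
  // Wall-clock timing of a by-name block; prints elapsed milliseconds.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect_list-timing")
      .getOrCreate()

    // 1M rows spread over 1K groups; count() forces plan evaluation.
    val df = spark.range(1000000L).selectExpr("id % 1000 AS key", "id AS value")
    time("collect_list") {
      df.groupBy("key").agg(collect_list("value")).count()
    }
    spark.stop()
  }
}
```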