Github user nburoojy commented on the pull request:

    https://github.com/apache/spark/pull/8592#issuecomment-140436077
  
    @rxin the OOM concern is valid, but I see it as distinct from the 
performance concerns I have.
    
    In theory, a collect_list over an ungrouped dataset with 10M non-null rows 
will return a single row containing an array of 10M elements. In practice, that 
10M-element array may require more memory than the heap available on a single 
node, and the process will OOM. I think this is a fundamental limitation of the 
collect_list and collect_set interface; just like the RDD functions `collect` 
and `CollectByKey`, we can't return bulk data through a single node. Given 
that, I think the right solution (as you mentioned) is to document the caveat 
that the resulting arrays should be limited in size. Where should I document 
these functions?
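
    To make the ungrouped case concrete, here is a minimal sketch of the kind 
of query I mean. It assumes `collect_list` is exposed through 
`org.apache.spark.sql.functions` (the API this PR is working toward), so treat 
it as illustrative rather than something runnable against master today:

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.collect_list

    object UngroupedCollectSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[*]").setAppName("collect-sketch"))
        val sqlContext = new SQLContext(sc)

        // 10M non-null rows and no groupBy: the aggregation collapses to a
        // single row whose array column holds all 10M elements, so that one
        // row (and the task producing it) must fit the whole array in memory.
        val df = sqlContext.range(0L, 10000000L)
        val collected = df.agg(collect_list(df("id")).as("ids"))

        collected.show()
        sc.stop()
      }
    }
    ```

    With a groupBy the same risk applies per group, which is why the caveat 
should be phrased in terms of the size of each resulting array.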
    
    The performance issue I see is that we allocate a new array for each 
element in `updateExpressions`. That could generate a lot of short-lived 
garbage, and the resulting GC pressure could significantly slow things down. 
This is the cost I wanted to measure before writing an ArrayListUDT to reuse 
lists, which I think would be a good next step once the core `collect_set` and 
`collect_list` functionality is merged.
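
    For illustration only (this is not the PR's code, and the names here are 
made up), the difference I want to measure is roughly the one between these 
two update strategies:

    ```scala
    import scala.collection.mutable.ArrayBuffer

    // Copy-on-update, as in a purely expression-based buffer: appending the
    // n-th element allocates a fresh array of size n, so accumulating n
    // elements allocates O(n^2) cells and leaves a trail of dead arrays
    // for the garbage collector.
    def updateByCopy(acc: Array[Long], value: Long): Array[Long] =
      acc :+ value

    // Buffer reuse, the kind of thing an ArrayListUDT would enable: the same
    // mutable buffer instance grows in place with amortized O(1) appends.
    def updateInPlace(acc: ArrayBuffer[Long], value: Long): ArrayBuffer[Long] = {
      acc += value
      acc
    }
    ```

    Timing a collect over a few million elements with each strategy should 
show whether the copy-per-element overhead and the GC pressure are visible in 
practice.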
    
    I'm certainly not a Spark SQL expert, so I'm not sure I fully understand 
your concerns or your ideas for improving performance. Could an operator-based 
collect_* produce the same single-row array result as the aggregation-based 
collect_* functions while avoiding OOM issues on the worker nodes?


