Github user nburoojy commented on the pull request:
https://github.com/apache/spark/pull/8592#issuecomment-140436077
@rxin the OOM concern is valid, but I see it as distinct from the
performance concerns I have.
In theory, a collect_list over an ungrouped dataset with 10M non-null rows
will return a single row containing an array of 10M elements. In practice, an
array of 10M elements may require more memory than the heap available on a single
node, and the process will OOM. I think this is a fundamental limitation of the
collect_list and collect_set interface; just as with the RDD functions
`collect` and `CollectByKey`, we can't return bulk data through a single node.
Based on this, I think the right solution (as you mentioned) is to document the
caveat that the resulting arrays should be limited in size. Where should I
document these functions?
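To make the scenario concrete, here is a minimal sketch of the ungrouped case. The `SparkSession` entry point and the `CollectListCaveat` object name are assumptions for illustration only, not part of this change:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

object CollectListCaveat {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collect_list caveat")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // 10M non-null values; with no GROUP BY, collect_list folds them all
    // into one array inside a single result row.
    val df = spark.range(0L, 10000000L).toDF("id")
    val collected = df.agg(collect_list($"id").as("ids"))

    // Materializing this row requires the whole 10M-element array to fit
    // in one JVM heap, which is where the OOM risk comes from.
    collected.show()

    spark.stop()
  }
}
```

The point is only that the single output row must hold the full array, so it has to fit in one task's (and ultimately the driver's) heap regardless of how the aggregation is implemented.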
The performance issue I see is that we allocate a new array for each element
in `updateExpressions`. This could produce a large volume of short-lived garbage,
and the resulting GC could significantly slow things down. This is the cost I
wanted to measure before writing an ArrayListODT to reuse lists. I think that would
be a good next step once the core `collect_set` and `collect_list` functionality is
merged.
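As an illustration of the allocation pattern I mean, here is a hypothetical sketch in plain Scala. It is not Spark's aggregation buffer code; names like `appendByCopy` are made up purely for the comparison:

```scala
import scala.collection.mutable.ArrayBuffer

object BufferStrategies {
  // Strategy 1: allocate a fresh array on every element, roughly what an
  // expression-based update does when the buffer only holds immutable arrays.
  // Each update copies the previous contents, so n elements produce n
  // short-lived arrays for the GC to clean up.
  def appendByCopy(buffer: Array[Long], value: Long): Array[Long] = {
    val next = new Array[Long](buffer.length + 1)
    System.arraycopy(buffer, 0, next, 0, buffer.length)
    next(buffer.length) = value
    next
  }

  // Strategy 2: reuse one growable buffer per group, which is what a
  // list-backed buffer type would enable; updates amortize to O(1) and
  // generate far less garbage.
  def appendInPlace(buffer: ArrayBuffer[Long], value: Long): ArrayBuffer[Long] = {
    buffer += value
    buffer
  }

  def main(args: Array[String]): Unit = {
    val values = 0L until 100000L

    var copied = Array.empty[Long]
    values.foreach(v => copied = appendByCopy(copied, v))

    val reused = ArrayBuffer.empty[Long]
    values.foreach(v => appendInPlace(reused, v))

    assert(copied.length == reused.length)
    println(s"both strategies collected ${copied.length} elements")
  }
}
```

The copy-per-element strategy is the behavior I want to benchmark; the in-place strategy is what reusing lists would buy us.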
I'm certainly not a Spark SQL expert, so I'm not sure I fully understand
your concerns or your ideas for performance improvement. Could an operator-based
collect_* produce the same array-in-a-single-row results as the
aggregation-based collect_* functions while avoiding OOM issues on the worker
nodes?