Github user davies commented on the pull request:
https://github.com/apache/spark/pull/3198#issuecomment-62487004
As we discussed in PR #1977, if we want to keep the same semantics as
groupByKey(), users should be able to access the results in any pattern, as
before (for example, fetch a key and its values, put them into a list, then
return them). Also, the returned RDD should be cacheable.
To do this, we need an ExternalIterator (similar to ExternSorter) that spills
the values to disk when they are too large to hold in memory.
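
To make the idea concrete, here is a rough sketch of a values container that
spills to disk and can still be iterated more than once. It is not the code in
this PR or anything in Spark: the class name SpillableValues, the memory_limit
parameter, and the length-prefixed pickle file format are all assumptions made
up for illustration.

```python
import os
import pickle
import struct
import tempfile


class SpillableValues:
    """Hypothetical container (not a Spark class).

    Keeps the first `memory_limit` values in a list and appends the rest
    to a temporary file as length-prefixed pickles. It can be iterated
    any number of times, so callers can use it like a list of values.
    """

    def __init__(self, memory_limit=1000):
        self.memory_limit = memory_limit  # assumed: a count of items, not bytes
        self._in_memory = []
        self._spill_file = None
        self._spilled = 0

    def append(self, value):
        if len(self._in_memory) < self.memory_limit:
            self._in_memory.append(value)
            return
        if self._spill_file is None:
            # Anonymous temp file, deleted automatically when closed.
            self._spill_file = tempfile.TemporaryFile()
        payload = pickle.dumps(value)
        self._spill_file.seek(0, os.SEEK_END)
        self._spill_file.write(struct.pack(">I", len(payload)))
        self._spill_file.write(payload)
        self._spilled += 1

    def __len__(self):
        return len(self._in_memory) + self._spilled

    def __iter__(self):
        # In-memory values first, in insertion order.
        for value in self._in_memory:
            yield value
        # Then the spilled values. The offset is tracked per iterator so
        # that several iterations do not disturb each other.
        offset = 0
        for _ in range(self._spilled):
            self._spill_file.seek(offset)
            (size,) = struct.unpack(">I", self._spill_file.read(4))
            value = pickle.loads(self._spill_file.read(size))
            offset += 4 + size
            yield value


values = SpillableValues(memory_limit=2)
for i in range(5):
    values.append(i)

# The caller can walk the values, copy them into a list, and do so again.
as_list = list(values)
again = list(values)
```

A real implementation would more likely decide when to spill from measured
memory usage than from an item count, and would write values in batches to
keep the overhead low for small groups.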
Meanwhile, we should minimize the overhead of this: it should perform about as
well as before on small and medium datasets.
If we just add this as a new API, we may still need to fix groupByKey() in the
future so that it supports hot keys (or supports joins on hot keys).