Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/3198#issuecomment-62502932
Will take a look at #1977.
I believe that the most common uses for groupByKey, like writing out
partitioned tables, involve iterating over each group a single time and without
backtracking to previous groups. For this access pattern, we can achieve
better performance by avoiding the extra serialization/deserialization and
trips to disk that an ExternalIterator would require.
I understand that we can't change the semantics of groupByKey, and agree
that adding spilling there could be useful. But if we can add a transformation
that supports the common case without the extra I/O overhead, I think that's
equally worthwhile.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]