Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/3198#issuecomment-62508663
All good points. Will close this for now.
Longer term, it worries me that Spark wouldn't be able to provide an
operator with performance comparable to what other ETL-focused frameworks
like MR or Tez can achieve. The hit is not just the extra I/O, but also the
memory and GC pressure from keeping a large group in memory.
It seems like there must be a general solution to both this and the hadoopFile
problem that would let us unroll single-sequential-access collections
before they're serialized or cached.
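The memory concern above can be sketched outside of Spark: a groupByKey-style operator must unroll every value for a key onto the heap before the group can be used, while a streaming fold holds only the accumulator. The sketch below is a hypothetical illustration (the `GroupMemorySketch` object and its record count are invented for this example, not Spark internals):

```scala
// Hypothetical sketch (not Spark code): materializing a whole group
// versus streaming over it.
object GroupMemorySketch {
  // Values for a single key arrive as a lazy, single-pass iterator.
  def records: Iterator[Int] = Iterator.range(0, 1000000)

  // groupByKey-style: the entire group is unrolled onto the heap at once,
  // which is the source of the memory and GC pressure described above.
  def materialized: Vector[Int] = records.toVector

  // reduceByKey/aggregateByKey-style: only the running accumulator is
  // held in memory while the iterator is consumed.
  def streamed: Long = records.foldLeft(0L)(_ + _)

  def main(args: Array[String]): Unit =
    println(streamed) // 499999500000
}
```

A general unrolling mechanism would presumably let the materialized case spill or stream instead of requiring the whole collection in memory before serialization or caching.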