Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/3198#issuecomment-62508663
  
    All good points.  Will close this for now.
    
    Longer term, it worries me that Spark wouldn't be able to provide an 
operator with performance comparable to what other ETL-focused frameworks 
like MR or Tez offer.  The hit is not just the extra I/O, but also the memory 
and GC pressure from keeping a large group in memory.
    
    It seems like there must be a general solution to this and the hadoopFile 
problem that would allow us to unroll single-sequential-access collections 
before they're serialized or cached.
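    The tradeoff in question can be sketched outside of Spark. This is a 
hypothetical illustration (not Spark's API, and `group_materialized` / 
`group_streaming` are made-up names): materializing each group holds every 
value for a key in memory at once, which is the source of the GC pressure 
described above, while a streaming view hands back a lazy, 
single-sequential-access iterator per group.

    ```python
    # Hedged sketch, not Spark code: contrasts a fully materialized grouping
    # with a streaming one over key-sorted (key, value) records.
    from itertools import groupby
    from operator import itemgetter

    records = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)]  # sorted by key

    def group_materialized(pairs):
        # Builds a list per key: memory grows with the size of the largest group.
        out = {}
        for k, v in pairs:
            out.setdefault(k, []).append(v)
        return out

    def group_streaming(pairs):
        # Yields (key, lazy iterator of values); each group can be consumed
        # exactly once, in order -- the "single-sequential-access" shape.
        for k, vs in groupby(pairs, key=itemgetter(0)):
            yield k, (v for _, v in vs)

    sums_streaming = {k: sum(vs) for k, vs in group_streaming(records)}
    sums_materialized = {k: sum(vs) for k, vs in group_materialized(records).items()}
    assert sums_streaming == sums_materialized == {"a": 6, "b": 9}
    ```

    The catch, as the discussion above implies, is that a lazy group is only 
safe while it is consumed sequentially; serializing or caching it forces an 
unroll, which is exactly the step a general solution would need to handle.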

