[GitHub] spark pull request: SPARK-3461. Support external groupByKey using ...

sryza Mon, 10 Nov 2014 21:10:33 -0800

Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/3198#issuecomment-62502932
  
    Will take a look at #1977.
    
    I believe that the most common uses for groupByKey, like writing out 
partitioned tables, involve iterating over each group a single time and without 
backtracking to previous groups.  For this access pattern, we can achieve 
better performance by avoiding the extra serialization/deserialization and 
trips to disk that an ExternalIterator would require. 
    
    I understand that we can't change the semantics of groupByKey, and agree 
that adding spilling there could be useful. But if we can add a transformation 
that supports the common case without the extra I/O overhead, I think that's 
equally worthwhile.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: SPARK-3461. Support external groupByKey using ...

Reply via email to