[GitHub] spark pull request: SPARK-3461. Support external groupByKey using ...

davies Mon, 10 Nov 2014 17:23:01 -0800

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/3198#issuecomment-62487004
  
    As we discussed in PR #1977, if we want to keep the same semantics as 
groupByKey(), users should be able to access the results in any pattern, as 
before (such as fetching the key and values, putting them into a list, then 
returning them). Also, the returned RDD should be cacheable.
    
    In order to do this, we should have an ExternalIterator (similar to 
ExternalSorter), which will spill the values to disk if they are too large to 
hold in memory.
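
    A minimal sketch of the spilling idea, not the implementation in this PR or 
in PR #1977. The class name SpillableValues, the max_in_memory parameter, and 
the item-count threshold are all hypothetical; a real version would more likely 
track memory usage than item counts. It buffers values in memory, pickles full 
batches to a temporary file, and can be iterated more than once. Appending while 
an iteration is in progress is not supported.

```python
import pickle
import tempfile


class SpillableValues:
    """Hypothetical container: keeps values in memory, spills batches to disk."""

    def __init__(self, max_in_memory=10000):
        self.max_in_memory = max_in_memory  # assumed threshold, counted in items
        self._buffer = []                   # values not yet spilled
        self._spill = None                  # temp file, created on first spill
        self._offsets = []                  # start offset of each spilled batch
        self._count = 0

    def append(self, value):
        self._buffer.append(value)
        self._count += 1
        if len(self._buffer) >= self.max_in_memory:
            self._flush()

    def _flush(self):
        # Write the current buffer as one pickled batch at the end of the file.
        if self._spill is None:
            self._spill = tempfile.TemporaryFile()
        self._spill.seek(0, 2)
        self._offsets.append(self._spill.tell())
        pickle.dump(self._buffer, self._spill)
        self._buffer = []

    def __len__(self):
        return self._count

    def __iter__(self):
        # Replay spilled batches in insertion order, then the in-memory tail.
        # Seeking before each load keeps repeated iteration independent of
        # wherever the file position was left.
        for offset in list(self._offsets):
            self._spill.seek(offset)
            batch = pickle.load(self._spill)
            for value in batch:
                yield value
        for value in list(self._buffer):
            yield value
```

    Only one spilled batch is held in memory at a time during iteration, so a 
caller that really needs a list can still call list(values), while a caller 
that only streams through the values never materializes them all.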
    
    Meanwhile, we should minimize the overhead of this; it should have 
performance similar to before (for small/medium datasets).
    
    If we just add this as a new API, then we may still need to fix 
groupByKey() in the future to support hot keys (or to support joins on hot 
keys).


