Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1977#issuecomment-55958078
  
    Summarizing some of our in-person discussion (@davies, let me know if I've 
made any mistakes here!):
    
    `GroupByKey` and `SameKey` work together to address the different ways that 
users might consume the result of the group-by.  The design goal here is 
laziness: don't load all of a group's values into a list unless the user calls 
something like `list()` on the values iterator.
    
    If a user does something like `mapValues(len)`, the first call to 
`GroupByKey.next()` will return a `(key, SameKey)` object.  If we then attempt 
to iterate over the values, `SameKey.next()` will read values from the group 
until it hits a value that belongs to a different group.  In this case, 
`SameKey.next()` will place the value into `GroupByKey. next_item` so that it's 
read on the next call to `GroupByKey.next()`.
    
    We also support cases consumes the entire `GroupByKey` iterator and then 
reads individual groups (e.g. call `glom()` on the results of the 
`groupByKey()`).  In these cases, we unroll the group values into a list 
(stored inside of `SameKey`).  If this unrolling would cause an OOM, `SameKey` 
spills to disk and reads the values back once the user consumes the values.  
This feature helps to prevent OOMs when users write things like 
`x.groupByKey().count()` to count the number of distinct keys (there are _much_ 
more efficient ways of doing this, but at least now we won't fail if someone 
tries to do things the bad way).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to