Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/3198#issuecomment-62506051
  
    Hey Sandy, unfortunately, I agree with Davies that this should support 
multiple accesses to the RDD in potentially different patterns. The problem is 
that your Iterable is not actually an Iterable: it can only be iterated once. 
Here are examples of where it will break (see the sketch after this list):
    * Caching -- if you cache this RDD, you'll keep these once-only-iterable 
objects in the cache, which will then show no data the next time you read them
    * Even if the RDD itself is not cached, anything produced by a map(), 
filter(), etc. on it might be, leading to the same problem
    * Joins -- if you join() this RDD with another one, the same Iterable 
value can end up in multiple key-value pairs, so it gets traversed more than 
once even without any caching
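
    To make the failure mode concrete, here is a minimal Scala sketch (not 
the PR's actual code; `OnceIterable` is a hypothetical name) of an Iterable 
backed by a single Iterator, which is what caching would keep in memory:

    ```scala
    // Hypothetical wrapper: exposes a one-shot Iterator as an Iterable.
    class OnceIterable[T](it: Iterator[T]) extends Iterable[T] {
      // Returns the same underlying iterator on every call, so only the
      // first traversal sees any data.
      override def iterator: Iterator[T] = it
    }

    object OnceIterableDemo {
      def main(args: Array[String]): Unit = {
        val values = new OnceIterable(Iterator(1, 2, 3))
        println(values.toList) // List(1, 2, 3) -- first pass drains the iterator
        println(values.toList) // List()        -- a cached read would see this
      }
    }
    ```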
    
    We should design a solution for these cases that allows the iterables to 
be reused multiple times. It's annoying that it would have to spill to disk, 
but that's better than exposing these semantics. Users are already 
super-confused because hadoopFile reuses Writable objects. A rough sketch of 
that direction is below.
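
    For example, something along these lines (all names here are illustrative, 
not Spark APIs): spill the group's values to disk once, and have each call to 
iterator() open a fresh reader so repeated traversals all see the data:

    ```scala
    import java.io.{File, PrintWriter}
    import scala.io.Source

    // Hypothetical re-iterable backed by a spill file.
    class SpilledIterable(file: File) extends Iterable[String] {
      // Every traversal re-reads from disk instead of draining a shared
      // iterator. (A real implementation would also close the stream.)
      override def iterator: Iterator[String] = Source.fromFile(file).getLines()
    }

    object SpilledIterableDemo {
      def main(args: Array[String]): Unit = {
        val spill = File.createTempFile("group", ".txt")
        val out = new PrintWriter(spill)
        Seq("a", "b", "c").foreach(out.println)
        out.close()

        val values = new SpilledIterable(spill)
        println(values.toList) // List(a, b, c)
        println(values.toList) // List(a, b, c) -- second pass still works
      }
    }
    ```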


