Apologies for asking yet again about Spark memory assumptions, but I can't
seem to keep it in my head.
If I use PairRDDFunctions.cogroup, it returns two iterables for every key. Do
the contents of these iterables have to fit in memory, or is the data
streamed?
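For context, here is a minimal sketch of the call in question. The data and variable names are illustrative, and it assumes an existing SparkContext `sc`:

```scala
// Assumes an existing SparkContext `sc`; sample data is made up.
val left  = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))

// For each key present on either side, cogroup pairs the key with
// two iterables, one holding that key's values from each RDD:
// RDD[(String, (Iterable[Int], Iterable[String]))]
val cogrouped = left.cogroup(right)
```

The question is whether each of those two per-key iterables is materialized in memory or lazily streamed.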
Hi Koert,
cogroup is a transformation on RDDs: it creates a CoGroupedRDD and then
performs some transformations on it. When an action is later called, the
compute() method of the CoGroupedRDD is invoked. Roughly speaking, each
element of the CoGroupedRDD is fetched one at a time. Thus the contents of