Don't use collect(): it pulls the entire dataset into the driver's memory, which is what causes the heap overflow. If all you need is the number of records, use count() instead.

On Tue, Apr 19, 2022 at 5:34 AM wilson <i...@bigcount.xyz> wrote:
> Hello,
>
> Do you know, for a big dataset, why the general RDD job can be done but
> collect() fails due to memory overflow?
>
> For instance, for a dataset which has xxx million items, this runs well:
>
> scala> rdd.map { x => x.split(",") }.map { x => (x(5).toString,
> x(6).toDouble) }.groupByKey.mapValues(x =>
> x.sum / x.size).sortBy(-_._2).take(20)
>
> But in the final stage I issued this command and it got:
>
> scala> rdd.collect.size
> 22/04/19 18:26:52 ERROR Executor: Exception in task 13.0 in stage 44.0
> (TID 349)
> java.lang.OutOfMemoryError: Java heap space
>
> Thank you.
> wilson
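To make the distinction concrete, a minimal spark-shell sketch (assuming an already-loaded RDD named `rdd`, as in the original post; not a definitive fix, just the idiomatic alternatives):

```scala
// collect() materializes every partition on the driver JVM -- with
// hundreds of millions of rows this is what throws
// java.lang.OutOfMemoryError: Java heap space.
// val all = rdd.collect()   // avoid on large datasets

// count() aggregates per-partition counts on the executors;
// only one Long per partition is sent back to the driver.
val n: Long = rdd.count()

// If a small slice on the driver is really needed, bound it explicitly:
val firstFew = rdd.take(20)
```

The earlier pipeline succeeds for the same reason: take(20) only ships 20 records to the driver, while the heavy grouping and averaging stay distributed on the executors.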