Is your data skewed? What happens if you do rdd.count()?

On 4 Aug 2015 05:49, "Jasleen Kaur" <jasleenkaur1...@gmail.com> wrote:
> I am executing a Spark job on a cluster in yarn-client mode (yarn-cluster
> is not an option due to permission issues), with:
>
> - num-executors 800
> - spark.akka.frameSize=1024
> - spark.default.parallelism=25600
> - driver-memory=4G
> - executor-memory=32G
> - My input size is around 1.5 TB.
>
> My problem is that when I execute
> rdd.saveAsTextFile(outputPath, classOf[org.apache.hadoop.io.compress.SnappyCodec])
> I get a heap space error. (Saving as Avro is also not an option; I have
> tried saveAsSequenceFile with GZIP and saveAsNewAPIHadoopFile with the
> same result.) On the other hand, if I execute rdd.take(1), I get no such
> issue. So I am assuming the issue is due to the write.
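For reference, the quoted settings correspond to a spark-submit invocation roughly along these lines (a sketch only; the jar name, main class, and input/output paths are hypothetical placeholders, not from the original post):

```shell
# Sketch of a submit command matching the settings quoted above.
# Jar, class, and HDFS paths are placeholders for illustration.
spark-submit \
  --master yarn-client \
  --num-executors 800 \
  --driver-memory 4G \
  --executor-memory 32G \
  --conf spark.akka.frameSize=1024 \
  --conf spark.default.parallelism=25600 \
  --class com.example.MyJob \
  my-job.jar hdfs:///path/to/input hdfs:///path/to/output
```

Note that rdd.take(1) succeeding only proves the first partition can be computed; saveAsTextFile forces computation of every partition, so a heap error there is consistent with either skewed partitions or per-task memory pressure during the full write.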