Got it. I had understood the issue in a different way.
On Thu, Feb 22, 2018 at 9:19 PM Keith Chapman <keithgchap...@gmail.com> wrote:

> My issue is that there is not enough pressure on GC, hence GC is not
> kicking in fast enough to delete the shuffle files of previous iterations.
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
> On Thu, Feb 22, 2018 at 6:58 PM, naresh Goud <nareshgoud.du...@gmail.com> wrote:
>
>> It would be very difficult to tell without knowing what your
>> application code is doing and what kind of transformations/actions it is
>> performing. From my previous experience, tuning application code to avoid
>> unnecessary objects reduces pressure on GC.
>>
>> On Thu, Feb 22, 2018 at 2:13 AM, Keith Chapman <keithgchap...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm benchmarking a Spark application by running it for multiple
>>> iterations. It's a benchmark that's heavy on shuffle, and I run it on a
>>> local machine with a very large heap (~200GB). The system has an SSD.
>>> When running for 3 to 4 iterations I get into a situation where I run
>>> out of disk space in the /tmp directory. On further investigation I was
>>> able to figure out that the reason for this is that the shuffle files
>>> are still around: because I have a very large heap, GC has not happened,
>>> and hence the shuffle files are not deleted. I was able to confirm this
>>> by lowering the heap size; I then see GC kicking in more often, and the
>>> size of /tmp stays under control. Is there any way I could configure
>>> Spark to handle this issue?
>>>
>>> One option that I have is to have GC run more often by
>>> setting spark.cleaner.periodicGC.interval to a much lower value. Is
>>> there a cleaner solution?
>>>
>>> Regards,
>>> Keith.
>>>
>>> http://keith-chapman.com
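For reference, the workaround Keith mentions can be sketched as below. This is only an illustrative configuration, assuming a standard Spark 2.x setup; the interval value and the scratch-dir path are placeholders, not recommendations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Ask Spark's ContextCleaner to trigger a JVM GC more often than the
// default of 30min, so weakly-referenced shuffle state is reclaimed and
// the corresponding files under spark.local.dir (/tmp by default) get
// deleted sooner.
val conf = new SparkConf()
  .set("spark.cleaner.periodicGC.interval", "5min")
  // Alternatively (or additionally), point shuffle scratch space at a
  // disk with more room; "/path/to/bigger/disk" is a placeholder.
  .set("spark.local.dir", "/path/to/bigger/disk")

val spark = SparkSession.builder()
  .appName("shuffle-benchmark")
  .config(conf)
  .getOrCreate()
```

Lowering the heap, as described in the thread, achieves a similar effect indirectly by making GC fire more frequently, but tuning the cleaner interval avoids shrinking the memory available to the benchmark itself.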