Got it. I had understood the issue in a different way.
On Thu, Feb 22, 2018 at 9:19 PM Keith Chapman <keithgchap...@gmail.com> wrote:

> My issue is that there is not enough pressure on GC, hence GC is not
> kicking in fast enough to delete the shuffle files of previous iterations.
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
> On Thu, Feb 22, 2018 at 6:58 PM, naresh Goud <nareshgoud.du...@gmail.com> wrote:
>
>> It would be very difficult to tell without knowing what your
>> application code is doing and what kind of transformations/actions it is
>> performing. From my previous experience, tuning application code to avoid
>> unnecessary objects reduces pressure on GC.
>>
>> On Thu, Feb 22, 2018 at 2:13 AM, Keith Chapman <keithgchap...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm benchmarking a Spark application by running it for multiple
>>> iterations. It's a benchmark that's heavy on shuffle, and I run it on a
>>> local machine with a very large heap (~200GB). The system has an SSD.
>>> When running for 3 to 4 iterations I get into a situation where I run
>>> out of disk space in the /tmp directory. On further investigation I was
>>> able to figure out that the reason for this is that the shuffle files
>>> are still around: because I have a very large heap, GC has not happened,
>>> and hence the shuffle files are not deleted. I was able to confirm this
>>> by lowering the heap size; I then see GC kicking in more often, and the
>>> size of /tmp stays under control. Is there any way I could configure
>>> Spark to handle this issue?
>>>
>>> One option that I have is to have GC run more often by
>>> setting spark.cleaner.periodicGC.interval to a much lower value. Is
>>> there a cleaner solution?
>>>
>>> Regards,
>>> Keith.
>>>
>>> http://keith-chapman.com
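For reference, the workaround Keith mentions can be sketched as below. This is only an illustrative configuration, assuming a standard Spark 2.x setup; the interval value and the scratch-dir path are placeholders, not recommendations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Ask Spark's ContextCleaner to trigger a JVM GC more often than the
// default of 30min, so weakly-referenced shuffle state is reclaimed and
// the corresponding files under spark.local.dir (/tmp by default) get
// deleted sooner.
val conf = new SparkConf()
  .set("spark.cleaner.periodicGC.interval", "5min")
  // Alternatively (or additionally), point shuffle scratch space at a
  // disk with more room; "/path/to/bigger/disk" is a placeholder.
  .set("spark.local.dir", "/path/to/bigger/disk")

val spark = SparkSession.builder()
  .appName("shuffle-benchmark")
  .config(conf)
  .getOrCreate()
```

Lowering the heap, as described in the thread, achieves a similar effect indirectly by making GC fire more frequently, but tuning the cleaner interval avoids shrinking the memory available to the benchmark itself.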