Re: Long-running job cleanup

2014-12-31 Thread Ganelin, Ilya
...@gmail.com Cc: user@spark.apache.org Subject: Re: Long-running job cleanup Hi Patrick, to follow up on the below discussion, I am including a short code snippet that produces the problem on 1.1. This is kind of stupid
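The snippet itself is cut off in this archive preview. As a purely hypothetical illustration (names and sizes are made up, not taken from the thread), the kind of driver loop that reproduces the accumulation is one that keeps a live reference to every shuffled RDD, so their shuffle files and metadata can never be reclaimed:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable.ArrayBuffer

    object ShuffleAccumulationRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shuffle-accumulation-repro"))

        // Holding a reference to every intermediate RDD keeps its shuffle
        // dependency reachable, so shuffle files and the per-stage metadata
        // are never cleaned up and slowly accumulate over the job's lifetime.
        val retained = ArrayBuffer.empty[RDD[(Int, Int)]]
        for (i <- 1 to 1000) {
          val shuffled = sc.parallelize(1 to 100000).map(x => (x % 100, x)).reduceByKey(_ + _)
          shuffled.count()
          retained += shuffled
        }

        sc.stop()
      }
    }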

Re: Long-running job cleanup

2014-12-30 Thread Ganelin, Ilya
user@spark.apache.org Subject: Re: Long-running job cleanup Hi Patrick - is that cleanup present in 1.1? The overhead I am talking about is with regards to what I believe is shuffle related metadata. If I watch the execution log I see small broadcast variables created

Re: Long-running job cleanup

2014-12-28 Thread Patrick Wendell
What do you mean when you say the overhead of Spark shuffles starts to accumulate? Could you elaborate more? In newer versions of Spark, shuffle data is cleaned up automatically when an RDD goes out of scope. It is safe to remove shuffle data at this point because the RDD can no longer be
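To make the scoping point concrete, here is a minimal sketch (not from the thread; the loop bounds and data are placeholders) of a driver loop written so each shuffled RDD falls out of scope after use, allowing the automatic cleanup described above to reclaim its shuffle data:

    import org.apache.spark.{SparkConf, SparkContext}

    object ScopedShuffleCleanup {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("scoped-shuffle-cleanup"))

        for (i <- 1 to 1000) {
          // Each pass builds a shuffled RDD, materializes it, and then lets
          // both references go out of scope at the end of the iteration, so
          // the shuffle data can be removed once the RDD objects are
          // garbage collected on the driver.
          val batch = sc.parallelize(1 to 100000).map(x => (x % 100, x))
          val reduced = batch.reduceByKey(_ + _)
          reduced.count()
        }

        sc.stop()
      }
    }

Since this cleanup is driven by garbage collection of the RDD objects on the driver, an occasional System.gc() can nudge it along in very long-running loops, though it is not strictly required.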

Re: Long-running job cleanup

2014-12-28 Thread Ilya Ganelin
Hi Patrick - is that cleanup present in 1.1? The overhead I am talking about is with regards to what I believe is shuffle related metadata. If I watch the execution log I see small broadcast variables created for every stage of execution, a few KB at a time, and a certain number of MB remaining

Re: Long-running job cleanup

2014-12-25 Thread Ilya Ganelin
Hello all - can anyone please offer any advice on this issue? -Ilya Ganelin On Mon, Dec 22, 2014 at 5:36 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all, I have a long-running job iterating over a huge dataset. Parts of this operation are cached. Since the job runs for so long,

Long-running job cleanup

2014-12-22 Thread Ganelin, Ilya
Hi all, I have a long-running job iterating over a huge dataset. Parts of this operation are cached. Since the job runs for so long, eventually the overhead of Spark shuffles starts to accumulate, culminating in the driver starting to swap. I am aware of the spark.cleaner.ttl parameter that
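For reference, a hedged sketch of how the periodic-cleanup setting and explicit unpersisting might be wired up on Spark 1.x (the input path, TTL value, and storage level are placeholders, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("long-running-job")
      .set("spark.cleaner.ttl", "3600")   // forget metadata older than this many seconds

    val sc = new SparkContext(conf)

    // Hypothetical input; parts of the pipeline are cached for reuse.
    val data = sc.textFile("hdfs:///path/to/huge/dataset")
    val cached = data.map(_.length).persist(StorageLevel.MEMORY_AND_DISK)

    // ... many passes over `cached` ...

    cached.unpersist()   // release the cached blocks explicitly when done
    sc.stop()

Note that spark.cleaner.ttl also clears persisted RDDs older than the TTL, so a value shorter than the lifetime of cached data can force recomputation or cause failures.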