Correct, brute-force cleanup is not useful. Since Spark 1.0, Spark can do
automatic cleanup of files based on which RDDs are used/garbage collected
by the JVM. That would be the best way, but it depends on the JVM GC
characteristics. If you force a GC periodically in the driver, that might
help you get those files cleaned up sooner.
For the last question, you can trigger a GC in the JVM from Python with:
sc._jvm.System.gc()
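A minimal sketch of how this could be wired up in PySpark. The helper names and the 10-minute default interval are illustrative assumptions, not something from this thread; only the `sc._jvm.System.gc()` call itself comes from the message above.

```python
import threading


def trigger_jvm_gc(sc):
    """Ask the driver JVM to run a garbage collection.

    Uses the (internal) _jvm gateway on the SparkContext, as suggested in
    the thread. Note that System.gc() is only a hint to the JVM.
    """
    sc._jvm.System.gc()


def start_periodic_gc(sc, interval_seconds=600):
    """Trigger a driver-side GC every interval_seconds in a daemon thread.

    Both the function name and the 10-minute default are illustrative
    choices, not values from the thread.
    """
    def loop():
        trigger_jvm_gc(sc)
        timer = threading.Timer(interval_seconds, loop)
        timer.daemon = True  # do not keep the driver alive just for GC
        timer.start()

    loop()
```

Because `System.gc()` is only a hint, this helps most when combined with explicitly unpersisting RDDs you no longer need.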
On Mon, Feb 16, 2015 at 4:08 PM, Antony Mayi
antonym...@yahoo.com.invalid wrote:
thanks, that looks promising but I can't find any reference giving me more
details - can you please point me to something? Also, is it possible to force
a GC from pyspark (as I am using pyspark)?
thanks, Antony.
On Monday, 16 February 2015, 21:05, Tathagata Das
tathagata.das1...@gmail.com wrote:
spark.cleaner.ttl is not the right way - it seems to be really designed for
streaming. Although it keeps the disk usage under control, it also causes loss
of RDDs and broadcasts that are required later, leading to a crash.
Is there any other way?
thanks, Antony.
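One alternative worth sketching here (an assumption on my part, not advice quoted from the thread): explicitly unpersist RDDs the driver no longer needs, so their cached blocks are released deterministically instead of waiting for the JVM GC or a TTL. The helper name is hypothetical:

```python
def release_rdds(rdds):
    """Explicitly unpersist a batch of RDDs the driver no longer needs.

    Calling unpersist() on each RDD frees its cached blocks right away,
    rather than waiting for the driver-side garbage collector to notice
    the RDDs have become unreachable. Returns the number of RDDs released.
    """
    for rdd in rdds:
        rdd.unpersist()
    return len(rdds)
```

In a loop of trainImplicit() runs, calling something like this between iterations keeps only the inputs of the current run cached.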
On Sunday, 15 February 2015, 21:42,
Hi,
I am running a bigger ALS job on Spark 1.2.0 on YARN (CDH 5.3.0) - ALS is
using about 3 billion ratings and I am doing several trainImplicit() runs in a
loop within one Spark session. I have a four-node cluster with 3TB of disk
space on each. Before starting the job, less than 8% of the disk space is used.
spark.cleaner.ttl ?
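For reference, a sketch of how that setting would be passed on submission. The 3600-second value and the script name are arbitrary illustrations, and note that a later message in this thread reports that this approach removed RDDs and broadcasts which were still needed:

```shell
# TTL-based cleanup of old metadata and shuffle files (value in seconds);
# the script name my_als_job.py is a placeholder.
spark-submit \
  --conf spark.cleaner.ttl=3600 \
  my_als_job.py
```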
On Sunday, 15 February 2015, 18:23, Antony Mayi antonym...@yahoo.com
wrote:
Hi,
I am running a bigger ALS job on Spark 1.2.0 on YARN (CDH 5.3.0) - ALS is
using about 3 billion ratings and I am doing several trainImplicit() runs in a
loop within one Spark session.