[ https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390771#comment-14390771 ]
Antony Mayi commented on SPARK-6334:
------------------------------------

bq. btw. I see based on the source code that checkpointing should be happening every 3 iterations - how come I don't see any drops in the disk usage at least once every three iterations? it just seems to be growing constantly... which worries me that even more frequent checkpointing won't help...

ok, I am now sure that changing the checkpointing interval is likely not going to help, same as it is not helping now - the disk usage just grows even after 3x iterations.

I just tried a dirty hack - running a parallel thread that forces GC every x minutes - and suddenly I can see the disk space getting cleared upon every three iterations, when GC runs. see this pattern - the first run is without forcing GC, and in the second one there are noticeable disk usage drops every three steps (ALS iterations):

!gc.png!

so really what's needed to get the shuffles cleaned upon checkpointing is forcing GC. this was my dirty hack:

{code}
from threading import Thread, Event

# driver-side imports (PySpark 1.x); `sc` below is the SparkContext
# provided by the pyspark shell
from pyspark import StorageLevel
from pyspark.mllib.recommendation import ALS


class GC(Thread):
    """Daemon thread forcing a JVM GC on the driver every `period` seconds."""

    def __init__(self, context, period=600):
        Thread.__init__(self)
        self.context = context
        self.period = period
        self.daemon = True
        self.stopped = Event()

    def stop(self):
        self.stopped.set()

    def run(self):
        self.stopped.clear()
        while not self.stopped.is_set():
            self.stopped.wait(self.period)
            self.context._jvm.System.gc()


sc.setCheckpointDir('/tmp')

gc = GC(sc)
gc.start()

training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)

gc.stop()
{code}

> spark-local dir not getting cleared during ALS
> ----------------------------------------------
>
>                 Key: SPARK-6334
>                 URL: https://issues.apache.org/jira/browse/SPARK-6334
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Antony Mayi
>         Attachments: als-diskusage.png, gc.png
>
>
> when running a bigger ALS training, spark spills loads of temp data into the local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running on YARN from CDH 5.3.2), eventually causing all the disks of all nodes to run out of space (in my case I have 12TB of available disk capacity before kicking off the ALS, but it all gets used, and YARN kills the containers when reaching 90%).
> even with all recommended options (configuring checkpointing and forcing GC when possible) it still doesn't get cleared.
> here is my (pseudo)code (pyspark):
> {code}
> sc.setCheckpointDir('/tmp')
> training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
> model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
> sc._jvm.System.gc()
> {code}
> the training RDD has about 3.5 billion items (~60GB on disk). after about 6 hours the ALS consumes all 12TB of disk space in the local-dir and gets killed. my cluster has 192 cores and 1.5TB RAM; for this task I am using 37 executors of 4 cores / 28+4GB RAM each.
> this is a graph of the disk consumption pattern, showing the space being eaten from 7% to 90% during the ALS (90% is when YARN kills the container):
> !als-diskusage.png!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
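For anyone reusing the hack: here is a slightly tidied, generic sketch of the same periodic-GC thread (the class and parameter names are mine, not any Spark API). It takes the GC call as an injected callable, so it can be exercised without a live SparkContext; under PySpark you would pass `lambda: sc._jvm.System.gc()` as the action.

{code}
import threading


class PeriodicTrigger(threading.Thread):
    """Daemon thread that calls `action` every `period` seconds until stop()."""

    def __init__(self, action, period=600.0):
        super().__init__(daemon=True)
        self.action = action
        self.period = period
        self._stopped = threading.Event()

    def stop(self):
        self._stopped.set()

    def run(self):
        # Event.wait() returns True as soon as stop() sets the flag, so the
        # loop exits promptly and without a trailing call to `action`.
        while not self._stopped.wait(self.period):
            self.action()


# usage under PySpark would look like:
#   trigger = PeriodicTrigger(lambda: sc._jvm.System.gc(), period=600)
#   trigger.start()
#   ... run the ALS training ...
#   trigger.stop()
{code}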