[ https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361715#comment-14361715 ]

Antony Mayi edited comment on SPARK-6334 at 3/14/15 11:11 AM:
--------------------------------------------------------------

bq. What are the files that are filling up the disk, shuffle?
Yes, it is all shuffle data.

bq. Did you try the ttl settings?
Do you mean spark.cleaner.ttl? Yes, but that deletes data that is still required later, and ALS then fails when it tries to use it.
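For context, a minimal sketch (an illustration, not taken from this report) of how spark.cleaner.ttl is typically set; the application name and the 3600-second value are arbitrary assumptions:

{code}
# Hedged sketch only: enabling Spark's periodic TTL-based cleanup.
# The 3600s value is an arbitrary example; any TTL shorter than the lifetime
# of the ALS lineage causes exactly the "data required later is lost" failure
# described above.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("als-ttl-sketch")        # hypothetical app name
        .set("spark.cleaner.ttl", "3600"))   # seconds
sc = SparkContext(conf=conf)
{code}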


> spark-local dir not getting cleared during ALS
> ----------------------------------------------
>
>                 Key: SPARK-6334
>                 URL: https://issues.apache.org/jira/browse/SPARK-6334
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Antony Mayi
>         Attachments: als-diskusage.png
>
>
> When running a bigger ALS training job, Spark spills loads of temporary shuffle data into the local dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running on YARN from CDH 5.3.2), eventually causing the disks of all nodes to run out of space. In my case I have 12TB of available disk capacity before kicking off the ALS, but it all gets used, and YARN kills the containers when usage reaches 90%.
> Even with all the recommended options (configuring checkpointing and forcing GC when possible) it still doesn't get cleared.
> Here is my (pseudo)code (pyspark):
> {code}
> from pyspark import StorageLevel
> from pyspark.mllib.recommendation import ALS
>
> sc.setCheckpointDir('/tmp')  # directory for RDD checkpoints
> # ~3.5 billion ratings, cached to memory and spilled to disk
> training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
> # implicit-feedback ALS: rank 50, 15 iterations
> model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
> sc._jvm.System.gc()  # nudge the JVM GC so Spark's reference-based cleaner can run
> {code}
> The training RDD has about 3.5 billion items (~60GB on disk). After about 6 hours the ALS run has consumed all 12TB of local-dir disk space and gets killed. My cluster has 192 cores and 1.5TB RAM; for this task I am using 37 executors with 4 cores and 28+4GB RAM each.
> This is the graph of the disk consumption pattern, showing usage growing from 7% to 90% during the ALS run (90% is when YARN kills the containers):
> !als-diskusage.png!
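As an aside on the checkpointing mentioned above: sc.setCheckpointDir() only registers the directory; an RDD is checkpointed only if checkpoint() is called on it and it is then materialized by an action. A minimal sketch of that generic mechanism (illustrative only, not from the original report):

{code}
# Illustrative only: explicit RDD checkpointing to truncate lineage so old
# shuffle files become eligible for cleanup. Note that ALS's internal
# intermediate RDDs are not directly reachable from user code this way.
sc.setCheckpointDir('/tmp')

training = sc.pickleFile('/tmp/dataset').repartition(768)
training.checkpoint()   # mark the RDD for checkpointing
training.count()        # an action materializes the checkpoint
{code}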


