[ https://issues.apache.org/jira/browse/SPARK-20598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621396#comment-16621396 ]

holdenk commented on SPARK-20598:
---------------------------------

Huh, that's interesting. I suspect that could be because we're keeping the 
reference inside of PySpark.
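
A quick way to probe that hypothesis from the Python side: take a weak 
reference to the checkpointed DataFrame, drop the strong reference, and see 
whether a GC cycle actually frees it. This is only a diagnostic sketch; the 
checkpoint directory and session setup are placeholders, not from the report.

{code}
import gc
import weakref

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder path

df = spark.createDataFrame([Row(a=1, b=2)]).checkpoint()
probe = weakref.ref(df)

del df        # drop the only strong Python reference to the DataFrame
gc.collect()  # force a collection on the Python side

# If this prints True, something inside PySpark (or py4j) still holds the
# object, so the JVM-side ContextCleaner never sees the RDD as collectible.
print("old DataFrame still alive:", probe() is not None)
{code}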

> Iterative checkpoints do not get removed from HDFS
> --------------------------------------------------
>
>                 Key: SPARK-20598
>                 URL: https://issues.apache.org/jira/browse/SPARK-20598
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core, YARN
>    Affects Versions: 2.1.0
>            Reporter: Guillem Palou
>            Priority: Major
>
> I am running a PySpark application that makes use of {{dataframe.checkpoint()}} 
> because Spark otherwise needs exponential time to compute the query plan, and 
> eventually I had to stop it. Using {{checkpoint}} allowed the application to 
> proceed with the computation, but I noticed that the HDFS cluster was filling 
> up with RDD files. Spark is running in YARN client mode.
> I managed to reproduce the problem in the toy example below:
> {code}
> import gc
>
> from pyspark.sql import functions as F, types as T
>
> # `spark` and `sc` are the session and context provided by the PySpark shell
> df = spark.createDataFrame([T.Row(a=1, b=2)]).checkpoint()
> for i in range(4):
>     # either of the following two lines will reproduce the problem
>     df = df.select('*', F.concat(*df.columns)).cache().checkpoint()
>     df = df.join(df, on='a').cache().checkpoint()
>     # the following two lines do not seem to have any effect
>     gc.collect()
>     sc._jvm.System.gc()
> {code}
> After running the code and calling {{sc.stop()}}, I can still see the RDDs 
> checkpointed in HDFS:
> {quote}
> guillem@ip-10-9-94-0:~$ hdfs dfs -du -h $CHECKPOINT_PATH
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-12
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-18
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-24
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-30
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-6
> {quote}
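> A possible stopgap (a sketch only, not anything Spark provides: the helper 
> name and the "keep only the newest rdd-* directory" heuristic are 
> assumptions) is to delete stale checkpoint directories by hand through the 
> JVM's Hadoop FileSystem, since PySpark exposes no public API for this:
> {code}
> def prune_checkpoints(spark, checkpoint_dir):
>     jvm = spark._jvm
>     conf = spark._jsc.hadoopConfiguration()
>     root = jvm.org.apache.hadoop.fs.Path(checkpoint_dir)
>     fs = root.getFileSystem(conf)
>     # one UUID subdirectory per SparkContext; rdd-* dirs live beneath it
>     for app_dir in fs.listStatus(root):
>         rdd_dirs = sorted(fs.listStatus(app_dir.getPath()),
>                           key=lambda s: s.getModificationTime())
>         for status in rdd_dirs[:-1]:           # keep only the newest
>             fs.delete(status.getPath(), True)  # recursive delete
> {code}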
> The config flag {{spark.cleaner.referenceTracking.cleanCheckpoints}} is set 
> to {{true}}, so I would expect Spark to clean up the checkpoint files of all 
> RDDs that can no longer be accessed.
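> For reference, this is how the flag is set when building the session, since 
> it must be in the SparkConf before the context starts (standard builder API; 
> the app name and checkpoint path below are placeholders):
> {code}
> from pyspark.sql import SparkSession
>
> spark = (SparkSession.builder
>          .appName("checkpoint-repro")  # placeholder name
>          .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
>          .getOrCreate())
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # placeholder
> {code}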


