Guillem Palou created SPARK-20598:
-------------------------------------
Summary: Iterative checkpoints do not get removed from HDFS
Key: SPARK-20598
URL: https://issues.apache.org/jira/browse/SPARK-20598
Project: Spark
Issue Type: Bug
Components: PySpark, Spark Core, YARN
Affects Versions: 2.1.0
Reporter: Guillem Palou
I am running a PySpark application that uses {{DataFrame.checkpoint()}} because, without it, Spark needs exponential time to compute the query plan and I eventually had to stop the job. Using {{checkpoint}} allowed the application to proceed with the computation, but I noticed that the HDFS cluster was filling up with RDD checkpoint files. Spark is running in YARN client mode.
I managed to reproduce the problem with the toy example below:
{code}
import gc

import pyspark.sql.functions as F
import pyspark.sql.types as T

df = spark.createDataFrame([T.Row(a=1, b=2)]).checkpoint()
for i in range(4):
    # either of the following two lines is enough to reproduce the issue
    df = df.select('*', F.concat(*df.columns)).cache().checkpoint()
    df = df.join(df, on='a').cache().checkpoint()
    # the following two lines do not seem to have any effect
    gc.collect()
    sc._jvm.System.gc()
{code}
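For context, the session in the repro already has the cleaner flag enabled. A sketch of the setup (the app name and checkpoint path below are illustrative placeholders, not the real ones):

{code}
from pyspark.sql import SparkSession

# Session setup assumed by the repro above; the checkpoint
# directory is a placeholder, not the actual cluster path.
spark = (
    SparkSession.builder
    .appName("checkpoint-leak-repro")
    .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
    .getOrCreate()
)
sc = spark.sparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpoint-leak-repro")
{code}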
After running the code and calling {{sc.stop()}}, I can still see the checkpointed RDDs in HDFS:
{quote}
guillem@ip-10-9-94-0:~$ hdfs dfs -du -h $CHECKPOINT_PATH
5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-12
5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-18
5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-24
5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-30
5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-6
{quote}
The config flag {{spark.cleaner.referenceTracking.cleanCheckpoints}} is set to {{true}}. I would expect Spark to clean up the checkpoint files of all RDDs that are no longer reachable.
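As a stop-gap while the cleaner does not remove them, the checkpoint directory can be deleted by hand from the driver. A sketch (the helper name is mine, and it relies on py4j internals such as {{sc._jsc}} and {{sc._jvm}} rather than a public Spark API):

{code}
def clear_checkpoint_dir(spark):
    """Best-effort manual cleanup of this app's checkpoint directory.

    Sketch only: assumes the directory was set via
    SparkContext.setCheckpointDir(). It deletes ALL checkpoints of
    this application, so call it only once no checkpointed
    DataFrame is needed any more.
    """
    sc = spark.sparkContext
    ckpt = sc._jsc.sc().getCheckpointDir()  # Scala Option[String]
    if ckpt.isEmpty():
        return False
    path = sc._jvm.org.apache.hadoop.fs.Path(ckpt.get())
    fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
    return fs.delete(path, True)  # recursive delete
{code}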
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)