Assaf Mendelson created SPARK-19046:
---------------------------------------
Summary: Dataset checkpoint consumes too much disk space
Key: SPARK-19046
URL: https://issues.apache.org/jira/browse/SPARK-19046
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Assaf Mendelson
Consider the following simple example:
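val df = spark.range(100000000)
// Assumption: checkpoint() needs a checkpoint directory to be set first;
// the path below is a placeholder and does not come from the report.
spark.sparkContext.setCheckpointDir("/checkpoint")
df.cache()
df.count()       // materializes the cache
df.checkpoint()  // eager by default: writes the data to the checkpoint directory
df.write.parquet("/test1")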
Looking at the Storage tab of the UI, the cached dataframe takes 97.5 MB.
Looking at the checkpoint directory, the checkpoint takes 3.3 GB (about 33 times
larger!).
Looking at the parquet directory, the dataframe takes 386 MB.
Similar behavior can be seen on less synthetic examples.
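
To put the three figures on a common scale, the sizes quoted above can be converted to bytes per row. The following is only a rough sketch: it assumes decimal units (1 MB = 10^6 bytes), which the report does not state, and the object and method names are made up for illustration.

```scala
// Rough per-row cost of each storage form, using the sizes quoted in the report.
// Assumption: decimal units (1 MB = 1e6 bytes); the report does not say which it uses.
object CheckpointSizeEstimate {
  val rows: Long = 100000000L // spark.range(100000000)
  val MB: Double = 1e6
  val GB: Double = 1e9

  // Sizes as reported in the issue.
  val sizes: Seq[(String, Double)] = Seq(
    "cache (Storage tab)"  -> 97.5 * MB,
    "checkpoint directory" -> 3.3 * GB,
    "parquet directory"    -> 386 * MB
  )

  // Average number of bytes used to store one row.
  def bytesPerRow(bytes: Double): Double = bytes / rows

  def main(args: Array[String]): Unit = {
    val rawLong = 8.0 // a Long occupies 8 bytes uncompressed
    println(f"raw Long               $rawLong%6.2f bytes/row")
    sizes.foreach { case (name, bytes) =>
      println(f"$name%-22s ${bytesPerRow(bytes)}%6.2f bytes/row")
    }
  }
}
```

Under these assumptions the checkpoint costs roughly 33 bytes per row, about four times the 8 bytes of an uncompressed Long, while the cache and the parquet output both come in well below 8 bytes per row.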