[
https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002529#comment-17002529
]
Xiao Li commented on SPARK-30316:
---------------------------------
The compression ratio depends on your data layout, not on the number of rows.
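To illustrate the point about layout, here is a minimal sketch in plain Python (stdlib zlib, not Spark or parquet): the same values compress very differently depending on whether they are stored in sorted runs or in random order, because run-length and dictionary-style encoding exploit locality. A random repartition destroys that locality in the same way, which is why the parquet output grows.

```python
import random
import zlib

# Same multiset of values, two layouts: long sorted runs vs. shuffled.
# Compression exploits locality, so the shuffled layout compresses far worse.
random.seed(0)
values = [i // 1000 for i in range(100_000)]  # 100 runs of 1000 equal values

sorted_bytes = bytes(v % 256 for v in values)

shuffled = values[:]
random.shuffle(shuffled)          # destroys run structure, like a random shuffle
shuffled_bytes = bytes(v % 256 for v in shuffled)

sorted_size = len(zlib.compress(sorted_bytes))
shuffled_size = len(zlib.compress(shuffled_bytes))
print(sorted_size, shuffled_size)  # shuffled output is much larger
```

By the same reasoning, sorting the data within each partition before writing (e.g. with Spark's sortWithinPartitions) can restore locality after a shuffle and shrink the parquet output again.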
> data size boom after shuffle writing dataframe save as parquet
> --------------------------------------------------------------
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, SQL
> Affects Versions: 2.4.4
> Reporter: Cesc
> Priority: Major
>
> When I read the same parquet file and then save it in two ways, with and
> without a shuffle, I found that the sizes of the output parquet files are
> quite different. For example, with an original parquet file of 800 MB: if I
> save it without a shuffle, the size stays 800 MB, whereas if I repartition
> it and then save it in parquet format, the size grows to 2.5 GB. The row
> counts, column counts, and contents of the two output files are all the same.
> I wonder:
> firstly, why does the data size increase after a repartition/shuffle?
> secondly, if I need to shuffle the input dataframe, how can I save it as a
> parquet file efficiently and avoid the data size boom?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)