[
https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001833#comment-17001833
]
Terry Kim commented on SPARK-30316:
-----------------------------------
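This is a plausible outcome. When you repartition/shuffle the data, the rows
are reordered, so values that were stored next to each other may end up
scattered. Parquet's encodings and compression work best on runs of similar
values, so the compression ratio can become worse after a shuffle.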
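
To illustrate the effect outside of Spark, here is a minimal sketch using only
the Python standard library. It is not Parquet and not the reporter's data:
the column contents, the row count, and the helper name compressed_size are
all assumptions made up for the illustration, and zlib stands in for whatever
codec the Parquet writer uses. It only shows that the same rows can compress
very differently depending on their order.

```python
import random
import zlib


def compressed_size(values):
    """Return the zlib-compressed size in bytes of newline-joined values."""
    payload = "\n".join(values).encode("utf-8")
    return len(zlib.compress(payload, 6))


# Hypothetical column: 200,000 rows drawn from 50 distinct values,
# stored in clustered (sorted) order, as a well-laid-out file might be.
random.seed(0)
categories = [f"category_{i:03d}" for i in range(50)]
clustered = sorted(random.choice(categories) for _ in range(200_000))

# Simulate a shuffle: exactly the same rows, in arbitrary order.
shuffled = clustered[:]
random.shuffle(shuffled)

clustered_bytes = compressed_size(clustered)
shuffled_bytes = compressed_size(shuffled)
print(f"clustered: {clustered_bytes} bytes, shuffled: {shuffled_bytes} bytes")
```

If the same mechanism is at work in your job, one thing worth trying (an
untested suggestion, not a confirmed fix for this issue) is to restore some
locality after the shuffle, for example by calling sortWithinPartitions on
columns with many repeated values before writing the parquet output.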
> data size boom after shuffle writing dataframe save as parquet
> --------------------------------------------------------------
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, SQL
> Affects Versions: 2.4.4
> Reporter: Cesc
> Priority: Blocker
>
> When I read the same parquet file and then save it in two ways, with shuffle
> and without shuffle, I find that the sizes of the output parquet files are
> quite different. For example, given an original parquet file of 800 MB, if I
> save it without shuffle, the size is still 800 MB, whereas if I use the
> repartition method and then save it in parquet format, the data size
> increases to 2.5 GB. The row count, column count, and content of the two
> output files are all the same.
> I wonder:
> first, why does the data size increase after repartition/shuffle?
> second, if I need to shuffle the input dataframe, how can I save it as a
> parquet file efficiently and avoid the data size boom?