[jira] [Commented] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet

Terry Kim (Jira) Sat, 21 Dec 2019 21:59:09 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001833#comment-17001833
 ]


Terry Kim commented on SPARK-30316:
-----------------------------------

This is a possible scenario. When you repartition/shuffle the data, the 
values you are storing can be reordered in a way that makes the compression 
ratio worse, so the output grows even though the content is unchanged.
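
The effect described above can be reproduced outside Spark. The following is a
minimal sketch, not a measurement of Parquet itself: it uses Python's standard
zlib module as a stand-in for the compression codec, and the column contents
(50 hypothetical city names, 100,000 rows) are made up for illustration. It
compares the same values in clustered order and in random order.

```python
import random
import zlib


def compressed_size(values):
    """Return the zlib-compressed size in bytes of the values joined as text."""
    payload = "\n".join(values).encode("utf-8")
    return len(zlib.compress(payload))


def count_runs(values):
    """Count runs of consecutive equal values (what run-length encoding stores)."""
    runs = 0
    previous = object()  # sentinel that never equals a real value
    for value in values:
        if value != previous:
            runs += 1
            previous = value
    return runs


# Hypothetical column: 100,000 rows over 50 distinct values, laid out so
# that equal values sit next to each other (2,000 rows per value).
clustered = [f"city_{i // 2000:02d}" for i in range(100_000)]

# The same rows in random order, standing in for the order after a shuffle.
shuffled = clustered[:]
random.Random(42).shuffle(shuffled)

clustered_size = compressed_size(clustered)
shuffled_size = compressed_size(shuffled)

print("runs  clustered:", count_runs(clustered), " shuffled:", count_runs(shuffled))
print("bytes clustered:", clustered_size, " shuffled:", shuffled_size)
```

Both lists hold exactly the same values; only the order differs. The shuffled
list has far more runs and compresses to a larger size, which is the same
mechanism that can inflate a Parquet file after a shuffle.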

> data size boom after shuffle writing dataframe save as parquet
> --------------------------------------------------------------
>
>                 Key: SPARK-30316
>                 URL: https://issues.apache.org/jira/browse/SPARK-30316
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, SQL
>    Affects Versions: 2.4.4
>            Reporter: Cesc 
>            Priority: Blocker
>
> When I read the same parquet file and then save it in two ways, with shuffle 
> and without shuffle, the sizes of the output parquet files are quite 
> different. For example, given an original parquet file of 800 MB, if I save 
> it without a shuffle, the size is still 800 MB, whereas if I call repartition 
> and then save it in parquet format, the data size increases to 2.5 GB. The 
> row count, column count, and content of the two output files are all the same.
> I wonder:
> first, why does the data size increase after repartition/shuffle?
> second, if I need to shuffle the input dataframe, how can I save it as a 
> parquet file efficiently and avoid the data size boom?
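
One commonly suggested way to approach the second question is to restore a
local ordering after the shuffle and before the write. The sketch below is an
assumption, not something stated in this thread: write_clustered_parquet and
its parameters are hypothetical names, df is assumed to be a
pyspark.sql.DataFrame, and the columns to cluster on have to be chosen by the
caller (repetitive, low-cardinality columns tend to benefit most). Whether it
recovers the original size depends on the data.

```python
def write_clustered_parquet(df, path, num_partitions, cluster_cols):
    """Shuffle df, then sort each partition locally before writing Parquet.

    df is assumed to be a pyspark.sql.DataFrame; cluster_cols is a list of
    column names picked by the caller (hypothetical, for illustration).
    """
    clustered = (
        # Partition by the chosen columns so equal keys land in the same partition.
        df.repartition(num_partitions, *cluster_cols)
        # Sort inside each partition only; this does not trigger another shuffle.
        .sortWithinPartitions(*cluster_cols)
    )
    clustered.write.parquet(path)
    return clustered
```

A call might look like write_clustered_parquet(df, "/tmp/out", 200, ["country",
"city"]), where the path, partition count, and column names are placeholders.
Comparing the output size with and without the sort on a sample of the real
data is the only reliable way to tell whether it helps.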



