[
https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002529#comment-17002529
]
Xiao Li commented on SPARK-30316:
---------------------------------
The compression ratio depends on your data layout, not on the number of rows.
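To illustrate the point about layout, here is a minimal sketch in plain Python (stdlib zlib, not Spark or parquet): the same values compress very differently depending on whether they are stored in sorted runs or in random order, because run-length and dictionary-style encoding exploit locality. A random repartition destroys that locality in the same way, which is why the parquet output grows.

```python
import random
import zlib

# Same multiset of values, two layouts: long sorted runs vs. shuffled.
# Compression exploits locality, so the shuffled layout compresses far worse.
random.seed(0)
values = [i // 1000 for i in range(100_000)]  # 100 runs of 1000 equal values

sorted_bytes = bytes(v % 256 for v in values)

shuffled = values[:]
random.shuffle(shuffled)          # destroys run structure, like a random shuffle
shuffled_bytes = bytes(v % 256 for v in shuffled)

sorted_size = len(zlib.compress(sorted_bytes))
shuffled_size = len(zlib.compress(shuffled_bytes))
print(sorted_size, shuffled_size)  # shuffled output is much larger
```

By the same reasoning, sorting the data within each partition before writing (e.g. with Spark's sortWithinPartitions) can restore locality after a shuffle and shrink the parquet output again.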
> data size boom after shuffle writing dataframe save as parquet
> --------------------------------------------------------------
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, SQL
> Affects Versions: 2.4.4
> Reporter: Cesc
> Priority: Major
>
> When I read the same parquet file and then save it in two ways, with and
> without a shuffle, I found that the sizes of the output parquet files are
> quite different. For example, with an original parquet file of 800 MB: if I
> save it without a shuffle, the size stays 800 MB, whereas if I repartition
> it and then save it in parquet format, the size grows to 2.5 GB. The row
> counts, column counts, and contents of the two output files are all the same.
> I wonder:
> firstly, why does the data size increase after a repartition/shuffle?
> secondly, if I need to shuffle the input dataframe, how can I save it as a
> parquet file efficiently and avoid the data size boom?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)