[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974068#comment-16974068 ]

Micah Kornfield commented on ARROW-7150:
----------------------------------------

Note: the total data size for 8200.parquet appears to be ~300MB.

 

It looks like dictionary encoding is only being applied to the first page (see the parquet-tools output below). Based on the Parquet format documentation, this might be expected:
{quote}The dictionary encoding builds a dictionary of values encountered in a 
given column. The dictionary will be stored in a dictionary page per column 
chunk. The values are stored as integers using the [RLE/Bit-Packing 
Hybrid|https://github.com/apache/parquet-format/blob/master/Encodings.md#RLE] 
encoding. If the dictionary grows too big, whether in size or number of 
distinct values, the encoding will fall back to the plain encoding. The 
dictionary page is written first, before the data pages of the column chunk.
{quote}
 

{{parquet-tools dump 8200.parquet | head -20}}

 

{code}
 stream TV=8200 RL=0 DL=1 DS: 114 DE:PLAIN_DICTIONARY
 ----------------------------------------------------------------------------
 page 0:  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:1024
 page 1:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no sta [more]... VC:1024
 page 2:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no sta [more]... VC:1024
 page 3:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no sta [more]... VC:1024
 page 4:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no sta [more]... VC:1024
 page 5:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no sta [more]... VC:1024
 page 6:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no sta [more]... VC:1024
 page 7:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no sta [more]... VC:1024
 page 8:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no sta [more]... VC:8
{code}

> [Python] Explain parquet file size growth
> -----------------------------------------
>
>                 Key: ARROW-7150
>                 URL: https://issues.apache.org/jira/browse/ARROW-7150
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Mac OS X
>            Reporter: Bogdan Klichuk
>            Priority: Major
>         Attachments: 820.parquet
>
>
> Having columnar storage format in mind, with gzip compression enabled, I 
> can't make sense of how parquet file size is growing in my specific example.
> So far without sharing a dataset (would need to create a mock one to share).
> {code:java}
> > # 1. read 820 rows from a parquet file
> > df = pandas.read_parquet('820.parquet')
> > # size of 820.parquet is 528K
> > len(df)
> 820
> > # 2. write 8200 rows to a parquet file
> > df_big = pandas.concat([df] * 10).reset_index(drop=True)
> > len(df_big)
> 8200
> > df_big.to_parquet('8200.parquet', compression='gzip')
> > # size of 8200.parquet is 33M. Why is it 60 times bigger?
>  {code}
>   
> Compression works better on bigger files. How come a 10x increase with 
> repeated data resulted in 60x growth of the file? Insane imo.
>  
> I'm working on a periodic job that concatenates smaller files into bigger 
> ones, and now I'm doubting whether I need it.
>  
> I attached 820.parquet to try out.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)