[jira] [Commented] (ARROW-7150) [Python] Explain parquet file size growth

Micah Kornfield (Jira) Thu, 14 Nov 2019 21:58:54 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974827#comment-16974827
 ]


Micah Kornfield commented on ARROW-7150:
----------------------------------------

[~klichukb] I don't think either Parquet or Avro is going to give you great 
performance if you are simply storing JSON strings as data.  If you want 
smaller file sizes, two Parquet columns, one with the field name and one with 
the field value (boolean), would do a lot better.  Unfortunately, the Arrow 
library doesn't support writing nested data yet (otherwise 
list<struct<field: string, value: bool>> would be preferred).
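
As a rough illustration of the two-column layout, here is a minimal sketch. It 
assumes each JSON string is a flat object mapping field names to booleans; the 
function name flatten_json_flags, the sample data, and the extra row_id column 
(added so each pair can be traced back to its source row) are hypothetical and 
not taken from the attached dataset.

```python
import json


def flatten_json_flags(json_strings):
    """Turn JSON objects of {field_name: bool} into parallel column lists.

    Assumption: every input string decodes to a flat object whose values
    are booleans. Nested objects are not handled in this sketch.
    """
    row_ids, fields, values = [], [], []
    for row_id, raw in enumerate(json_strings):
        for field, value in json.loads(raw).items():
            row_ids.append(row_id)     # which original row the pair came from
            fields.append(field)       # low-cardinality string column
            values.append(bool(value)) # boolean column
    return {"row_id": row_ids, "field": fields, "value": values}


# Hypothetical sample standing in for the JSON string column.
sample = ['{"a": true, "b": false}', '{"a": false}']
columns = flatten_json_flags(sample)
```

The resulting dict can be passed to pandas.DataFrame(columns) and written with 
to_parquet(..., compression='gzip'), so the repeated field names land in their 
own column rather than being embedded in opaque JSON strings.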


I opened https://issues.apache.org/jira/browse/ARROW-7174 which, if 
implemented, might still yield useful compression when there is a lot of 
duplication in the strings.  For now I think this is working as intended.

> [Python] Explain parquet file size growth
> -----------------------------------------
>
>                 Key: ARROW-7150
>                 URL: https://issues.apache.org/jira/browse/ARROW-7150
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Mac OS X
>            Reporter: Bogdan Klichuk
>            Priority: Major
>         Attachments: 820.parquet
>
>
> Having columnar storage format in mind, with gzip compression enabled, I 
> can't make sense of how parquet file size is growing in my specific example.
> So far without sharing a dataset (would need to create a mock one to share).
> {code:java}
> > import pandas
> > # 1. read 820 rows from a parquet file
> > df = pandas.read_parquet('820.parquet')
> > # size of 820.parquet is 528K
> > len(df)
> 820
> > # 2. write 8200 rows to a parquet file
> > df_big = pandas.concat([df] * 10).reset_index(drop=True)
> > len(df_big)
> 8200
> > df_big.to_parquet('8200.parquet', compression='gzip')
> > # size of 8200.parquet is 33M. Why is it 60 times bigger?
>  {code}
>   
> Compression works better on bigger files. How come a 10x increase with 
> repeated data resulted in 60x growth of the file? Insane imo.
>  
> Working on a periodic job that concats smaller files into bigger ones and 
> doubting now whether I need this.
>  
> I attached 820.parquet to try out



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
