[
https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974107#comment-16974107
]
Micah Kornfield commented on ARROW-7150:
----------------------------------------
To answer my own question: it looks like the default dictionary size limit is
set to 1MB in C++ (and the 114 values are ~4MB). The knob to tune this does not
appear to be accessible from Python at the moment. Since at least one new value
is generally expected after the first page, the dictionary would exceed the
limit and the encoding would fall back to plain encoding.
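
To illustrate the effect described above, here is a rough sketch in plain
Python. It is a toy model, not the Arrow C++ writer: the function name
estimate_encoded_bytes, the fixed 4-byte index width, and the sample value
sizes are all assumptions made for illustration, and compression is ignored.
It only shows how a column of a few large, repeated values can grow sharply
once the dictionary limit is hit and the remaining values are written plain.

```python
def estimate_encoded_bytes(values, dict_limit_bytes, index_bytes=4):
    """Toy estimate of encoded size with dictionary fallback (hypothetical model).

    Values are dictionary-encoded until adding a new distinct value would push
    the dictionary past dict_limit_bytes. From that point on, every value is
    counted at its full (plain) size.
    """
    dictionary = set()
    dict_bytes = 0       # total size of distinct values stored in the dictionary
    index_total = 0      # bytes spent on dictionary indices
    plain_total = 0      # bytes spent on plain-encoded values after fallback
    fell_back = False

    for value in values:
        size = len(value)
        if not fell_back and value not in dictionary:
            if dict_bytes + size > dict_limit_bytes:
                fell_back = True          # dictionary is full: switch to plain
            else:
                dictionary.add(value)
                dict_bytes += size
        if fell_back:
            plain_total += size
        else:
            index_total += index_bytes

    return dict_bytes + index_total + plain_total


# Assumed sample data: 3 distinct values of 400,000 bytes each, repeated 10 times.
values = [bytes([i]) * 400_000 for i in range(3)] * 10

with_1mb_limit = estimate_encoded_bytes(values, dict_limit_bytes=1_000_000)
with_2mb_limit = estimate_encoded_bytes(values, dict_limit_bytes=2_000_000)

print(with_1mb_limit)  # 12000008
print(with_2mb_limit)  # 1200120
```

In this model the third distinct value does not fit under the 1MB limit, so 28
of the 30 values are counted at full size, roughly ten times the size obtained
when the whole dictionary fits. Newer pyarrow releases appear to expose a
dictionary_pagesize_limit argument on pyarrow.parquet.write_table; check the
documentation for your version before relying on it.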
> [Python] Explain parquet file size growth
> -----------------------------------------
>
> Key: ARROW-7150
> URL: https://issues.apache.org/jira/browse/ARROW-7150
> Project: Apache Arrow
> Issue Type: Task
> Components: Python
> Affects Versions: 0.15.1
> Environment: Mac OS X
> Reporter: Bogdan Klichuk
> Priority: Major
> Attachments: 820.parquet
>
>
> Having columnar storage format in mind, with gzip compression enabled, I
> can't make sense of how parquet file size is growing in my specific example.
> So far without sharing a dataset (would need to create a mock one to share).
> {code:java}
> > # 1. read 820 rows from a parquet file
> > df = pandas.read_parquet('820.parquet')
> > # size of 820.parquet is 528K
> > len(df)
> 820
> > # 2. write 8200 rows to a parquet file
> > df_big = pandas.concat([df] * 10).reset_index(drop=True)
> > len(df_big)
> 8200
> > df_big.to_parquet('8200.parquet', compression='gzip')
> > # size of 8200.parquet is 33M. Why is it 60 times bigger?
> {code}
>
> Compression works better on bigger files. How come a 10x increase with
> repeated data resulted in 60x growth of the file? Insane imo.
>
> Working on a periodic job that concats smaller files into bigger ones and
> doubting now whether I need this.
>
> I attached 820.parquet to try out