Bogdan Klichuk created ARROW-7150:
-------------------------------------

             Summary: [Python] Explain parquet file size growth
                 Key: ARROW-7150
                 URL: https://issues.apache.org/jira/browse/ARROW-7150
             Project: Apache Arrow
          Issue Type: Task
          Components: Python
    Affects Versions: 0.14.1
         Environment: Mac OS X. Pyarrow==0.15.1
            Reporter: Bogdan Klichuk
Given that parquet is a columnar storage format and gzip compression is enabled, I can't make sense of how the file size grows in my specific example. I'm not sharing the dataset for now (I would need to create a mock one to share).

{code:java}
> df = pandas.read_csv('...')
> len(df)
820
> df.to_parquet('820.parquet', compression='gzip')
> # size of 820.parquet is 6.1M
> df_big = pandas.concat([df] * 10).reset_index(drop=True)
> len(df_big)
8200
> df_big.to_parquet('8200.parquet', compression='gzip')
> # size of 8200.parquet is 320M
{code}

Compression usually works better on larger files. How can a 10x increase in repeated data result in a ~50x growth in file size? This seems insane to me. I'm working on a periodic job that concatenates smaller files into bigger ones, and I'm now doubting whether I need it.
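A minimal sketch of how the two files could be compared at the parquet metadata level, assuming the same '820.parquet' and '8200.parquet' files from the example above, to see which columns and row groups account for the extra bytes:

{code:python}
import pyarrow.parquet as pq

def describe(path):
    # Print row-group count and per-column compressed/uncompressed sizes
    # from the parquet footer metadata.
    meta = pq.ParquetFile(path).metadata
    print(path, '-', meta.num_rows, 'rows in', meta.num_row_groups, 'row group(s)')
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            print('  rg', rg, chunk.path_in_schema, chunk.compression,
                  'compressed:', chunk.total_compressed_size,
                  'uncompressed:', chunk.total_uncompressed_size)

describe('820.parquet')
describe('8200.parquet')
{code}

Comparing the per-column compressed sizes between the two files should show whether the growth is spread evenly or concentrated in a few columns.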