Justin Tan created ARROW-2369:
---------------------------------
Summary: Large (>~20 GB) files written to Parquet via PyArrow are
corrupted
Key: ARROW-2369
URL: https://issues.apache.org/jira/browse/ARROW-2369
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Environment: Reproduced on Ubuntu + macOS
Reporter: Justin Tan
Fix For: 0.9.0
Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png
When writing large Parquet files (above ~20 GB) from Pandas via
{{pq.write_table(my_df, 'table.parquet')}}
the write succeeds, but loading the resulting file fails with
{{ArrowIOError: Invalid parquet file. Corrupt footer.}}
The same error occurs when the Parquet file is written in chunks. When the
files are smaller, say under 10 GB or so (drawn randomly from the same
dataset), everything works as expected.
Details:
Arrow v0.9.0
Reproduced on Ubuntu, macOS
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)