[
https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430572#comment-16430572
]
Antoine Pitrou commented on ARROW-2369:
---------------------------------------
Ok, there are two things going on:
* when {{write_table()}} is called with a filepath string, it goes through
{{PythonFile}}, which is probably inefficient
* {{PythonFile.Seek}} doesn't handle seek offsets greater than 2**32 properly:
{code:python}
>>> f = open('/tmp/empty', 'wb')
>>> f.truncate(1<<33 + 10)
8796093022208
>>> f.close()
>>> f = open('/tmp/empty', 'rb')
>>> paf = pa.PythonFile(f, 'rb')
>>> paf.tell()
0
>>> paf.seek(5)
5
>>> paf.tell()
5
>>> paf.seek(1<<33 + 6)
0
>>> paf.tell()
0
>>> f.seek(1<<33 + 6)
549755813888
>>> f.tell()
549755813888
{code}
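One thing to keep in mind when reading the transcript above: in Python, {{+}} binds more tightly than {{<<}}, so {{1<<33 + 10}} is evaluated as {{1 << 43}} and {{1<<33 + 6}} as {{1 << 39}}. Both values are still well above 2**32, so the demonstration holds. The sketch below checks that arithmetic and repeats the seek/tell round trip on a plain temporary file. The helper name {{seek_roundtrip}} is hypothetical and not part of PyArrow, and the assumption is only that the same helper could be pointed at any file-like object.

```python
import tempfile

# `+` has higher precedence than `<<`, so these are 1 << 43 and 1 << 39,
# not (1 << 33) + 10 and (1 << 33) + 6.
size = 1 << 33 + 10
offset = 1 << 33 + 6
print(size)    # 8796093022208, the value returned by f.truncate() above
print(offset)  # 549755813888, the value returned by f.seek() above

# With explicit parentheses the offset lands just past 8 GiB instead.
intended = (1 << 33) + 6


def seek_roundtrip(fileobj, pos):
    """Hypothetical helper: seek to `pos`, return (seek result, tell())."""
    return fileobj.seek(pos), fileobj.tell()


# Seeking past the end of a regular file is allowed and writes no data,
# so an empty temporary file is enough for the check.
with tempfile.TemporaryFile() as f:
    result = seek_roundtrip(f, offset)

print(result)  # both entries equal `offset` for a plain file object
```

A file object that handles large offsets correctly returns the requested position from both calls, which is what the plain {{f}} does in the transcript and what the {{PythonFile}} wrapper fails to do there.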
> Large (>~20 GB) files written to Parquet via PyArrow are corrupted
> ------------------------------------------------------------------
>
> Key: ARROW-2369
> URL: https://issues.apache.org/jira/browse/ARROW-2369
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Environment: Reproduced on Ubuntu + Mac OSX
> Reporter: Justin Tan
> Assignee: Antoine Pitrou
> Priority: Major
> Labels: Parquet, bug, pandas, parquetWriter, pyarrow
> Fix For: 0.10.0
>
> Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png
>
>
> When writing large Parquet files (above 10 GB or so) from pandas via the
> command
> {{pq.write_table(my_df, 'table.parquet')}}
> the write succeeds, but when the Parquet file is loaded, the error message
> {{ArrowIOError: Invalid parquet file. Corrupt footer.}}
> appears. The same error occurs when the Parquet file is written chunkwise.
> When the Parquet files are small, say < 5 GB or so (drawn randomly from
> the same dataset), everything proceeds as normal. I've also tried this with
> pandas df.to_parquet(), with the same results.
> Update: It looks like any DataFrame with size above ~5 GB (on disk) returns
> the same error.