ASF GitHub Bot commented on ARROW-2369:

pitrou commented on a change in pull request #1866: ARROW-2369: [Python] Fix reading large Parquet files (> 4 GB)
URL: https://github.com/apache/arrow/pull/1866#discussion_r181058227

 File path: cpp/src/arrow/python/io.cc
 @@ -65,14 +65,16 @@ class PythonFile {
   Status Seek(int64_t position, int whence) {
     // whence: 0 for relative to start of file, 2 for end of file
-    PyObject* result = cpp_PyObject_CallMethod(file_, "seek", "(ii)", position, whence);
+    PyObject* result = cpp_PyObject_CallMethod(file_, "seek", "(ni)",
 Review comment:
   Ultimately it is the same format string syntax as documented here: 

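For context on the diff above: in the `PyObject_CallMethod` format string, the code "i" converts its argument to a C int (32 bits), so a 64-bit `int64_t` position is silently truncated, while "n" uses `Py_ssize_t` (64 bits on 64-bit builds) and preserves it. A minimal sketch of that truncation, using `ctypes` only to mimic the C-level conversions (this is illustrative, not the Arrow code itself, and assumes a 64-bit Python build):

```python
# Why "(ii)" corrupted seeks in large files: "i" packs the offset as a
# C int (32 bits); "n" packs it as Py_ssize_t (64 bits on 64-bit builds).
# ctypes mimics those C-level conversions here for illustration only.
import ctypes

offset = 5 * 1024**3  # a 5 GB file offset, too large for a 32-bit int

as_c_int = ctypes.c_int(offset).value           # what "i" would pass on
as_py_ssize_t = ctypes.c_ssize_t(offset).value  # what "n" passes on

print(as_c_int)       # 1073741824 -- silently wrapped to a wrong offset
print(as_py_ssize_t)  # 5368709120 -- the offset survives intact
```

A seek to the wrong (wrapped) offset explains the "Corrupt footer" error: the reader looks for the footer at a position far from the real end of the file.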
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

> Large (>~20 GB) files written to Parquet via PyArrow are corrupted
> ------------------------------------------------------------------
>                 Key: ARROW-2369
>                 URL: https://issues.apache.org/jira/browse/ARROW-2369
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>         Environment: Reproduced on Ubuntu + Mac OSX
>            Reporter: Justin Tan
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: Parquet, bug, pandas, parquetWriter, pull-request-available, pyarrow
>             Fix For: 0.10.0
>         Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png
> When writing large Parquet files (above 10 GB or so) from Pandas via the command
> {{pq.write_table(my_df, 'table.parquet')}}
> the write succeeds, but when the Parquet file is loaded, the error message
> {{ArrowIOError: Invalid parquet file. Corrupt footer.}}
> appears. The same error occurs when the Parquet file is written chunkwise. When
> the Parquet files are small, say < 5 GB or so (drawn randomly from the same
> dataset), everything proceeds as normal. I've also tried this with Pandas
> {{df.to_parquet()}}, with the same results.
> Update: it looks like any DataFrame with size above ~5 GB (on disk) triggers
> the same error.

This message was sent by Atlassian JIRA
