[ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435447#comment-16435447 ]

ASF GitHub Bot commented on ARROW-2369:
---------------------------------------

xhochy closed pull request #1866: ARROW-2369: [Python] Fix reading large Parquet files (> 4 GB)
URL: https://github.com/apache/arrow/pull/1866
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/cpp/src/arrow/python/io.cc b/cpp/src/arrow/python/io.cc
index 36c193dbfa..155e86f828 100644
--- a/cpp/src/arrow/python/io.cc
+++ b/cpp/src/arrow/python/io.cc
@@ -65,14 +65,16 @@ class PythonFile {
 
   Status Seek(int64_t position, int whence) {
     // whence: 0 for relative to start of file, 2 for end of file
-    PyObject* result = cpp_PyObject_CallMethod(file_, "seek", "(ii)", position, whence);
+    PyObject* result = cpp_PyObject_CallMethod(file_, "seek", "(ni)",
+                                               static_cast<Py_ssize_t>(position), whence);
     Py_XDECREF(result);
     PY_RETURN_IF_ERROR(StatusCode::IOError);
     return Status::OK();
   }
 
   Status Read(int64_t nbytes, PyObject** out) {
-    PyObject* result = cpp_PyObject_CallMethod(file_, "read", "(i)", nbytes);
+    PyObject* result =
+        cpp_PyObject_CallMethod(file_, "read", "(n)", static_cast<Py_ssize_t>(nbytes));
     PY_RETURN_IF_ERROR(StatusCode::IOError);
     *out = result;
     return Status::OK();
diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py
index beeedca032..34aa55a9c2 100644
--- a/python/pyarrow/parquet.py
+++ b/python/pyarrow/parquet.py
@@ -60,7 +60,6 @@ class ParquetFile(object):
     """
     def __init__(self, source, metadata=None, common_metadata=None):
         self.reader = ParquetReader()
-        source = _ensure_file(source)
         self.reader.open(source, metadata=metadata)
         self.common_metadata = common_metadata
         self._nested_paths_by_prefix = self._build_nested_paths()
diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py
index b29b9f1f6e..1511600777 100644
--- a/python/pyarrow/tests/test_io.py
+++ b/python/pyarrow/tests/test_io.py
@@ -21,6 +21,7 @@
 import os
 import pytest
 import sys
+import tempfile
 import weakref
 
 import numpy as np
@@ -30,6 +31,25 @@
 from pyarrow.compat import u, guid
 import pyarrow as pa
 
+
+def check_large_seeks(file_factory):
+    if sys.platform in ('win32', 'darwin'):
+        pytest.skip("need sparse file support")
+    try:
+        filename = tempfile.mktemp(prefix='test_io')
+        with open(filename, 'wb') as f:
+            f.truncate(2 ** 32 + 10)
+            f.seek(2 ** 32 + 5)
+            f.write(b'mark\n')
+        with file_factory(filename) as f:
+            assert f.seek(2 ** 32 + 5) == 2 ** 32 + 5
+            assert f.tell() == 2 ** 32 + 5
+            assert f.read(5) == b'mark\n'
+            assert f.tell() == 2 ** 32 + 10
+    finally:
+        os.unlink(filename)
+
+
 # ----------------------------------------------------------------------
 # Python file-like objects
 
@@ -83,6 +103,13 @@ def test_python_file_read():
     f.close()
 
 
+def test_python_file_large_seeks():
+    def factory(filename):
+        return pa.PythonFile(open(filename, 'rb'))
+
+    check_large_seeks(factory)
+
+
 def test_bytes_reader():
     # Like a BytesIO, but zero-copy underneath for C++ consumers
     data = b'some sample data'
@@ -544,6 +571,10 @@ def test_os_file_reader(sample_disk_data):
     _check_native_file_reader(pa.OSFile, sample_disk_data)
 
 
+def test_os_file_large_seeks():
+    check_large_seeks(pa.OSFile)
+
+
 def _try_delete(path):
     try:
         os.remove(path)
@@ -600,6 +631,10 @@ def test_memory_zero_length(tmpdir):
         assert memory_map.size() == 0
 
 
+def test_memory_map_large_seeks():
+    check_large_seeks(pa.memory_map)
+
+
 def test_os_file_writer(tmpdir):
     SIZE = 4096
     arr = np.random.randint(0, 256, size=SIZE).astype('u1')
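
For context on the io.cc change above (an editorial note, not part of the patch): the format strings passed to cpp_PyObject_CallMethod follow the Py_BuildValue conventions, where "i" packs a C int (4 bytes on common platforms) while "n" packs a Py_ssize_t (8 bytes on 64-bit builds). A 64-bit seek offset or read length routed through "i" is therefore mangled once it no longer fits in 32 bits, which is what broke files larger than 4 GB. A minimal sketch of the width difference, assuming a typical 64-bit CPython build:

    import ctypes

    # Widths of the C types behind the two format codes.
    print(ctypes.sizeof(ctypes.c_int))      # 4 -> the old "i" code
    print(ctypes.sizeof(ctypes.c_ssize_t))  # 8 -> the new "n" code (Py_ssize_t)

    offset = 2 ** 32 + 5                    # same offset check_large_seeks uses above
    print(offset & 0xFFFFFFFF)              # 5 -- all that survives 32-bit truncation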


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Large (>~20 GB) files written to Parquet via PyArrow are corrupted
> ------------------------------------------------------------------
>
>                 Key: ARROW-2369
>                 URL: https://issues.apache.org/jira/browse/ARROW-2369
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>         Environment: Reproduced on Ubuntu + Mac OSX
>            Reporter: Justin Tan
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: Parquet, bug, pandas, parquetWriter, pull-request-available, pyarrow
>             Fix For: 0.10.0
>
>         Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png
>
>
> When writing large Parquet files (above 10 GB or so) from pandas via the command
> {{pq.write_table(my_df, 'table.parquet')}}
> the write succeeds, but when the Parquet file is loaded, the error message
> {{ArrowIOError: Invalid parquet file. Corrupt footer.}}
> appears. The same error occurs when the Parquet file is written chunkwise. When
> the Parquet files are small, say < 5 GB or so (drawn randomly from the same
> dataset), everything proceeds as normal. I've also tried this with pandas
> df.to_parquet(), with the same results.
> Update: It looks like any DataFrame above ~5 GB (on disk) produces the same error.
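
A minimal sketch of the reported round trip (an editorial illustration; the DataFrame contents and size are assumptions, not taken from the report). On affected versions the final read fails with the "Corrupt footer" error once the file on disk exceeds roughly 4 GB:

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Roughly 6 GB of poorly compressible float64 data (illustrative size only).
    my_df = pd.DataFrame({'x': np.random.randn(800 * 1000 * 1000)})

    pq.write_table(pa.Table.from_pandas(my_df), 'table.parquet')  # write appears to succeed
    table = pq.read_table('table.parquet')  # ArrowIOError: Invalid parquet file. Corrupt footer.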



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
