[
https://issues.apache.org/jira/browse/ARROW-11792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292758#comment-17292758
]
Daniel Evans commented on ARROW-11792:
--------------------------------------
I've re-run the file generation over the weekend, and it appears that a valid
file has been generated. It therefore seems that this may have been a file
corruption issue, rather than a bug - feel free to close it off unless you
suspect that there was an intermittent issue with file writing.
> PyArrow unable to read file with large string values
> ----------------------------------------------------
>
> Key: ARROW-11792
> URL: https://issues.apache.org/jira/browse/ARROW-11792
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 3.0.0
> Environment: Scientific Linux 7.9; PyArrow 3.0.0, Pandas 1.0.5
> Reporter: Daniel Evans
> Priority: Major
> Attachments: metadata.json
>
>
> I am having difficulty re-reading a Parquet file written out using Pandas.
> The error message hints that either the file was malformed on write, or
> possibly that it is corrupt on disk (hard for me to confirm or deny that
> option - if there's an easy way for me to check, let me know).
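>
> In the meantime, one thing I can try is reading the columns back one at a
> time to see whether the failure is confined to a single column chunk. A
> sketch of that check (not yet verified against the bad file):
> {{
> import pyarrow.parquet as pq
>
> pf = pq.ParquetFile("builtenv_vulns_bad.parquet")
> for name in pf.schema_arrow.names:
>     try:
>         pf.read(columns=[name])
>         print(name, "read OK")
>     except Exception as exc:
>         print(name, "failed:", exc)
> }}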
> The original Pandas dataframe consisted of around 50 million rows with four
> columns. Three columns hold simple {{float}} data, while the fourth is a
> string-typed column containing long strings averaging 200 characters. Each
> string value appears in 20-30 rows, giving around 2 million unique strings.
> If this is an issue with pyarrow, that string column is where my suspicion
> currently lies.
> The file was written out with {{df.to_parquet(compression="brotli")}}.
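>
> For reference, a construction along these lines reproduces the shape of the
> data. It is only an illustration, not the code that produced the file: the
> column names and string contents are made up, and the sizes are scaled down
> so it runs quickly (the real data has ~55 million rows and ~2 million unique
> strings).
> {{
> import numpy as np
> import pandas as pd
>
> n_rows = 1_000_000   # real data: ~55 million rows
> n_unique = 40_000    # real data: ~2 million distinct strings, ~200 chars each
>
> np.random.seed(0)
> uniques = np.array(["v" * 190 + "%010d" % i for i in range(n_unique)],
>                    dtype=object)
>
> df = pd.DataFrame({
>     "a": np.random.rand(n_rows),   # three float columns
>     "b": np.random.rand(n_rows),
>     "c": np.random.rand(n_rows),
>     # long-string column, each value repeated across many rows
>     "s": uniques[np.random.randint(0, n_unique, n_rows)],
> })
>
> df.to_parquet("test_large_strings.parquet", compression="brotli")
> }}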
> As well as pyarrow 3.0.0, I have quickly tried 2.0.0 and 1.0.1, both of which
> also fail to read the file. Annoyingly, re-generating the data and writing it
> out takes several hours - a test on a smaller dataset produces a readable
> file.
> I am able to read the metadata of the file with PyArrow, which looks as I
> expect. The full metadata is attached in JSON format.
> {{
> >>> pyarrow.parquet.read_metadata("builtenv_vulns_bad.parquet")
> <pyarrow._parquet.FileMetaData object at 0x7f8ae91f88e0>
>   created_by: parquet-cpp version 1.5.1-SNAPSHOT
>   num_columns: 4
>   num_rows: 55761732
>   num_row_groups: 1
>   format_version: 1.0
>   serialized_size: 3213
> }}
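>
> The per-column chunk details (encodings, compression, sizes) can be pulled
> out of the same metadata object - a sketch, assuming the single row group
> reported above:
> {{
> import pyarrow.parquet as pq
>
> md = pq.read_metadata("builtenv_vulns_bad.parquet")
> rg = md.row_group(0)
> for i in range(md.num_columns):
>     col = rg.column(i)
>     print(col.path_in_schema, col.physical_type, col.compression,
>           col.total_compressed_size, col.total_uncompressed_size)
> }}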
> I can provide the problematic file privately - it's around 250MB.
> {{
> [...snip...]
>     df = pd.read_parquet(data_source, columns=columns)
>   File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pandas/io/parquet.py", line 312, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pandas/io/parquet.py", line 127, in read
>     path, columns=columns, **kwargs
>   File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1704, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1582, in read
>     use_threads=use_threads
>   File "pyarrow/_dataset.pyx", line 372, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 2266, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
> Deserializing page header failed.
> }}