[ https://issues.apache.org/jira/browse/ARROW-11792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292758#comment-17292758 ]

Daniel Evans commented on ARROW-11792:
--------------------------------------

I've re-run the file generation over the weekend, and it appears that a valid 
file has been generated. It therefore seems that this may have been a file 
corruption issue, rather than a bug - feel free to close it off unless you 
suspect that there was an intermittent issue with file writing.
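
For anyone hitting the same error, the check I'd suggest is simply a full 
read-back, since that forces every page to be decompressed and decoded - 
roughly the following (the path here is illustrative):

{{
import pyarrow.parquet as pq

# Illustrative path to the regenerated file; a clean full read exercises
# every page, so it rules out the original "Deserializing page header
# failed" error.
table = pq.read_table("builtenv_vulns_regenerated.parquet")
print(table.num_rows, table.num_columns)
}}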

> PyArrow unable to read file with large string values
> ----------------------------------------------------
>
>                 Key: ARROW-11792
>                 URL: https://issues.apache.org/jira/browse/ARROW-11792
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>         Environment: Scientific Linux 7.9; PyArrow 3.0.0, Pandas 1.0.5
>            Reporter: Daniel Evans
>            Priority: Major
>         Attachments: metadata.json
>
>
> I am having difficulty re-reading a Parquet file written out using Pandas. 
> The error message hints that either the file was malformed on write, or 
> possibly that it is corrupt on disk (hard for me to confirm or deny that 
> option - if there's an easy way for me to check, let me know).
> The original Pandas dataframe consisted of around 50 million rows with four 
> columns. Three columns hold simple {{float}} data, while the fourth is a 
> string-typed column containing long strings averaging 200 characters. Each 
> string value is present in 20-30 rows, giving around 2 million unique 
> strings. If this is an issue with pyarrow, this string column is where my 
> suspicion currently lies.
> The file was written out with {{df.to_parquet(compression="brotli")}}.
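> For illustration only, the data had roughly the following shape (the column 
> names, string contents and row count below are stand-ins, scaled down from 
> the real ~50 million rows):
> {{
> import numpy as np
> import pandas as pd
> 
> rng = np.random.default_rng(0)
> n_unique, repeats = 10_000, 25  # the real file has ~2 million unique strings
> 
> # ~200-character strings, each repeated on ~25 rows, plus three float columns.
> strings = np.array(["".join(rng.choice(list("abcdefgh"), 200)) for _ in range(n_unique)])
> df = pd.DataFrame({
>     "x": rng.random(n_unique * repeats),
>     "y": rng.random(n_unique * repeats),
>     "z": rng.random(n_unique * repeats),
>     "text": np.repeat(strings, repeats),
> })
> df.to_parquet("example.parquet", compression="brotli")
> }}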
> As well as pyarrow 3.0.0, I have quickly tried 2.0.0 and 1.0.1, both of which 
> also fail to read the file. Annoyingly, re-generating the data and writing it 
> out takes several hours; a test on a smaller dataset produces a readable file.
> I am able to read the metadata of the file with PyArrow, which looks as I 
> expect. The full metadata is attached in JSON format.
> {{
> >>> pyarrow.parquet.read_metadata("builtenv_vulns_bad.parquet")
> <pyarrow._parquet.FileMetaData object at 0x7f8ae91f88e0>
>   created_by: parquet-cpp version 1.5.1-SNAPSHOT
>   num_columns: 4
>   num_rows: 55761732
>   num_row_groups: 1
>   format_version: 1.0
>   serialized_size: 3213
> }}
> I can provide the problematic file privately - it's around 250MB.
> {{
> [...snip...]
>     df = pd.read_parquet(data_source, columns=columns)
>   File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pandas/io/parquet.py", line 312, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pandas/io/parquet.py", line 127, in read
>     path, columns=columns, **kwargs
>   File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1704, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1582, in read
>     use_threads=use_threads
>   File "pyarrow/_dataset.pyx", line 372, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 2266, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
> Deserializing page header failed.
> }}
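> If it helps narrow this down, something like the following (run against the 
> bad file) should show whether the failure is specific to the string column 
> or also hits the float columns - the file only has one row group:
> {{
> import pyarrow.parquet as pq
> 
> pf = pq.ParquetFile("builtenv_vulns_bad.parquet")
> # Try each column on its own to see which one trips the
> # "Deserializing page header failed" error.
> for name in pf.schema_arrow.names:
>     try:
>         pf.read_row_group(0, columns=[name])
>         print(name, "read OK")
>     except Exception as exc:
>         print(name, "failed:", exc)
> }}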


