[ 
https://issues.apache.org/jira/browse/ARROW-11792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291573#comment-17291573
 ] 

Daniel Evans edited comment on ARROW-11792 at 2/26/21, 11:06 AM:
-----------------------------------------------------------------

On some further investigation with pyarrow itself, I can actually read the 
String-typed data (so the title may be misleading), but the other three columns 
fail to read:

dataset = pyarrow.parquet.ParquetDataset(fp)

dataset.read(["damage_ratio_id"])  # No error

dataset.read(["min_hazard_intensity"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1349, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pyarrow/parquet.py", line 781, in read
    table = reader.read(**options)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pyarrow/parquet.py", line 384, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1097, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

dataset.read(["max_hazard_intensity"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1349, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pyarrow/parquet.py", line 781, in read
    table = reader.read(**options)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pyarrow/parquet.py", line 384, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1097, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: No more data to read.
Deserializing page header failed.

dataset.read(["damage_ratio"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1349, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pyarrow/parquet.py", line 781, in read
    table = reader.read(**options)
  File "/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/pyarrow/parquet.py", line 384, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1097, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.
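
The column-by-column check above can be wrapped in a small helper so that every column is tried and the failures are collected in one pass. This is only a sketch: probe_columns and its read_column argument are hypothetical names introduced here, not part of pyarrow. The helper takes any callable that reads a single column, so the reader is whatever you pass in.

```python
from typing import Callable, Dict, Iterable, Optional


def probe_columns(
    read_column: Callable[[str], object],
    columns: Iterable[str],
) -> Dict[str, Optional[str]]:
    """Try each column on its own.

    Returns a dict mapping column name to None if the read succeeded,
    or to "<ExceptionType>: <first line of the message>" if it failed.
    """
    results: Dict[str, Optional[str]] = {}
    for name in columns:
        try:
            read_column(name)
        except Exception as exc:  # the failures above surfaced as OSError
            lines = str(exc).splitlines()
            first_line = lines[0] if lines else ""
            results[name] = "{}: {}".format(type(exc).__name__, first_line)
        else:
            results[name] = None
    return results


# Possible usage with the dataset object from the session above
# (assumes `dataset = pyarrow.parquet.ParquetDataset(fp)` already exists):
#
# report = probe_columns(
#     lambda col: dataset.read([col]),
#     ["damage_ratio_id", "min_hazard_intensity",
#      "max_hazard_intensity", "damage_ratio"],
# )
# for col, err in report.items():
#     print(col, "OK" if err is None else err)
```

Passing the reader in as a callable keeps the helper independent of the pyarrow version, which may help when comparing 3.0.0, 2.0.0 and 1.0.1 against the same file.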



> PyArrow unable to read file with large string values
> ----------------------------------------------------
>
>                 Key: ARROW-11792
>                 URL: https://issues.apache.org/jira/browse/ARROW-11792
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>         Environment: Scientific Linux 7.9; PyArrow 3.0.0, Pandas 1.0.5
>            Reporter: Daniel Evans
>            Priority: Major
>         Attachments: metadata.json
>
>
> I am having difficulty re-reading a Parquet file written out using Pandas. 
> The error message hints that either the file was malformed on write, or 
> possibly that it is corrupt on disk (hard for me to confirm or deny that 
> option - if there's an easy way for me to check, let me know).
>
> The original Pandas dataframe consisted of around 50 million rows with four 
> columns. Three columns are simple `float` data, while the fourth is a 
> string-typed column containing long strings, averaging 200 characters. Each 
> string value is present in 20-30 rows, giving around 2 million unique 
> strings. This is currently where my suspicion lies if it is an issue with 
> pyarrow.
>
> The file was written out with {{df.to_parquet(compression="brotli")}}.
>
> As well as pyarrow 3.0.0, I have quickly tried 2.0.0 and 1.0.1, both of which 
> fail to read. Re-generating the data and writing takes several hours, 
> annoyingly - a test on a smaller dataset produces a readable file.
>
> I am able to read the metadata of the file with PyArrow, which looks as I 
> expect. The full metadata is attached in JSON format.
>
> >>> pyarrow.parquet.read_metadata("builtenv_vulns_bad.parquet")
> <pyarrow._parquet.FileMetaData object at 0x7f8ae91f88e0>
>   created_by: parquet-cpp version 1.5.1-SNAPSHOT
>   num_columns: 4
>   num_rows: 55761732
>   num_row_groups: 1
>   format_version: 1.0
>   serialized_size: 3213
>
> I can provide the problematic file privately - it's around 250MB.
> {{
> [...snip...]
>     df = pd.read_parquet(data_source, columns=columns)
>   File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pandas/io/parquet.py", line 312, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pandas/io/parquet.py", line 127, in read
>     path, columns=columns, **kwargs
>   File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1704, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1582, in read
>     use_threads=use_threads
>   File "pyarrow/_dataset.pyx", line 372, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 2266, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
> Deserializing page header failed.
> }}
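
On the question of an easy way to check for on-disk damage: one cheap, partial test is to look at the file's outer framing with the standard library alone. A Parquet file starts with the 4-byte magic "PAR1" and ends with a 4-byte little-endian footer length followed by the same magic. The sketch below (check_parquet_framing is a hypothetical helper name, not a pyarrow function) only detects truncation or damage at the two ends of the file. Since read_metadata already succeeds on this file, the tail is presumably intact, and a clean result here says nothing about the page headers in the middle, which is where the reads fail.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"


def check_parquet_framing(path):
    """Return a list of problems with the file's outer framing (empty if none found)."""
    size = os.path.getsize(path)
    # Minimum layout: leading magic (4) + footer length (4) + trailing magic (4).
    if size < 12:
        return ["file is only {} bytes, too short to be Parquet".format(size)]

    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)

    problems = []
    if head != PARQUET_MAGIC:
        problems.append("leading magic is {!r}, expected {!r}".format(head, PARQUET_MAGIC))
    if tail[4:] != PARQUET_MAGIC:
        problems.append("trailing magic is {!r}, expected {!r}".format(tail[4:], PARQUET_MAGIC))

    # Footer length is stored as an unsigned 32-bit little-endian integer.
    footer_len = struct.unpack("<I", tail[:4])[0]
    if footer_len + 12 > size:
        problems.append(
            "footer length {} does not fit in a {}-byte file".format(footer_len, size)
        )
    return problems


# Possible usage (file name taken from the report above):
# print(check_parquet_framing("builtenv_vulns_bad.parquet") or "framing looks intact")
```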



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
