Hi,
I’m experiencing a problem reading Parquet files written with the
`use_dictionary=[...]` option in pyarrow 2.0.0. If I write a Parquet file
with 2.0.0 and read it with 8.0.0, I get:
>>> pd.read_parquet('dataset.parq')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2443, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 304, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2549, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Unexpected end of stream
>
It’s easy to replicate (sample Parquet file:
https://www.dropbox.com/s/portxgch3fpovnz/test2.parq?dl=0, or a gist to create
your own: https://gist.github.com/bivald/f93448eaf25808284c4029c691a58a6a)
with the following schema:
schema = pa.schema([
    ("col1", pa.int8()),
    ("col2", pa.string()),
    ("col3", pa.float64()),
    ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
])
Actually, opening the file as a ParquetFile works (as long as I don’t read
the row group):

<pyarrow._parquet.FileMetaData object at 0x7f79c134c360>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 4
  num_rows: 5
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 858
Is there any way to make pyarrow==8.0.0 read these Parquet files, or at
least a way to convert them so that 8.0.0 can read them? Writing without
use_dictionary works, but unfortunately I already have hundreds of
gigabytes of these Parquet files across many environments.
If I write the file using pyarrow==3.0.0 I can read it in every version
from 3.0.0 through 8.0.0, but not in 2.0.0.
Regards,
Niklas
Full sample code:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("col1", pa.int8()),
    ("col2", pa.string()),
    ("col3", pa.float64()),
    ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
])

table = pa.table([
    [1, 2, 3, 4, 5],
    ["a", "b", "c", "d", "e"],
    [1.0, 2.0, 3.0, 4.0, 5.0],
    ["a", "a", "a", "b", "b"]
], schema=schema)

output_file = 'test2.parq'

with pq.ParquetWriter(
    output_file,
    schema,
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',            # highest available format version
    data_page_version='2.0',  # highest available data page version
    # Convert these columns to categorical values; must be bytes keys, as seen on
    # https://stackoverflow.com/questions/56377848/writing-stream-of-big-data-to-parquet-with-python
    use_dictionary=[category.encode('utf-8') for category in ['col4']],
) as writer:
    writer.write_table(
        table,
        row_group_size=10000
    )