Hi, did you ever resolve this issue? We should try to identify what is causing this failure and see if it can be fixed for the 9.0.0 release.
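In the meantime, a possible stopgap for the conversion question below. This is only a minimal, untested sketch: it assumes an environment pinned to pyarrow==2.0.0 (which, per the report, can still read these files), and it rewrites each file with the default use_dictionary=True, since files written without the byte-keyed list reportedly read fine in 8.0.0. The 'data' directory, the '*.parq' glob, and the '_converted' suffix are placeholders, not anything from the report:

    import pathlib
    import pyarrow.parquet as pq

    # Run this under pyarrow==2.0.0. Each affected file is read (which still
    # works in 2.0.0) and rewritten with default dictionary encoding, which
    # the report says newer pyarrow versions can read.
    for src in pathlib.Path('data').rglob('*.parq'):       # placeholder layout
        table = pq.read_table(str(src))
        dst = src.with_name(src.stem + '_converted.parq')  # placeholder naming
        pq.write_table(table, str(dst), compression='snappy')

With hundreds of gigabytes across many environments this is obviously not free, but it avoids needing a fix in 8.x to get the data readable again.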
On Tue, Jun 14, 2022 at 8:18 AM Niklas Bivald <[email protected]> wrote:
>
> Hi,
>
> I'm experiencing problems reading parquet files written with the
> `use_dictionary=[]` option in pyarrow 2.0.0. If I write a parquet file in
> 2.0.0, reading it in 8.0.0 gives:
>
> >>> pd.read_parquet('dataset.parq')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 493, in read_parquet
>     return impl.read(
>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 240, in read
>     result = self.api.parquet.read_table(
>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2443, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 304, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 2549, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> OSError: Unexpected end of stream
>
> It's easy to replicate (link to a sample parquet file:
> https://www.dropbox.com/s/portxgch3fpovnz/test2.parq?dl=0, or a gist to
> create your own: https://gist.github.com/bivald/f93448eaf25808284c4029c691a58a6a)
> with the following schema:
>
> schema = pa.schema([
>     ("col1", pa.int8()),
>     ("col2", pa.string()),
>     ("col3", pa.float64()),
>     ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
> ])
>
> Opening the file as a ParquetFile actually works (as long as I don't read
> the row group):
>
> <pyarrow._parquet.FileMetaData object at 0x7f79c134c360>
>   created_by: parquet-cpp version 1.5.1-SNAPSHOT
>   num_columns: 4
>   num_rows: 5
>   num_row_groups: 1
>   format_version: 2.6
>   serialized_size: 858
>
> Is there any way to make pyarrow==8.0.0 read these parquet files? Or at
> least, is there a way to convert them from 2 to 8? Not using use_dictionary
> works, but unfortunately I already have hundreds of gigabytes of these
> parquet files across a lot of environments.
>
> If I write the file using pyarrow==3.0.0 I can read it all the way from
> 3.0.0 to 8.0.0, but not in 2.0.0.
>
> Regards,
> Niklas
>
> Full sample code:
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> schema = pa.schema([
>     ("col1", pa.int8()),
>     ("col2", pa.string()),
>     ("col3", pa.float64()),
>     ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
> ])
>
> table = pa.table([
>     [1, 2, 3, 4, 5],
>     ["a", "b", "c", "d", "e"],
>     [1.0, 2.0, 3.0, 4.0, 5.0],
>     ["a", "a", "a", "b", "b"]
> ], schema=schema)
>
> output_file = 'test2.parq'
>
> with pq.ParquetWriter(
>     output_file,
>     schema,
>     compression='snappy',
>     allow_truncated_timestamps=True,
>     version='2.0',            # Highest available format version
>     data_page_version='2.0',  # Highest available data page version
>     # Convert these columns to categorical values; keys must be bytes, as seen on
>     # https://stackoverflow.com/questions/56377848/writing-stream-of-big-data-to-parquet-with-python
>     use_dictionary=[category.encode('utf-8') for category in ['col4']],
> ) as writer:
>     writer.write_table(
>         table,
>         row_group_size=10000
>     )
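For reference, the reported failure mode can be checked without a full read. This is just the report above restated as a runnable snippet, assuming the test2.parq produced by the sample code:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile('test2.parq')
    print(pf.metadata)    # works even on 8.0.0: the metadata alone is readable
    # On 8.0.0 the next line raises OSError: Unexpected end of stream;
    # on 2.0.0 it succeeds.
    table = pq.read_table('test2.parq')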
