hi -- I git-bisected and found the backwards-compat regression, and reported here
https://issues.apache.org/jira/browse/ARROW-17100

On Wed, Jul 6, 2022 at 4:16 PM Wes McKinney <[email protected]> wrote:
>
> hi — did you ever resolve this issue? We should try to identify what
> is causing this failure and see if it can be fixed for the 9.0.0
> release.
>
>
> On Tue, Jun 14, 2022 at 8:18 AM Niklas Bivald <[email protected]> wrote:
> >
> > Hi,
> >
> > I’m experiencing problems reading parquet files written with the
> > `use_dictionary=[]` option in pyarrow 2.0.0. If I write a parquet file in
> > 2.0.0, reading it in 8.0.0 gives:
> >
> > >>> pd.read_parquet('dataset.parq')
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 493, in read_parquet
> >     return impl.read(
> >   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 240, in read
> >     result = self.api.parquet.read_table(
> >   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
> >     return dataset.read(columns=columns, use_threads=use_threads,
> >   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2443, in read
> >     table = self._dataset.to_table(
> >   File "pyarrow/_dataset.pyx", line 304, in pyarrow._dataset.Dataset.to_table
> >   File "pyarrow/_dataset.pyx", line 2549, in pyarrow._dataset.Scanner.to_table
> >   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
> >   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> > OSError: Unexpected end of stream
> >
> > It’s easy to replicate (link to sample parquet:
> > https://www.dropbox.com/s/portxgch3fpovnz/test2.parq?dl=0 or gist to create
> > your own: https://gist.github.com/bivald/f93448eaf25808284c4029c691a58a6a)
> > with the following schema:
> >
> > schema = pa.schema([
> >     ("col1", pa.int8()),
> >     ("col2", pa.string()),
> >     ("col3", pa.float64()),
> >     ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
> > ])
> >
> > Actually, opening the file as a ParquetFile works (as long as I don’t read
> > the row group):
> >
> > <pyarrow._parquet.FileMetaData object at 0x7f79c134c360>
> >   created_by: parquet-cpp version 1.5.1-SNAPSHOT
> >   num_columns: 4
> >   num_rows: 5
> >   num_row_groups: 1
> >   format_version: 2.6
> >   serialized_size: 858
> >
> > Is there any way to make pyarrow==8.0.0 read these parquet files? Or at
> > least figure out a way to convert them from 2 to 8? Not using the
> > use_dictionary option works, but unfortunately I already have hundreds of
> > gigabytes of these parquet files across a lot of environments.
> >
> > If I write it using pyarrow==3.0.0 I can read it all the way from 3 to
> > 8.0.0, but not 2.0.0.
> >
> > Regards,
> > Niklas
> >
> > Full sample code:
> >
> > import pyarrow as pa
> > import pyarrow.parquet as pq
> >
> > schema = pa.schema([
> >     ("col1", pa.int8()),
> >     ("col2", pa.string()),
> >     ("col3", pa.float64()),
> >     ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
> > ])
> >
> > table = pa.table([
> >     [1, 2, 3, 4, 5],
> >     ["a", "b", "c", "d", "e"],
> >     [1.0, 2.0, 3.0, 4.0, 5.0],
> >     ["a", "a", "a", "b", "b"]
> > ], schema=schema)
> >
> > output_file = 'test2.parq'
> >
> > with pq.ParquetWriter(
> >     output_file,
> >     schema,
> >     compression='snappy',
> >     allow_truncated_timestamps=True,
> >     version='2.0',            # Highest available schema
> >     data_page_version='2.0',  # Highest available schema
> >     # Convert these columns to categorical values, must be bytes keys as seen on
> >     # https://stackoverflow.com/questions/56377848/writing-stream-of-big-data-to-parquet-with-python
> >     use_dictionary=[category.encode('utf-8') for category in ['col4']],
> > ) as writer:
> >     writer.write_table(
> >         table,
> >         row_group_size=10000
> >     )
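In case it helps anyone reproducing this, here is a minimal sketch of the metadata check described above (assuming the sample file test2.parq from the Dropbox link is in the working directory). Opening the file only parses the footer, which is why it succeeds on the affected files; per the report, the failure only shows up once a row group is actually read:

    import pyarrow.parquet as pq

    # Opening the file and reading the footer metadata works even on the
    # affected files.
    pf = pq.ParquetFile('test2.parq')
    print(pf.metadata)       # the FileMetaData block shown above
    print(pf.schema_arrow)   # Arrow schema, including the dictionary column

    # Reading the data is what fails on the affected files, per the report:
    # table = pf.read()      # OSError: Unexpected end of stream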

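Until the regression tracked in ARROW-17100 is fixed, one possible stopgap for converting already-written files is to rewrite them without the explicit per-column use_dictionary list, since files written that way are reported above to read fine in 8.0.0. This is only a sketch, and it assumes an environment pinned to pyarrow==2.0.0 that can still read the affected files:

    # Run in an environment with pyarrow==2.0.0 installed (assumption: the
    # version that wrote the files can still read them back).
    import pyarrow.parquet as pq

    def rewrite(src, dst):
        table = pq.read_table(src)
        # Write back with default dictionary encoding instead of the explicit
        # bytes-key use_dictionary list that triggers the incompatibility.
        pq.write_table(table, dst, compression='snappy', version='2.0')

    # Hypothetical paths, adjust as needed.
    rewrite('test2.parq', 'test2-rewritten.parq')

Given the hundreds of gigabytes involved this is a last resort, but it may unblock environments that need to move to 8.0.0 before a fix lands (hopefully in 9.0.0, per the note above).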