hi -- I git-bisected and found the backwards-compat regression, and
reported it here:

https://issues.apache.org/jira/browse/ARROW-17100

On Wed, Jul 6, 2022 at 4:16 PM Wes McKinney <[email protected]> wrote:
>
> hi — did you ever resolve this issue? We should try to identify what
> is causing this failure and see if it can be fixed for the 9.0.0
> release.
>
>
> On Tue, Jun 14, 2022 at 8:18 AM Niklas Bivald <[email protected]> 
> wrote:
> >
> > Hi,
> >
> > I’m experiencing a problem reading parquet files written with the
> > `use_dictionary=[]` option in pyarrow 2.0.0. If I write a parquet file
> > with 2.0.0, reading it in 8.0.0 gives:
> >
> > >>> pd.read_parquet('dataset.parq')
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 493, in read_parquet
> >     return impl.read(
> >   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 240, in read
> >     result = self.api.parquet.read_table(
> >   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
> >     return dataset.read(columns=columns, use_threads=use_threads,
> >   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2443, in read
> >     table = self._dataset.to_table(
> >   File "pyarrow/_dataset.pyx", line 304, in pyarrow._dataset.Dataset.to_table
> >   File "pyarrow/_dataset.pyx", line 2549, in pyarrow._dataset.Scanner.to_table
> >   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
> >   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> > OSError: Unexpected end of stream
> >
> > It’s easy to replicate (link to a sample parquet file:
> > https://www.dropbox.com/s/portxgch3fpovnz/test2.parq?dl=0 or a gist to
> > create your own: https://gist.github.com/bivald/f93448eaf25808284c4029c691a58a6a)
> > with the following schema:
> >
> > schema = pa.schema([
> >     ("col1", pa.int8()),
> >     ("col2", pa.string()),
> >     ("col3", pa.float64()),
> >     ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
> > ])
> >
> > Actually opening the file as a ParquetFile works (as long as I don’t read
> > the row group):
> >
> > <pyarrow._parquet.FileMetaData object at 0x7f79c134c360>
> >   created_by: parquet-cpp version 1.5.1-SNAPSHOT
> >   num_columns: 4
> >   num_rows: 5
> >   num_row_groups: 1
> >   format_version: 2.6
> >   serialized_size: 858
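> >
> > For completeness, this is roughly how the metadata above was produced;
> > reading the row group is what actually triggers the error:
> >
> >  import pyarrow.parquet as pq
> >
> >  pf = pq.ParquetFile('test2.parq')
> >  print(pf.metadata)      # works: only the footer metadata is parsed
> >  pf.read_row_group(0)    # raises OSError: Unexpected end of stream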
> >
> > Is there any way to make pyarrow==8.0.0 read these parquet files? Or,
> > failing that, a way to convert them from the 2.0.0 format to one that
> > 8.0.0 can read? Writing without use_dictionary works, but unfortunately
> > I already have hundreds of gigabytes of these parquet files across a
> > lot of environments.
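> >
> > One workaround I'm considering is to rewrite each file in an
> > environment pinned to pyarrow==2.0.0, which can still read its own
> > output. A minimal sketch (assuming the 2.0.0 reader handles these
> > files; the output file name is just illustrative):
> >
> >  # Run under pyarrow==2.0.0; rewriting with write_table's defaults
> >  # avoids the byte-keyed use_dictionary option the files were
> >  # originally written with.
> >  import pyarrow.parquet as pq
> >
> >  table = pq.read_table('test2.parq')
> >  pq.write_table(table, 'test2.converted.parq', compression='snappy')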
> >
> > If I write the file using pyarrow==3.0.0, I can read it in every
> > version from 3.0.0 through 8.0.0, but not in 2.0.0.
> >
> > Regards,
> > Niklas
> >
> > Full sample code:
> >
> >  import pyarrow as pa
> >  import pyarrow.parquet as pq
> >
> >  schema = pa.schema([
> >      ("col1", pa.int8()),
> >      ("col2", pa.string()),
> >      ("col3", pa.float64()),
> >      ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
> >  ])
> >
> >  table = pa.table([
> >      [1, 2, 3, 4, 5],
> >      ["a", "b", "c", "d", "e"],
> >      [1.0, 2.0, 3.0, 4.0, 5.0],
> >      ["a", "a", "a", "b", "b"]
> >  ], schema=schema)
> >
> >  output_file = 'test2.parq'
> >
> >  with pq.ParquetWriter(
> >          output_file,
> >          schema,
> >          compression='snappy',
> >          allow_truncated_timestamps=True,
> >          version='2.0',  # Highest available format version
> >          data_page_version='2.0',  # Highest available data page version
> >          # Convert these columns to categorical values; keys must be bytes, as seen on
> >          # https://stackoverflow.com/questions/56377848/writing-stream-of-big-data-to-parquet-with-python
> >          use_dictionary=[category.encode('utf-8') for category in ['col4']],
> >      ) as writer:
> >          writer.write_table(
> >              table,
> >              row_group_size=10000
> >          )
