cvm-a opened a new issue, #39914:
URL: https://github.com/apache/arrow/issues/39914
### Describe the bug, including details regarding any error messages, version, and platform.
Simple repro:
```
import pyarrow as pa
import pandas as pd
pd.DataFrame({"x":pa.array(pd.Series([[2.2]*5]*10)).to_pandas(types_mapper=pd.ArrowDtype)}).to_parquet("/tmp/list4.pqt")
df2 = pd.read_parquet("/tmp/list4.pqt", dtype_backend="pyarrow")
```
This fails with:
```
File ~/<redacted>/lib/python3.11/site-packages/pandas/io/parquet.py:667, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs)
    664 use_nullable_dtypes = False
    665 check_dtype_backend(dtype_backend)
--> 667 return impl.read(
    668     path,
    669     columns=columns,
    670     filters=filters,
    671     storage_options=storage_options,
    672     use_nullable_dtypes=use_nullable_dtypes,
    673     dtype_backend=dtype_backend,
    674     filesystem=filesystem,
    675     **kwargs,
    676 )
File ~/<redacted>/lib/python3.11/site-packages/pandas/io/parquet.py:281, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs)
    273 try:
    274     pa_table = self.api.parquet.read_table(
    275         path_or_handle,
    276         columns=columns,
    (...)
    279         **kwargs,
    280     )
--> 281 result = pa_table.to_pandas(**to_pandas_kwargs)
    283 if manager == "array":
    284     result = result._as_manager("array", copy=False)
File ~/<redacted>/lib/python3.11/site-packages/pyarrow/array.pxi:884, in pyarrow.lib._PandasConvertible.to_pandas()
File ~/<redacted>/lib/python3.11/site-packages/pyarrow/table.pxi:4251, in pyarrow.lib.Table._to_pandas()
File ~/<redacted>/lib/python3.11/site-packages/pyarrow/pandas_compat.py:769, in table_to_dataframe(options, table, categories, ignore_metadata, types_mapper)
    766 table = _add_any_metadata(table, pandas_metadata)
    767 table, index = _reconstruct_index(table, index_descriptors,
    768     all_columns, types_mapper)
--> 769 ext_columns_dtypes = _get_extension_dtypes(
    770     table, all_columns, types_mapper)
    771 else:
    772 index = _pandas_api.pd.RangeIndex(table.num_rows)
File ~/<redacted>/lib/python3.11/site-packages/pyarrow/pandas_compat.py:828, in _get_extension_dtypes(table, columns_metadata, types_mapper)
    823 dtype = col_meta['numpy_type']
    825 if dtype not in _pandas_supported_numpy_types:
    826     # pandas_dtype is expensive, so avoid doing this for types
    827     # that are certainly numpy dtypes
--> 828 pandas_dtype = _pandas_api.pandas_dtype(dtype)
    829 if isinstance(pandas_dtype, _pandas_api.extension_dtype):
    830     if hasattr(pandas_dtype, "__from_arrow__"):
File ~/<redacted>/lib/python3.11/site-packages/pyarrow/pandas-shim.pxi:141, in pyarrow.lib._PandasAPIShim.pandas_dtype()
File ~/<redacted>/lib/python3.11/site-packages/pyarrow/pandas-shim.pxi:144, in pyarrow.lib._PandasAPIShim.pandas_dtype()
File ~/<redacted>/lib/python3.11/site-packages/pandas/core/dtypes/common.py:1630, in pandas_dtype(dtype)
   1625 with warnings.catch_warnings():
   1626     # GH#51523 - Series.astype(np.integer) doesn't show
   1627     # numpy deprecation warning of np.integer
   1628     # Hence enabling DeprecationWarning
   1629     warnings.simplefilter("always", DeprecationWarning)
-> 1630 npdtype = np.dtype(dtype)
   1631 except SyntaxError as err:
   1632     # np.dtype uses `eval` which can raise SyntaxError
   1633     raise TypeError(f"data type '{dtype}' not understood") from err
TypeError: data type 'list<item: double>[pyarrow]' not understood
```
Environment:
- OS: macOS, Darwin Kernel Version 22.1.0
- Python: 3.11.6
- pandas: 2.2.0
- pyarrow: 15.0.0
The same error is raised even if we use `pd.read_parquet("/tmp/list4.pqt", dtype_backend="numpy_nullable")`.
The non-Arrow-backed column version
```
import pyarrow as pa
import pandas as pd
pd.DataFrame({"x":pd.Series([[2.2]*5]*10)}).to_parquet("/tmp/list2.pqt")
df2 = pd.read_parquet("/tmp/list2.pqt", dtype_backend="pyarrow")
```
is read back correctly, but the column in the new dataframe is Arrow-backed, so the data does not survive a further round trip.
I did some further digging into the Parquet pandas metadata and found that for the Parquet file written from the Arrow-backed table, we have
```
{'index_columns': [{'kind': 'range',
'name': None,
'start': 0,
'stop': 10,
'step': 1}],
'column_indexes': [{'name': None,
'field_name': None,
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': {'encoding': 'UTF-8'}}],
'columns': [{'name': 'x',
'field_name': 'x',
'pandas_type': 'list[float64]',
'numpy_type': 'list<element: double>[pyarrow]',
'metadata': None}],
'creator': {'library': 'pyarrow', 'version': '15.0.0'},
'pandas_version': '2.2.0'}
```
whereas for the numpy-based dataframe, the output is:
```
{'index_columns': [{'kind': 'range',
'name': None,
'start': 0,
'stop': 10,
'step': 1}],
'column_indexes': [{'name': None,
'field_name': None,
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': {'encoding': 'UTF-8'}}],
'columns': [{'name': 'x',
'field_name': 'x',
'pandas_type': 'list[float64]',
'numpy_type': 'object',
'metadata': None}],
'creator': {'library': 'pyarrow', 'version': '15.0.0'},
'pandas_version': '2.2.0'}
```
The problem seems to be that `numpy_type` for the Arrow-backed case is set to `'list<element: double>[pyarrow]'`, a string that `np.dtype` cannot parse, rather than `'object'` or a valid NumPy dtype string.
### Component(s)
Parquet, Python