jorisvandenbossche commented on issue #36392:
URL: https://github.com/apache/arrow/issues/36392#issuecomment-1614196317

   This is a consequence of fastparquet writing the timedeltas as a "time" type 
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#time), 
and so pyarrow also reads this as a time64 type:
   
   ```
    import tempfile

    import pandas as pd
    import pyarrow.parquet as pq

    # illustrative dataframe with a timedelta64 column (values match the output below)
    df = pd.DataFrame(
        {"timedelta": pd.to_timedelta([1, 1, 8, 8, 2, 2, 8, 8, 8, 3], unit="D")}
    )

    with tempfile.TemporaryDirectory() as tmpdir:
        path = f"{tmpdir}/test.parquet"
        df.to_parquet(path, engine="fastparquet")
        table = pq.read_table(path)
        pq_meta = pq.read_metadata(path)
   
   >>> pq_meta.schema
   <pyarrow._parquet.ParquetSchema object at 0x7fbd2cd20d80>
   required group field_id=-1 schema {
     optional int64 field_id=-1 timedelta (Time(isAdjustedToUTC=true, 
timeUnit=microseconds));
   }
   >>> table.schema
   timedelta: time64[us]
   -- schema metadata --
   pandas: '{"column_indexes": [{"field_name": null, "metadata": null, "name' + 
429
   ```
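   To make the mismatch concrete: a time64 value converts to a `datetime.time`, which can only represent a time of day (strictly less than 24 hours), while the stored values here are multiples of whole days. A minimal illustration with an in-range value:

   ```python
   import datetime

   import pyarrow as pa

   # 3_600_000_000 microseconds == 01:00:00, a valid time-of-day value
   arr = pa.array([3_600_000_000], type=pa.int64()).cast(pa.time64("us"))
   print(arr.to_pylist())  # -> [datetime.time(1, 0)]
   ```

   The day-sized integers in the file above fall outside this representable range, which is why the conversion to python objects goes wrong.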
   
   But on conversion to python/pandas, pyarrow then (as expected for a time64 column) tries to create `datetime.time` objects, not `datetime.timedelta`. And for times of day, the underlying integer values are far too big. If we manually cast to int64 and then to duration (pyarrow's timedelta type), we can see that the values themselves are still correct:
   
   ```
   >>> table["timedelta"].cast("int64")
   <pyarrow.lib.ChunkedArray object at 0x7fbd26196a20>
   [
     [
       86400000000,
       86400000000,
       691200000000,
       691200000000,
       172800000000,
       172800000000,
       691200000000,
       691200000000,
       691200000000,
       259200000000
     ]
   ]
   
   >>> table["timedelta"].cast("int64").cast("duration[us]").to_pandas()
   0   1 days
   1   1 days
   2   8 days
   3   8 days
   4   2 days
   5   2 days
   6   8 days
   7   8 days
   8   8 days
   9   3 days
   dtype: timedelta64[ns]
   ```
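   As a workaround, that double cast can be wrapped in a small helper (a sketch; `restore_timedelta` is a hypothetical name, not a pyarrow API):

   ```python
   import pyarrow as pa

   def restore_timedelta(table: pa.Table, column: str) -> pa.Table:
       """Reinterpret a time column (a mis-typed timedelta) as a duration column."""
       idx = table.schema.get_field_index(column)
       typ = table.schema.field(idx).type
       fixed = table[column].cast(pa.int64()).cast(pa.duration(typ.unit))
       return table.set_column(idx, column, fixed)

   # example: a time64 column that really holds elapsed microseconds
   arr = pa.array([3_600_000_000], type=pa.int64()).cast(pa.time64("us"))
   t = pa.table({"timedelta": arr})
   print(restore_timedelta(t, "timedelta")["timedelta"].to_pylist())
   # -> [datetime.timedelta(seconds=3600)]
   ```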
   
   Fastparquet does store metadata about the original pandas dataframe:
   
   ```
   >>> table.schema.pandas_metadata
   {'column_indexes': [{'field_name': None,
      'metadata': None,
      'name': None,
      'numpy_type': 'object',
      'pandas_type': 'mixed-integer'}],
    'columns': [{'field_name': 'timedelta',
      'metadata': None,
      'name': 'timedelta',
      'numpy_type': 'timedelta64[ns]',
      'pandas_type': 'timedelta64'}],
    'creator': {'library': 'fastparquet', 'version': '0.8.3'},
    'index_columns': [{'kind': 'range',
      'name': None,
      'start': 0,
      'step': 1,
      'stop': 10}],
    'pandas_version': '2.1.0.dev0+976.g870a504af9',
    'partition_columns': []}
   ``` 
   
   and this metadata indicates that the original column was timedelta64, so in theory pyarrow _could_ use that information to restore the original pandas DataFrame when converting the table to pandas. However, we typically only consult that metadata when the data itself leaves the conversion ambiguous (and, in addition, to restore the column/row indices). In this case pyarrow sees a proper time64 type, which has a clear, unambiguous mapping to python (i.e. ``datetime.time``).
   

