jorisvandenbossche commented on issue #33321:
URL: https://github.com/apache/arrow/issues/33321#issuecomment-1560826211

   The "pandas metadata" is custom metadata that we store in the pyarrow schema 
whenever the data is created from a pandas.DataFrame:
   
   ```python
   >>> import pandas as pd
   >>> import pyarrow as pa
   >>> df = pd.DataFrame({"col": pd.date_range("2012-01-01", periods=3, freq="D")})
   >>> df
            col
   0 2012-01-01
   1 2012-01-02
   2 2012-01-03
   >>> table = pa.table(df)
   >>> table.schema
   col: timestamp[ns]
   -- schema metadata --
   pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 
413
   # easier way to access this (and converted to a dict)
   >>> table.schema.pandas_metadata
   {'index_columns': [{'kind': 'range',
      'name': None,
      'start': 0,
      'stop': 3,
      'step': 1}],
    'column_indexes': [{'name': None,
      'field_name': None,
      'pandas_type': 'unicode',
      'numpy_type': 'object',
      'metadata': {'encoding': 'UTF-8'}}],
    'columns': [{'name': 'col',
      'field_name': 'col',
      'pandas_type': 'datetime',
      'numpy_type': 'datetime64[ns]',
      'metadata': None}],
    'creator': {'library': 'pyarrow', 'version': '13.0.0.dev106+gfbe5f641d'},
    'pandas_version': '2.1.0.dev0+484.g7187e67500'}
   ```
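
   Under the hood, this "pandas metadata" is just a JSON string stored under the `b"pandas"` key of the schema's key-value metadata; `pandas_metadata` is only a convenience accessor that decodes it. A minimal sketch to illustrate (using the `table` from above):
   
   ```python
   >>> import json
   >>> raw = table.schema.metadata[b"pandas"]   # the raw JSON bytes stored in the schema
   >>> json.loads(raw) == table.schema.pandas_metadata
   True
   ```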
   
   So this metadata indicates that the original data in the pandas.DataFrame had the "datetime64[ns]" dtype. In this case that matches the Arrow type, but after a roundtrip through Parquet, for example, this might no longer be the case:
   
   ```python
   >>> import pyarrow.parquet as pq
   >>> pq.write_table(table, "test.parquet")
   >>> table2 = pq.read_table("test.parquet")
   >>> table2.schema
   col: timestamp[us]                 # <--- now us instead of ns
   -- schema metadata --
   pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 413
   >>> table2.schema.pandas_metadata
   { ...
    'columns': [{'name': 'col',
      'field_name': 'col',
      'pandas_type': 'datetime',
      'numpy_type': 'datetime64[ns]',   # <--- but this still indicates ns
      'metadata': None}],
   ...
   ```
   
   So the question here is what `table2.to_pandas()` should do: use the microsecond resolution of the data, or the nanosecond resolution of the metadata?
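
   To make the two options concrete, a rough sketch of what each would correspond to on the user side (the explicit cast below is only an illustration of the extra step a metadata-following conversion would imply, not what pyarrow does internally):
   
   ```python
   # option 1: follow the Arrow data -> keep the microsecond resolution (needs pandas >= 2.0)
   >>> table2.to_pandas()
   
   # option 2: follow the pandas metadata -> cast back to nanoseconds before converting
   >>> ns_schema = pa.schema([pa.field("col", pa.timestamp("ns"))])
   >>> table2.cast(ns_schema).to_pandas()
   ```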
   
   (Note that this is also a consequence of the default Parquet format version we write not yet supporting nanoseconds; we should probably bump that default version, and then the nanoseconds would be preserved in the Parquet roundtrip.)
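
   As a sketch of that: the nanoseconds already survive today when explicitly opting in to a newer Parquet format version when writing (`version="2.6"` is the format version that added nanosecond timestamp support):
   
   ```python
   # writing with Parquet format version "2.6" keeps the nanosecond timestamps
   >>> pq.write_table(table, "test_ns.parquet", version="2.6")
   >>> pq.read_table("test_ns.parquet").schema.field("col").type   # -> timestamp[ns]
   ```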
   
   Now, I am not sure it would be easy to use the information in the pandas metadata to influence the conversion, as we typically only use the metadata after converting the actual data, to finalize the resulting pandas DataFrame (e.g. set the index, cast the column names, ...). 
   And I am also not fully sure it would actually be desirable to follow the pandas metadata, since that would involve an extra conversion step (and effectively all existing pandas metadata, e.g. in already written Parquet files, will always say nanoseconds, since until recently that was the only resolution supported by pandas).
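
   To illustrate that extra conversion step, a hypothetical user-side helper (not what pyarrow does internally) that follows the stored `numpy_type` after the conversion:
   
   ```python
   >>> def to_pandas_following_metadata(table):
   ...     """Convert to pandas, then cast datetime columns to the dtype recorded in the pandas metadata."""
   ...     meta = table.schema.pandas_metadata
   ...     df = table.to_pandas()
   ...     for col in meta["columns"]:
   ...         name = col["field_name"]
   ...         if name in df.columns and col["numpy_type"].startswith("datetime64"):
   ...             df[name] = df[name].astype(col["numpy_type"])   # the extra cast
   ...     return df
   ```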

