[ 
https://issues.apache.org/jira/browse/ARROW-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1897.
---------------------------------
    Resolution: Fixed

Resolved by PR https://github.com/apache/arrow/pull/1397

> [Python] Incorrect numpy_type for pandas metadata of Categoricals
> -----------------------------------------------------------------
>
>                 Key: ARROW-1897
>                 URL: https://issues.apache.org/jira/browse/ARROW-1897
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Tom Augspurger
>            Assignee: Phillip Cloud
>              Labels: categorical, metadata, pandas, parquet, pyarrow
>             Fix For: 0.8.0
>
>
> If I'm reading 
> http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
>  correctly, the "numpy_type" field of a `Categorical` should be the storage 
> type used for the *codes*. It looks like pyarrow always uses 'object' 
> instead.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: import pyarrow.parquet as pq
> In [4]: import io
> In [5]: import json
> In [6]: df = pd.DataFrame({"A": [1, 2]},
>    ...:                   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
>    ...:
> In [8]: sink = io.BytesIO()
>    ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>    ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
>    ...:
> Out[8]:
> {'field_name': '__index_level_0__',
>  'metadata': {'num_categories': 2, 'ordered': False},
>  'name': 'idx',
>  'numpy_type': 'object',
>  'pandas_type': 'categorical'}
> {code}
> From the spec:
> bq. The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.
> So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.
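
For reference, the value the report expects can be checked with pandas alone. The snippet below is a minimal sketch, not the change made in the PR: `codes_numpy_type` is a hypothetical helper name introduced here for illustration, and it simply returns str(dtype) of the integer codes array that backs a categorical, which is how the quoted spec defines numpy_type.

```python
import pandas as pd


def codes_numpy_type(categorical):
    """Hypothetical helper: str(dtype) of the integer codes backing a categorical."""
    return str(categorical.codes.dtype)


# Same index as in the report: two categories fit in the smallest integer type.
small = pd.CategoricalIndex(['one', 'two'], name='idx')

# Assumption for illustration: 200 distinct categories no longer fit in int8,
# so pandas stores the codes in a wider integer type.
large = pd.Categorical(list(range(200)))

print(codes_numpy_type(small))  # int8
print(codes_numpy_type(large))  # int16
```

The codes dtype depends on the number of categories, which is why the spec says the field "may be any of the supported integer categorical types" rather than naming a single type.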



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
