[ 
https://issues.apache.org/jira/browse/ARROW-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5104:
-----------------------------------------
    Fix Version/s: 0.14.0

> [Python/C++] Schema for empty tables include index column as integer
> --------------------------------------------------------------------
>
>                 Key: ARROW-5104
>                 URL: https://issues.apache.org/jira/browse/ARROW-5104
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.13.0
>            Reporter: Florian Jetter
>            Priority: Minor
>             Fix For: 0.14.0
>
>
> The schema for an empty table/dataframe still includes the index as an 
> integer column instead of serializing it solely as a metadata reference 
> (see ARROW-1639).
> In the example below, the empty dataframe still holds `__index_level_0__` as 
> an integer column. The proper behavior would be to exclude it and reference the 
> index information in the pandas metadata, as is the case for a non-empty 
> dataframe.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: non_empty =  pd.DataFrame({"col": [1]})
> In [4]: empty = non_empty.drop(0)
> In [5]: empty
> Out[5]:
> Empty DataFrame
> Columns: [col]
> Index: []
> In [6]: pa.Table.from_pandas(non_empty)
> Out[6]:
> pyarrow.Table
> col: int64
> metadata
> --------
> OrderedDict([(b'pandas',
>               b'{"index_columns": [{"kind": "range", "name": null, "start": '
>               b'0, "stop": 1, "step": 1}], "column_indexes": [{"name": null,'
>               b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
>               b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
>               b'{"name": "col", "field_name": "col", "pandas_type": "int64",'
>               b' "numpy_type": "int64", "metadata": null}], "creator": {"lib'
>               b'rary": "pyarrow", "version": "0.13.0"}, "pandas_version": nu'
>               b'll}')])
> In [7]: pa.Table.from_pandas(empty)
> Out[7]:
> pyarrow.Table
> col: int64
> __index_level_0__: int64
> metadata
> --------
> OrderedDict([(b'pandas',
>               b'{"index_columns": ["__index_level_0__"], "column_indexes": ['
>               b'{"name": null, "field_name": null, "pandas_type": "unicode",'
>               b' "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}]'
>               b', "columns": [{"name": "col", "field_name": "col", "pandas_t'
>               b'ype": "int64", "numpy_type": "int64", "metadata": null}, {"n'
>               b'ame": null, "field_name": "__index_level_0__", "pandas_type"'
>               b': "int64", "numpy_type": "int64", "metadata": null}], "creat'
>               b'or": {"library": "pyarrow", "version": "0.13.0"}, "pandas_ve'
>               b'rsion": null}')])
> In [8]: pa.__version__
> Out[8]: '0.13.0'
> In [9]: ! python --version
> Python 3.6.7
> {code}
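
The difference between the two outputs above can be checked programmatically by inspecting the `index_columns` entry of the pandas metadata. The following is a minimal sketch using only the standard library: the helper name `index_storage` is hypothetical, and the two metadata byte strings are trimmed copies of the ones printed in the report (only the `index_columns` key is kept). With pyarrow installed, the same bytes should be available as `table.schema.metadata[b'pandas']`.

```python
import json


def index_storage(pandas_metadata: bytes) -> list:
    """Classify each index as 'metadata-only' or 'column'.

    Hypothetical helper for illustration: a dict entry with
    kind == "range" means the index is described purely in metadata;
    a string entry names a real column serialized in the table.
    """
    meta = json.loads(pandas_metadata.decode("utf-8"))
    kinds = []
    for entry in meta["index_columns"]:
        if isinstance(entry, dict) and entry.get("kind") == "range":
            kinds.append("metadata-only")
        else:
            kinds.append("column")
    return kinds


# Trimmed from Out[6] in the report (non-empty dataframe).
non_empty_meta = (
    b'{"index_columns": [{"kind": "range", "name": null, '
    b'"start": 0, "stop": 1, "step": 1}]}'
)

# Trimmed from Out[7] in the report (empty dataframe).
empty_meta = b'{"index_columns": ["__index_level_0__"]}'

print(index_storage(non_empty_meta))
print(index_storage(empty_meta))
```

Under the behavior requested in this issue, both calls would be expected to report a metadata-only index; with the 0.13.0 output shown above, the empty dataframe reports a serialized column instead.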



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
